Lindley (1957) [244]
Lindley's paradox: For a test of
$H_{0}:\theta=\theta_{0}$
vs.
$H_{1}: \theta \neq \theta_{0}$
which is just significant at a fixed level, the
posterior probability $P(H_{0}|y)$
grows
with $n$
. Discussion of the reasons for this, of priors, and of
different kinds of null hypothesis: whether we believe they can be
exactly true, i.e. whether $\theta_{0}$
is fundamentally
different from any $\theta\ne \theta_{0}$
.
Effect of sample size on p-value vs.
$P(H_{0}|y)$
. In a comment, Bartlett (1957)
[27] discusses the possibility of letting the sample
size depend on the distance between null and (single) alternative,
implicitly raising the possibility of letting the prior depend on sample
size. He also points out an error; the corrected result shows that the
prior may not be improper, i.e. must be proper.
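As a numerical illustration of the paradox (illustrative choices, not from the paper): take a normal mean with known $\sigma$, $H_{0}:\theta=0$, a $N(0,\tau^{2})$ prior on $\theta$ under $H_{1}$, equal prior odds, and hold the test statistic fixed at $z=1.96$, just significant at the 5% level:

```python
import math

def post_prob_h0(n, z=1.96, sigma=1.0, tau=1.0, pi0=0.5):
    """P(H0 | y) when the observed z-statistic is just significant.

    Illustrative setup (not from the paper): normal mean, known sigma,
    H0: theta = 0, N(0, tau^2) prior on theta under H1, prior odds
    pi0/(1 - pi0).  Under H0, ybar ~ N(0, sigma^2/n); under H1,
    ybar ~ N(0, sigma^2/n + tau^2).
    """
    # Bayes factor B01 evaluated at ybar = z * sigma / sqrt(n)
    b01 = math.sqrt(1 + n * tau**2 / sigma**2) * \
        math.exp(-0.5 * z**2 * n * tau**2 / (sigma**2 + n * tau**2))
    odds = (pi0 / (1 - pi0)) * b01   # posterior odds of H0
    return odds / (1 + odds)

for n in (10, 100, 1000, 100000):
    print(n, round(post_prob_h0(n), 3))
```

With the p-value pinned at 0.05, $P(H_{0}|y)$ climbs from about 0.37 at $n=10$ towards 1 as $n$ grows.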
Edwards et al. (1963) [118] A review of Bayesian statistics (with a personal probability point of view) for a psychological audience. Includes a long section on hypothesis testing, discussing the often contradictory results of classical and Bayesian testing.
DeGroot (1973) [106] Describes specific classes of alternative hypotheses and prior distributions, for which p-value is approximately equal to the probability that the alternative hypothesis is true.
Dickey (1977) [111] Considers whether the p-value is, in general, a good approximation of the Bayes factor (it is not).
Shafer (1982) [334] Nice description of Lindley's paradox [244] and treatment of the same situation using the theory of belief functions. Example from forensic science (comparing refractive indices of fragments of glass) and discussion of the empirical prior in it. In discussion, see e.g. De Groot, Good and Hill.
Berger and Sellke (1987) [35]
Comparing p-values and BF for point null hypotheses
$H_{0}: \theta=\theta_{0}$
vs.
$H_{1}: \theta\neq \theta_{0}$
. Computes
lower bounds for $p(H_{0}|x)$
for different
classes of priors (for $\theta$
under
$H_{1}$
) and shows that this is (in the
examples considered) always higher than the corresponding p-value, i.e.
p-value overstates the evidence against
$H_{0}$
. C.f.
Casella and Berger (1987) [74]. In
discussion, further comments on p-values vs. Bayesian tests, the role
of point null hypotheses etc. (see e.g. Hinkley on model selection and
Vardeman on priors implied by point nulls).
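The headline bound can be reproduced directly (a sketch; the helper names are mine): for a normal point null, the Bayes factor in favour of $H_{0}$ over all priors under $H_{1}$ is bounded below by $f(x|\theta_{0})/\sup_{\theta}f(x|\theta)=e^{-z^{2}/2}$; with prior probability 1/2 on $H_{0}$ this yields a lower bound on $P(H_{0}|x)$:

```python
import math

def two_sided_pvalue(z):
    """Two-sided normal p-value, P(|Z| >= z)."""
    return math.erfc(z / math.sqrt(2))

def lower_bound_post_h0(z, pi0=0.5):
    """Lower bound on P(H0 | x) over all priors on theta under H1,
    for a normal point null: the Bayes factor satisfies
    B >= f(x | theta0) / sup_theta f(x | theta) = exp(-z^2 / 2)."""
    b_lower = math.exp(-z**2 / 2)
    odds = (pi0 / (1 - pi0)) * b_lower
    return odds / (1 + odds)

for z in (1.645, 1.96, 2.576):
    print(round(two_sided_pvalue(z), 3), round(lower_bound_post_h0(z), 3))
```

At $z=1.96$ (p $\approx$ 0.05) the bound is about 0.128, more than twice the p-value, which is the sense in which the p-value overstates the evidence against $H_{0}$.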
Casella and Berger (1987) [74]
Same problem as in Berger and Sellke (1987)
[35], but now for the one-sided
hypothesis $H_{0}: \theta\leq 0$
vs.
$H_{1}: \theta>0$
. In this case p-values
$p(x)$
and Bayesian tests can be
`reconciled', i.e.
$\inf P(H_{0}|x)\leq p(x)$
for a very large class of priors. Argues that a point
prior cannot be regarded as `impartial' because it concentrates
probability mass at a point. [Note that this is a different Berger than
in [35]]
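The one-sided reconciliation can be seen in the simplest normal example (a sketch using the flat prior, which attains the bound): with an improper uniform prior on $\theta$, the posterior is $N(\bar{x},\sigma^{2}/n)$, so $P(\theta\leq 0|x)=\Phi(-z)$, exactly the one-sided p-value:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def one_sided_pvalue(z):
    """p(x) = P(Z >= z) for testing H0: theta <= 0."""
    return 1.0 - phi(z)

def post_prob_h0_flat(z):
    """P(theta <= 0 | x) under a flat prior on theta: the posterior
    is N(xbar, sigma^2/n), so the probability is Phi(-z)."""
    return phi(-z)

for z in (0.5, 1.0, 1.645, 1.96):
    print(z, round(one_sided_pvalue(z), 4), round(post_prob_h0_flat(z), 4))
```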
Verdinelli and Wasserman (1996) Shows that under some conditions the BF for a precise null hypothesis approximates the BF for an `imprecise' (i.e. short-interval) hypothesis, even in the presence of nuisance parameters. Suggests priors based on this result.
Weakliem (1998b) [380]
Significance levels in classical and Bayesian (BF) hypothesis testing.
For sharp null hypotheses ($\theta=0$
vs. $\theta\neq 0$
) BF is
generally more conservative than classical tests: a just
significant result for a classical test is typically only weak evidence
for the alternative hypothesis or could even be evidence against it
(with appropriate choice of prior BF can be arbitrarily strongly in
favour of the null). To reduce the disagreement, W proposes the use of
one-sided hypotheses ($\theta>0$
vs. $\theta<0$
). Decision criteria
tested using data from GSS: tests applied to a subsample and
conclusions assessed by comparing to results in the full data. Tests of
sharp hypotheses do less well (the conclusion changes; most often the
null hypothesis is not rejected in the small sample but is rejected when
more data become available) than one-sided hypotheses (conclusion changes rarely).
C.f. [74]. [It is not clear to me how this
would be applicable to model selection. If (paraphrasing) `no effects
are ever zero', should we always entertain the largest possible model
everywhere? This is again related to the question of the nature of
(sharp) null hypotheses (interesting in themselves or not).]
Efron and Gous (1998) [123]
Scales of evidence for significance tests and BF, with an attempt to
`reconcile' the two. Priors coincide with conventional rules for
significance tests; uses an approximate formula
$\baf(D)\approx L^{2}(D)/L^{2}(D_{0})$
where
$D_{0}$
is hypothetical data such that
$\baf(D_{0})=1$
. Discussion of Bayesian
`sample size coherency' (problem specification, including the prior,
should stay the same for all $N$
), with
arguments for and against.
Aitkin (1998) [5] Comment on Lindley's example in the discussion of Aitkin (1991) [4], pointing out a connection to Simpson's paradox. XXX