Lindley (1957) [244]
Lindley's paradox: For a test of
$H_{0}:\theta=\theta_{0}$
vs.
$H_{1}: \theta \neq \theta_{0}$
which is just significant at a fixed level, the
posterior probability $P(H_{0}|y)$
grows
with $n$
. Discussion of the reasons for this, of priors, and of
different kinds of null hypothesis: whether we believe they can be
exactly true, i.e. whether $\theta_{0}$
is fundamentally
different from any $\theta\ne \theta_{0}$
.
Effect of sample size on p-value vs.
$P(H_{0}|y)$
. In a comment, Bartlett (1957)
[27] discusses the possibility of letting the sample
size depend on the distance between null and (single) alternative,
implicitly raising the possibility of letting the prior depend on sample
size. He also points out an error; the corrected result shows that the
prior may not be improper, i.e. must be proper.
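As a numerical illustration of the paradox (illustrative choices, not from the paper): take a normal mean with known $\sigma$, $H_{0}:\theta=0$, a $N(0,\tau^{2})$ prior on $\theta$ under $H_{1}$, equal prior odds, and hold the test statistic fixed at $z=1.96$, just significant at the 5% level:

```python
import math

def post_prob_h0(n, z=1.96, sigma=1.0, tau=1.0, pi0=0.5):
    """P(H0 | y) when the observed z-statistic is just significant.

    Illustrative setup (not from the paper): normal mean, known sigma,
    H0: theta = 0, N(0, tau^2) prior on theta under H1, prior odds
    pi0/(1 - pi0).  Under H0, ybar ~ N(0, sigma^2/n); under H1,
    ybar ~ N(0, sigma^2/n + tau^2).
    """
    # Bayes factor B01 evaluated at ybar = z * sigma / sqrt(n)
    b01 = math.sqrt(1 + n * tau**2 / sigma**2) * \
        math.exp(-0.5 * z**2 * n * tau**2 / (sigma**2 + n * tau**2))
    odds = (pi0 / (1 - pi0)) * b01   # posterior odds of H0
    return odds / (1 + odds)

for n in (10, 100, 1000, 100000):
    print(n, round(post_prob_h0(n), 3))
```

With the p-value pinned at 0.05, $P(H_{0}|y)$ climbs from about 0.37 at $n=10$ towards 1 as $n$ grows.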
Edwards et al. (1963) [118] A review of Bayesian statistics (with a personal probability point of view) for a psychological audience. Includes a long section on hypothesis testing, discussing the often contradictory results of classical and Bayesian testing.
DeGroot (1973) [106] Describes specific classes of alternative hypotheses and prior distributions, for which p-value is approximately equal to the probability that the alternative hypothesis is true.
Dickey (1977) [111] Considers whether the p-value is, in general, a good approximation of the Bayes factor (it is not).
Shafer (1982) [334] Nice description of Lindley's paradox [244] and treatment of the same situation using the theory of belief functions. Example from forensic science (comparing refractive indices of fragments of glass) and discussion of the empirical prior in it. In discussion, see e.g. De Groot, Good and Hill.
Berger and Sellke (1987) [35]
Comparing p-values and BF for point null hypotheses
$H_{0}: \theta=\theta_{0}$
vs.
$H_{1}: \theta\neq \theta_{0}$
. Computes
lower bounds for $p(H_{0}|x)$
for different
classes of priors (for $\theta$
under
$H_{1}$
) and shows that this is (in the
examples considered) always higher than the corresponding p-value, i.e.
p-value overstates the evidence against
$H_{0}$
. C.f.
Casella and Berger (1987) [74]. In
discussion, further comments on p-values vs. Bayesian tests, the role
of point null hypotheses etc. (see e.g. Hinkley on model selection and
Vardeman on priors implied by point nulls).
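The headline bound can be reproduced directly (a sketch; the helper names are mine): for a normal point null, the Bayes factor in favour of $H_{0}$ over all priors under $H_{1}$ is bounded below by $f(x|\theta_{0})/\sup_{\theta}f(x|\theta)=e^{-z^{2}/2}$; with prior probability 1/2 on $H_{0}$ this yields a lower bound on $P(H_{0}|x)$:

```python
import math

def two_sided_pvalue(z):
    """Two-sided normal p-value, P(|Z| >= z)."""
    return math.erfc(z / math.sqrt(2))

def lower_bound_post_h0(z, pi0=0.5):
    """Lower bound on P(H0 | x) over all priors on theta under H1,
    for a normal point null: the Bayes factor satisfies
    B >= f(x | theta0) / sup_theta f(x | theta) = exp(-z^2 / 2)."""
    b_lower = math.exp(-z**2 / 2)
    odds = (pi0 / (1 - pi0)) * b_lower
    return odds / (1 + odds)

for z in (1.645, 1.96, 2.576):
    print(round(two_sided_pvalue(z), 3), round(lower_bound_post_h0(z), 3))
```

At $z=1.96$ (p $\approx$ 0.05) the bound is about 0.128, more than twice the p-value, which is the sense in which the p-value overstates the evidence against $H_{0}$.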
Casella and Berger (1987) [74]
Same problem as in Berger and Sellke (1987)
[35], but now for the one-sided
hypothesis $H_{0}: \theta\leq 0$
vs.
$H_{1}: \theta>0$
. In this case p-values
$p(x)$
and Bayesian tests can be
`reconciled', i.e.
$\inf P(H_{0}|x)\leq p(x)$
for a very large class of priors. Argues that a point
prior cannot be regarded as `impartial' because it concentrates
probability mass at a point. [Note that this is a different Berger than
in [35]]
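The one-sided reconciliation can be seen in the simplest normal example (a sketch using the flat prior, which attains the bound): with an improper uniform prior on $\theta$, the posterior is $N(\bar{x},\sigma^{2}/n)$, so $P(\theta\leq 0|x)=\Phi(-z)$, exactly the one-sided p-value:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def one_sided_pvalue(z):
    """p(x) = P(Z >= z) for testing H0: theta <= 0."""
    return 1.0 - phi(z)

def post_prob_h0_flat(z):
    """P(theta <= 0 | x) under a flat prior on theta: the posterior
    is N(xbar, sigma^2/n), so the probability is Phi(-z)."""
    return phi(-z)

for z in (0.5, 1.0, 1.645, 1.96):
    print(z, round(one_sided_pvalue(z), 4), round(post_prob_h0_flat(z), 4))
```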
Verdinelli and Wasserman (1996) Shows that under some conditions the BF for a precise null hypothesis approximates the BF for an `imprecise' (i.e. short-interval) hypothesis, even in the presence of nuisance parameters. Suggests priors based on this result.
Weakliem (1998b) [380]
Significance levels in classical and Bayesian (BF) hypothesis testing.
For sharp null hypotheses ($\theta=0$
vs. $\theta\neq 0$
) BF is
generally more conservative than classical tests: a just
significant result for a classical test is typically only weak evidence
for the alternative hypothesis or could even be evidence against it
(with appropriate choice of prior BF can be arbitrarily strongly in
favour of the null). To reduce the disagreement, W proposes the use of
one-sided hypotheses ($\theta>0$
vs. $\theta<0$
). Decision criteria
tested using data from GSS: tests applied to a subsample and
conclusions assessed by comparing to results in the full data. Tests of
sharp hypotheses do less well (the conclusion changes; most often the
null hypothesis is not rejected in the small sample but is rejected when
more data become available) than one-sided hypotheses (conclusion changes rarely).
C.f. [74]. [It is not clear to me how this
would be applicable to model selection. If (paraphrasing) `no effects
are ever zero', should we always entertain the largest possible model
everywhere? This is again related to the question of the nature of
(sharp) null hypotheses (interesting in themselves or not).]
Efron and Gous (1998) [123]
Scales of evidence for significance tests and BF, with an attempt to
`reconcile' the two. Priors coincide with conventional rules for
significance tests; uses an approximate formula
$\baf(D)\approx L^{2}(D)/L^{2}(D_{0})$
where
$D_{0}$
is hypothetical data such that
$\baf(D_{0})=1$
. Discussion of Bayesian
`sample size coherency' (problem specification, including the prior,
should stay the same for all $N$
), with
arguments for and against.
Aitkin (1998) [5] Comment on Lindley's example in the discussion of Aitkin (1991) [4], pointing out a connection to Simpson's paradox. XXX