Cox and Hinkley (1978) [95] Derivation of an approximation of BF; BIC is a special case. Discussion of the choice of priors for the two models. (A solution to an exercise in Cox and Hinkley (1974) [96].)
Schwarz (1978) [332] BIC obtained as a model selection criterion for linear exponential family models with bounded priors. Motivated as a method of choosing the dimensionality of a model, e.g. the degree of a polynomial regression or the order of a Markov chain.
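The polynomial-regression case is easy to illustrate: under Gaussian errors, BIC for degree $d$ reduces to $n\log(\mathrm{RSS}/n) + k\log n$ with $k=d+1$ parameters, and the criterion is minimised over $d$. A minimal sketch (the data-generating setup here is my own illustration, not Schwarz's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data whose true model is a quadratic (degree 2).
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.1, n)

def bic(degree):
    """BIC for polynomial regression with Gaussian errors:
    n*log(RSS/n) + k*log(n) with k = degree + 1 coefficients
    (a constant shift, e.g. for sigma^2, does not affect the comparison)."""
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    return n * np.log(rss / n) + (degree + 1) * np.log(n)

scores = {d: bic(d) for d in range(6)}
best = min(scores, key=scores.get)
print("selected degree:", best)
```

The $\log n$ penalty is what distinguishes this from AIC's constant penalty of 2 per parameter; for $n = 200$ each extra coefficient costs about 5.3 units.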
Smith and Spiegelhalter (1980) [342] BF for choice between nested linear models (main conclusions hold more generally) under different priors for the parameters. A prior with constant (w.r.t. $n$) variance leads to a BIC-type BF; a prior for the larger model which gives nonnegligible weight to a neighbourhood of the smaller model gives an AIC-type BF (with factor 3/2 instead of 2). General discussion of model selection criteria of the type $L^{2}-k \Delta_{df}$ with $k$ constant or a function of $n$. Lindley's paradox.
Pericchi (1984) [292] Suggests assigning prior model probabilities in a way (based on `expected gains in information about the parameters') that avoids Lindley's paradox. In some cases leads to penalised criteria with a constant penalty term or even to just the deviance. Normal linear model as an example. [Did not follow this very well.]
Haughton (1988) [180] Gives a[n even] more precise statement of the results of Schwarz (1978) [332] and extends them to the case of the curved exponential family. Consistency of BIC under certain conditions.
Kass and Vaidyanathan (1992) [214] Testing a sharp null hypothesis (nested models). Laplace approximation of BF and its accuracy. Sensitivity of the result to changes in the prior: (1) insensitivity to the prior of the nuisance parameters under `null orthogonality' and when the true value of $\theta$ is close to the null value; (2) lower bound for BF over all normal priors centered at $\theta_{0}$; (3) transformation of BF from one set of priors to another. An example demonstrates these and the sensitivity of BF to the prior variance of $\theta$.
McCulloch and Rossi (1992) [265] BFs for hypotheses which involve nonlinear restrictions. Projection methods to define priors from priors for the unrestricted models. Monte Carlo integration for the computations.
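The Monte Carlo step amounts to estimating the marginal likelihood $p(y|M)=\int p(y|\theta)\,p(\theta)\,d\theta$; in its simplest form one draws $\theta$ from the prior and averages the likelihood. A toy conjugate check (a single $N(\mu,1)$ observation with a $N(0,1)$ prior, so the exact marginal is $N(0,2)$) — this setup is my own, not from the paper:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

y = 1.0                              # one observation, y ~ N(mu, 1)
mu = rng.normal(0.0, 1.0, 500_000)   # draws from the prior mu ~ N(0, 1)

# Simple Monte Carlo: p(y) ~= average over prior draws of p(y | mu)
lik = np.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)
mc = lik.mean()

# Conjugacy gives the exact marginal density: y ~ N(0, 2)
exact = math.exp(-y**2 / 4) / math.sqrt(4 * math.pi)
print(mc, exact)
```

Averaging over prior draws is inefficient when the likelihood is concentrated relative to the prior, which is one reason for the more elaborate schemes in the literature.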
Kass and Wasserman (1995) [215] Choice of reference priors for BF when comparing two nested models, i.e. testing the hypothesis $H_{0}: \psi=\psi_{0}$, with nuisance parameters $\beta$. Assume $\psi$ and $\beta$ null orthogonal and the prior for $\beta$ the same for both models. Laplace approximation for BF. For the prior for $\psi$ under $H_{1}$, assume (a) elliptical symmetry, (b) information equal to the information in one observation. If (a)+(b)+prior normal, get BIC, an approximation of log BF with $O(n^{-1/2})$ error. If the prior is Cauchy, get BIC + constant, also a version of a criterion by Jeffreys (error $O(n^{-1/2})$). Examples.
Raftery (1996) [309] Approximations of BF based on the Laplace approximation. One further assumption / approximation yields BIC. Applied to generalized linear models. A set of proper reference priors based on the null hypothesis of no predictors; choice of parameters for the prior. Mainly choice of predictors; choice of link functions and error distributions / variance functions also discussed. Model averaging. Raftery (1994) [306] is the same paper in technical-report form, with some further numerical results. Raftery (1988) [305] is an even earlier version, with a different example on social class and educational achievement.
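The Laplace-then-BIC route can be checked on a toy case with a closed-form answer: $y$ successes in $n$ Bernoulli trials with a uniform prior on $\theta$, where the exact marginal likelihood is $1/(n+1)$. Laplace expands $\log p(y|\theta)p(\theta)$ around its mode; dropping the $O(1)$ Hessian and prior terms leaves the BIC-style approximation. The numbers below are my own illustration, not from the paper:

```python
import math

n, y = 100, 60
theta_hat = y / n                       # mode of the flat-prior posterior (= MLE)

def log_lik(theta):
    return (math.log(math.comb(n, y))
            + y * math.log(theta) + (n - y) * math.log(1 - theta))

# Laplace: log p(y) ~= log p(y|th*) + log prior(th*)
#                      + 0.5*log(2*pi) - 0.5*log|H|,
# with H = -(d^2/dtheta^2) log lik at the mode = n / (th*(1-th*)), prior = 1.
hess = n / (theta_hat * (1 - theta_hat))
log_laplace = (log_lik(theta_hat)
               + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(hess))

# BIC-style: drop the O(1) terms, keep log p(y|th*) - (d/2)*log n, d = 1.
log_bic = log_lik(theta_hat) - 0.5 * math.log(n)

log_exact = -math.log(n + 1)            # Beta integral: p(y) = 1/(n+1)
print(log_exact, log_laplace, log_bic)
```

The Laplace value lands very close to the exact log marginal, while the BIC value carries the expected $O(1)$ error from the dropped constants.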
Hsiao (1997) [193] Laplace approximation to $p(y|M)$ when the integrand has a boundary mode. The resulting approximation is a small modification of BIC when only one parameter is on the boundary. Unlike in the standard case, this approximation always has at least $O(1)$ relative error.
Pauler (1998) [289] Variable selection in normal linear models, nested hypotheses. Approximation of BF under a fairly general informative prior [nothing new there]; particular choices of prior lead to BIC and other proposed criteria. Comparison of these in examples. Careful discussion of conditions and assumptions. Also considers mixed linear models (choosing fixed effects). There a key problem is the determination of `$n$' in the BIC formula (the order of the determinant of the information matrix). This depends on which hypotheses are tested and which random effects (if any) are associated with each fixed effect. A dramatic example where the effective sample size is much smaller than the total sample size and leads to a very different conclusion. [Read this again]