Takeuchi (1976) [363] First to propose TIC as a more accurate estimate of the expected log-likelihood than AIC. Not read, as my Japanese is a bit shaky.
Findley (1985) [131] Linear time series models, stationary Gaussian ARMA as example. Asymptotic bias of AIC (i.e. of observed likelihood) when true model is in the class considered, or is not but is close.
Hurvich and Tsai (1989) [196] Small-sample correction of AIC, extending results of [359] to nonlinear regression (still with normal errors) and Gaussian time series models. Notes on the small-sample bias of AIC, and examples in simulations.
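For reference (not quoted from the paper, which uses a regression-specific parametrisation), the usual form of the correction, with $k$ the total number of estimated parameters and $n$ the sample size, is
\[
\text{AIC}_{c} \;=\; -2\log L(\hat\theta) + 2k + \frac{2k(k+1)}{n-k-1}
\;=\; \text{AIC} + \frac{2k(k+1)}{n-k-1},
\]
which converges to AIC as $n\to\infty$ with $k$ fixed but penalises much more heavily when $k$ is a substantial fraction of $n$.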
Hurvich et al. (1990) [194] Gaussian ARMA models, containing the true model. $\text{AIC}_{c}$ is still biased, improved version proposed. Shows that the bias term is asymptotically (and even in small samples approximately) independent of $\theta_{0}$; estimated by simulation and tabulated. Simulations show improvement over $\text{AIC}_{c}$, especially when $d/N>1/2$.
Hurvich and Tsai (1991) [198] Normal linear regression and normal AR, when the true model is not included in the candidates. Bias of AIC and $\text{AIC}_{c}$. Simulations showing that the bias of $\text{AIC}_{c}$ can be much smaller than that of AIC, and model selections better, also compared to results from BIC.
Ripley (1995) [314] Model selection for a neural network: choosing degree of smoothness and number of `hidden units' rather than covariates. Cross validation, AIC and NIC; BF and model averaging briefly.
Ripley (1996) [315] Model selection in the context of pattern recognition and neural networks. Discusses AIC, NIC and BIC, as well as general issues.
Konishi and Kitagawa (1996) [226] Derives estimates of the expected log-likelihood (as in the derivation of AIC) when (i) the true model is not necessarily in the class considered, and (ii) parameters can be estimated by other than MLE (e.g. robust, penalised likelihood or Bayesian estimates). With (i) + MLE, we get TIC. Illustrates the bias in a normal mixture example. Also describes bootstrap estimation of the expected log-likelihood (giving a criterion called EIC). A nice summary of the entropy maximisation paradigm in the introduction of the paper.
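To recall the criterion referred to here (standard form, not quoted from the paper): with $\hat J$ the average negative Hessian of the log-likelihood at the MLE and $\hat I$ the average outer product of score vectors,
\[
\text{TIC} \;=\; -2\log L(\hat\theta) + 2\,\mathrm{tr}\!\bigl(\hat J^{-1}\hat I\bigr),
\qquad
\hat J = -\frac{1}{n}\sum_{i=1}^{n}
  \frac{\partial^{2}\log f(y_{i};\theta)}{\partial\theta\,\partial\theta^{\top}}\bigg|_{\hat\theta},
\quad
\hat I = \frac{1}{n}\sum_{i=1}^{n}
  \frac{\partial\log f(y_{i};\theta)}{\partial\theta}
  \frac{\partial\log f(y_{i};\theta)}{\partial\theta^{\top}}\bigg|_{\hat\theta}.
\]
Under a correctly specified model $\mathrm{tr}(\hat J^{-1}\hat I)\approx k$, the number of parameters, and TIC reduces to AIC.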
Fujikoshi and Satoh (1997) [150] Multivariate normal linear models, where the largest model contains the true model (but some of the submodels considered do not, i.e. they are underspecified). Derives a corrected AIC which is consistent ($O(n^{-1})$) for underspecified models and better than AIC (with $O(n^{-2})$ bias) for overspecified models. Similar exercise for $C_{p}$.
Kieseppä (1997) [224] Paper from a philosophy of science journal. Shows (without putting it quite like this) that for normal linear models AIC is unbiased for expected log-likelihood even when the model does not hold.
Shi and Tsai (1998) [336] Generalises AIC in several ways: (i) M-estimators rather than MLEs, (ii) expected `K-L' distance between estimating functions rather than log-likelihoods, (iii) small-sample corrections. Simulations to compare performance in model identification.
Hurvich et al. (1998) [195] Selection of smoothing parameter for nonparametric estimation of an unknown smooth regression function. Derives improved (less biased) versions of AIC; the simplest is essentially a generalisation of $\text{AIC}_{c}$ to this case.
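If I recall correctly, the simplest version, for a linear smoother with hat matrix $H$ (fitted values $\hat y = Hy$, residual variance estimate $\hat\sigma^{2} = n^{-1}\sum_{i}(y_{i}-\hat y_{i})^{2}$), takes $\mathrm{tr}(H)$ as the effective number of parameters:
\[
\text{AIC}_{c} \;=\; \log\hat\sigma^{2} + 1 + \frac{2\bigl(\mathrm{tr}(H)+1\bigr)}{n-\mathrm{tr}(H)-2},
\]
which, up to additive constants, recovers the parametric $\text{AIC}_{c}$ when $H$ is a rank-$k$ projection.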