

AIC and related methods

Kullback and Leibler (1951) [230] Original paper on Kullback-Leibler information.
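
For reference, the quantity introduced in this paper, the Kullback-Leibler information of a density $g$ relative to the true density $f$, is

$$ I(f;g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx, $$

which is nonnegative and zero only if $g=f$ almost everywhere. The criteria reviewed in this section can be read as estimates of its expectation over repeated samples.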

Akaike (1973) [6] Information-theoretic motivation of AIC. Shows that minimizing AIC is approximately equivalent to minimizing the expected Kullback-Leibler distance between the true density and the estimated density. The context is selecting the `order' of a model (e.g. number of factors, number of explanatory variables, order of an AR process) when the candidate models form a nested sequence and a `maximal' model is defined. Reprinted in Kotz and Johnson (1992) [227] with a nice introduction by J. de Leeuw.

Akaike (1974) [7] Model selection, for time series analysis in particular. Suggests selecting the model (+ resulting parameter estimates) which minimizes AIC.
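
For reference, the criterion itself: for a candidate model with $k$ free parameters and maximized log-likelihood $\log L(\hat{\theta})$,

$$ \mathrm{AIC} = -2 \log L(\hat{\theta}) + 2k, $$

and the model (together with its maximum likelihood parameter estimates) minimizing AIC over the candidate set is selected.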

Tong (1975) [368] Use of AIC to determine the order of a Markov chain.

Akaike (1977) [8] Outlines Akaike's model selection paradigm: (1) the purpose of statistical inference is estimation of the probability distributions of observed data, (2) the performance of estimation is evaluated by the goodness of fit of the estimated to the true distribution, (3) goodness of fit is measured by probabilistic entropy [equivalently, Kullback-Leibler information], (4) the aim is to maximize expected entropy, thus yielding an estimate of both order and parameters. Relation to maximum likelihood. AIC is compared to BIC in simulations of a polynomial example.

Sawa (1978) [328] Proposes and analyzes statistical criteria for model identification. No true model is assumed; the aim is to identify the most adequate model (according to K-L information) among a set of alternatives. A good description of AIC. Questions AIC's assumption that the models are nearly correct, and proposes (for linear models) an alternative which avoids this (called BIC, but I do not think it is the same as standard BIC). Also decision rules based on Bayes risk, i.e. expected loss with respect to a posterior distribution.

Zellner (1978) [394] Nested normal linear models. Approximation of the Bayes factor (BF) under relatively flat priors, noting that AIC is not a good approximation of this.

Stone (1979) [358] Compares asymptotic properties of AIC and BIC. Questions asymptotics where the model is held fixed as $N$ increases.

Leamer (1979) [235] Argues that an information-theoretic selection criterion (e.g. AIC) cannot be regarded as a Bayesian estimation (decision) procedure, as the latter never recommends anything but the full model (`does not encourage parsimony'), unless this is built into the loss function. See also Chow (1981) [79] for a discussion of this.

Atkinson (1980) [23] Optimum $\alpha$ in generalised AIC for prediction purposes. Simulations of normal linear regression, MSE of prediction as criterion, $\alpha=1,2,3,4,5,6$ considered, optimum varies between 2 and 6. A nice discussion of the asymptotics of choosing the penalty term, including BIC.
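
The generalised criterion in question replaces the constant 2 in the AIC penalty: for a model with $p$ parameters one minimizes

$$ -2 \log L(\hat{\theta}) + \alpha p, $$

so that $\alpha = 2$ recovers AIC and $\alpha = \log n$ recovers BIC.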

Akaike (1981) [10] Considers links between entropy maximization principle (leading to AIC) and Bayesian model selection. When priors are vague, Bayesian approach is problematic. Akaike's approach essentially replaces (log) priors and likelihoods with predictive equivalents, which are expectations with respect to a predictive distribution (as in derivation of AIC). Comments on BIC. [Must say I do not understand this very well.]

Atkinson (1981) [24] Review of issues in order selection. Nothing new, but nicely written. Compares AIC, Bayesian approach, BIC, $C_{p}$ and others. Discusses consistency, finite sample properties and prediction performance.

Chow (1981) [79] Three main parts, the last the most interesting. First derives TIC/NIC as an alternative to AIC. Second gives the familiar derivation of BIC as an approximation of posterior odds. Third compares the procedures. Emphasises that the two have different aims: one (AIC) to find the best prediction, the other (BIC) to find the model with the highest posterior probability of being true.
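
The familiar derivation referred to is the Laplace (Schwarz) approximation: for a model $M$ with $k$ parameters and sample size $n$,

$$ -2 \log p(y \mid M) \approx -2 \log L(\hat{\theta}) + k \log n = \mathrm{BIC}, $$

up to terms that remain bounded as $n$ grows, so differences in BIC approximate $-2$ times the log posterior odds under equal prior model probabilities.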

Shibata (1981) [339] Linear regression model, assuming number of parameters infinite or $O(n)$. Optimal selection criteria to minimise MSEP. Asymptotic equivalence of AIC and $C_{p}$, both optimal in this sense; BIC not optimal. Also considers subset selection in rather formalised situations.
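
For reference, Mallows' $C_{p}$ for a linear submodel with $p$ coefficients is

$$ C_{p} = \frac{\mathrm{RSS}_{p}}{\hat{\sigma}^{2}} - n + 2p, $$

where $\mathrm{RSS}_{p}$ is the submodel's residual sum of squares and $\hat{\sigma}^{2}$ is the error variance estimate from the largest model; in normal linear models this ranks candidates essentially as AIC does, which is the equivalence used here.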

Katz (1981) [217] Selecting the order of a Markov chain. Asymptotic distribution of the model selected by AIC (inconsistent) and BIC (consistent). Simulations of small-sample behaviour.

Woodroofe (1982) [391] Properties of AIC and $C_{p}$ (and a bit on BIC) for nested models using the arc sine law (whatever that is).

Akaike (1983) [11] AIC as a BF when the prior variance is comparable to data variance. Comparison of AIC and BIC and discussion of the `generalized AIC' with 2 replaced by another constant.

Shibata (1984) [340] Normal linear models and generalized FPE criterion (which here includes all the usual penalized criteria). Asymptotic and finite-sample formulas for the MSEP of the selected models. Suggestions for the choice of the penalty constant.

Nishii (1984) [285] Selection of variables for a normal linear model. Derives asymptotic distributions of the selected model and a quadratic risk for AIC (and the criteria FPE, $C_{p}$ and PSS, which are shown to be equivalent to AIC in this respect) and BIC. (In fact, both are generalized in that the penalty 2 in AIC can be some other constant and $\log n$ in BIC some other function of $n$.) AIC is inconsistent (it has a positive probability of selecting a model which contains the true model as a proper subset) while BIC is consistent.
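
A minimal simulation sketch of this contrast (the setup and all names here are illustrative, not taken from the paper): fit nested polynomial regressions by maximum likelihood and compare the two criteria. With the growing penalty $\log n$ the probability of selecting a model strictly containing the true one vanishes as $n$ increases, whereas with the fixed penalty 2 it does not.

\begin{verbatim}
import numpy as np

def gaussian_fit(y, X):
    """Least-squares fit; returns maximized log-likelihood and parameter count."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    sigma2 = rss / n                       # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik, X.shape[1] + 1          # coefficients plus the variance

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)     # true model: polynomial of order 1

for order in range(5):                     # nested candidates of order 0..4
    X = np.vander(x, order + 1, increasing=True)
    loglik, k = gaussian_fit(y, X)
    aic = -2 * loglik + 2 * k
    bic = -2 * loglik + np.log(n) * k
    print(f"order {order}: AIC = {aic:8.1f}, BIC = {bic:8.1f}")
\end{verbatim}

Repeating the simulation many times and varying $n$ shows AIC retaining a positive probability of choosing an order above 1, while BIC's overfitting probability shrinks towards zero.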

Akaike (1985) [12] Review article on `prediction and entropy'. Entropy and information, relation to maximum likelihood, emphasising interpretation from the predictive point of view. Derivation of AIC. Links (somewhat obscure to me) to a Bayesian approach.

Bozdogan (1987) [53] Gives the theory and derivation of AIC quite well. Also defines a consistent (for model order) `modification' of AIC (which is actually very close to BIC) derived along the same lines as AIC [I did not quite understand the derivation].
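
The consistent modification defined here is

$$ \mathrm{CAIC} = -2 \log L(\hat{\theta}) + k (\log n + 1), $$

whose penalty grows with $n$ (hence the consistency for the model order) and which indeed differs from BIC only by the additive term $k$.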

Akaike (1987) [13] Defines AIC for factor analysis models. Also presents a Bayesian approach (and an associated `Bayesian' AIC) to avoid undefined likelihoods due to overparametrized models.

Takane et al. (1987) [362] Proposes a semiparametric discriminant analysis method and AIC for it (with `likelihood' and `number of parameters' appropriately defined).

Bozdogan and Ramirez (1988) [54] Describes a computer program for fitting factor analysis models and selecting the number of factors using AIC and Bozdogan's [53] `consistent AIC' CAIC.

Read and Cressie (1988) [312] [BOOK] Brief discussion of AIC and BIC for categorical data in a book concentrating on goodness-of-fit using the power-divergence statistic.

Tong (1990) [369] In a book, derivation and discussion of AIC, especially for time series modelling. Brief comments on cross-validation, prequential approach and BIC.

Forster and Sober (1994) [143] A philosophy of science paper describing AIC. Discusses the penalty term and how this motivates the desire for simplicity in models.

Kukla (1995) [229] Comment on Forster and Sober (1994) [143]. Essentially discusses the apparently paradoxical results from using AIC when some parameters of the family of models are fixed (and then pretended to be known) in light of the data. Forster (1995) [141] gives a response.

Hurvich and Tsai (1995) [199] Modelling a smooth regression function (the truth being nonparametric) with a parametric function chosen by AIC. Derives the rate of convergence of the mean integrated squared error for the AIC-selected estimator relative to the optimal estimator. This is compared to the corresponding rate for nonparametric kernel smoothing with the smoothing constant selected by cross-validation.

Findley and Parzen (1995) [132] An interview with H. Akaike, with observations on the development of AIC.

Forster (1998) [144] AIC for philosophers of science. Lengthy and occasionally somewhat inaccurate description. Compares AIC-based selection to the Bayesian approach.

Burnham and Anderson (1998) [70] A book on model selection from an information-theoretic perspective, using AIC as the key criterion. See a separate review.

McQuarrie and Tsai (1998) [270] A book on model selection criteria in regression (mostly linear) and time series models. Properties of criteria (small-sample and asymptotic moments, signal-to-noise ratios (ST), probabilities of over- and underfitting, etc.) are presented in great detail. Extensive simulations under a variety of situations, both where the true model is included among the candidates and where it is not. Most simulations use very small sample sizes, but a few use large ones. Given this small-sample focus, emphasis is on small-sample adjustments and `ST-adjustments' of AIC.
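
The best-known small-sample adjustment covered in the book is the corrected criterion $\mathrm{AIC}_{c}$ of Hurvich and Tsai, which for a normal linear model with $k$ parameters and sample size $n$ is

$$ \mathrm{AIC}_{c} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}, $$

a bias-corrected version of AIC derived for this setting; it approaches plain AIC as $n \to \infty$.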

Forster (1999) [142] Further explanation of AIC to philosophers; in particular, clarifying the invariance of `k' in the formula to one-to-many transformations of the parameters. Clearest of Forster's papers on the topic (for example, Kullback-Leibler divergence makes its first appearance).


