

Subset selection for linear regression

Mallows (1964) [255] Apparently the earliest reference to Cp.

Gorman and Toman (1966) [167] Properties of Cp.

Lindley (1968) [245] Bayesian / decision-theoretic formulation of the subset selection problem in linear regression.

Darlington (1968) [104] Various comments on the use and interpretation of multiple linear regression, very much in the spirit of MSEP and cross-validation ideas. Includes comments on subset selection and relative importance of predictors.

Kennard (1971) [220] Cp and adjusted R2.

Mallows (1973) [256] Linear models, future X fixed. Discussion of Cp and plots of it vs. p. Notes on use in the context of "best subset selection". Similar measures for ridge regression etc.
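The criterion itself is simple to state: $C_{p} = \mathrm{RSS}_{p}/\hat{\sigma}^{2} - n + 2p$, where $\hat{\sigma}^{2}$ is usually the residual variance estimate from the full model. A minimal sketch (illustrative function and argument names, not from the paper):

```python
def mallows_cp(rss_p, s2_full, n, p):
    """Mallows' Cp for a p-parameter submodel fitted to n observations.

    s2_full is the residual variance estimate from the full model.
    For a submodel with negligible bias, E[RSS_p] = (n - p) * sigma^2,
    so Cp is approximately p; the Cp-vs-p plot then hugs the line Cp = p.
    """
    return rss_p / s2_full - n + 2 * p

# An exactly unbiased submodel with RSS_p = (n - p) * sigma^2 gives Cp = p:
print(mallows_cp(2.0 * (30 - 4), 2.0, 30, 4))  # -> 4.0
```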

Allen (1974) [17] Variable selection for linear regression. Among other things, proposes PRESS $= \sum_{i}(Y_{i}-\hat{Y}_{-i})^{2}$, where $\hat{Y}_{-i}$ is the prediction of $Y_{i}$ from the model fitted without observation $i$, as a criterion (an estimate of MSEP).
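PRESS can be computed by brute-force refitting, but for least squares the identity $Y_{i}-\hat{Y}_{-i} = e_{i}/(1-h_{ii})$ avoids refitting altogether. A sketch for simple linear regression (all names illustrative):

```python
def ols_fit(x, y):
    # simple least squares fit y = a + b*x
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b * xbar, b

def press(x, y):
    # PRESS = sum_i (Y_i - Yhat_{-i})^2, refitting without observation i
    total = 0.0
    for i in range(len(x)):
        a, b = ols_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (a + b * x[i])) ** 2
    return total

def press_via_leverage(x, y):
    # least-squares identity: Y_i - Yhat_{-i} = e_i / (1 - h_ii)
    n = len(x)
    a, b = ols_fit(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    total = 0.0
    for xi, yi in zip(x, y):
        e = yi - (a + b * xi)
        h = 1.0 / n + (xi - xbar) ** 2 / sxx  # leverage in simple regression
        total += (e / (1.0 - h)) ** 2
    return total
```

The two functions agree to rounding error, which is a useful check when implementing PRESS for larger models.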

Diehr and Hoflin (1974) [112] Simulations of the distribution of R2 for the selected model in best subset regression, illustrating the inflation of R2 in such cases. Approximate formulas for tail probabilities.
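The inflation is easy to reproduce: with pure-noise predictors, the R2 of the predictor selected as "best" is the maximum of k draws and so is systematically larger than the R2 of any single prespecified model. A minimal simulation sketch (all names illustrative):

```python
import random

def r_squared(x, y):
    # squared sample correlation = R^2 of a simple linear regression on x
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

random.seed(0)
n, k = 20, 50                     # 20 observations, 50 pure-noise predictors
y = [random.gauss(0, 1) for _ in range(n)]
r2s = [r_squared([random.gauss(0, 1) for _ in range(n)], y)
       for _ in range(k)]
best_r2 = max(r2s)                # R^2 reported for the 'selected' model
mean_r2 = sum(r2s) / k            # R^2 of a typical single candidate
```

Even though no predictor carries any signal, `best_r2` is necessarily at least as large as `mean_r2`, and in practice far larger when k is big.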

Narula (1974) [281] MSEP for regression where predictors are (normal) random variables. Estimated MSEP as a criterion for choosing a subset of predictors. Also a shrinkage approach with estimated best (for MSEP) shrinkage parameter.

Browne (1975a) [63] Linear regression. Description vs. prediction, with emphasis on the latter. Mean and variance of the squared correlation between a new data point and its prediction, illustrating the effects of overfitting. Estimates of the correlation without validation data (i.e. using the calibration data only).

Browne (1975b) [62] Like Browne (1975a) [63], but considering MSEP instead of squared correlation. Two estimates, one using one sample only, the other with data split into two.

Berk (1978) [36] Comparing results of backward and forward selection and all subsets regression in theory and examples.

Rencher and Pun (1980) [313] Simulations and asymptotic formula for the distribution of R2 for the model selected using some subset selection method.

Hjorth and Holmqvist (1981) [186] Considers cross-validation where the subset selection procedure is repeated for every validation data set to avoid selection bias. The setting is multivariate autoregressive models, but the idea applies to linear models as well.

Kempthorne (1984) [219] Linear models with $\sigma^{2}$ known, subset selection procedures combined with least squares estimates. Shows that all such procedures are admissible with respect to expected squared error of prediction, i.e. none gives estimates which are uniformly worse than estimates obtained in some other way.

Miller (1984) [273] A discussion paper covering some of the topics in Miller (1990) [272].

Miller (1990) [272] Book on subset selection for linear models, when the objective is goodness of prediction. Description of various subset selection procedures, with much computational detail. Discussion and estimates of selection bias, i.e. bias in estimated coefficients and RSS given that a variable is selected (this depends on the selection procedure). Estimates of MSEP for fixed and random future Xs, discussing the assumptions behind, and connections between, Cp, adjusted R2, AIC etc.
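The criteria being related here can be written down compactly. For a Gaussian linear model with $p$ parameters and $\sigma^{2}$ profiled out, a sketch (the usual additive constant in AIC is omitted; names illustrative):

```python
import math

def adjusted_r2(rss_p, tss, n, p):
    # Adjusted R^2: penalizes added parameters through the
    # degrees-of-freedom correction on the residual variance.
    return 1.0 - (rss_p / (n - p)) / (tss / (n - 1))

def aic_gaussian(rss_p, n, p):
    # AIC = n*log(RSS/n) + 2p, up to an additive constant that does
    # not depend on the model, for a Gaussian linear model.
    return n * math.log(rss_p / n) + 2 * p
```

Like Cp, both trade goodness of fit (RSS) against model size, which is why the book can discuss them within one MSEP framework.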

Hurvich and Tsai (1990) [197] Subset selection for normal linear models. Considers coverage of confidence intervals for parameters when the subset is selected using some criterion (AIC and BIC are considered). Simulations show that the coverage rate conditional on the selected order is much smaller than the nominal rate. A data-splitting approach is suggested instead.

Breiman (1992) [58] Subset selection for normal linear models. Having obtained a sequence of models of different dimensions using some selection procedure, how to (1) select the dimension of the final model, and (2) evaluate the MSEP (or rather the "model error" $E\Vert\hat{Y}-\mu\Vert^{2}$). Compares $C_{p}$ (whose assumptions do not hold when the sequence of models is first selected using a selection procedure), the bootstrap, data splitting and the "little bootstrap" proposed in the paper. Large simulation to compare their properties. Here $\mathbf{X}$ is regarded as fixed, with notes also on the $\mathbf{X}$-random case (more on that in Breiman and Spector (1992) [59]).

Breiman and Spector (1992) [59] The same situation as in Breiman (1992) [58] but for the X-random case, where the future data are assumed to be an independent sample from the same distribution as the observed (Y,X). Large simulations comparing complete and "V-fold" cross-validation, the bootstrap, and "partial C-V", in which the subset selection exercise is not repeated for every C-V data set as in the other approaches (this performs very badly). Nicely written.
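The distinction between complete and partial C-V can be sketched in code: complete C-V redoes the selection step (here, choosing the single best predictor by training RSS) inside every fold, while partial C-V fixes the choice once on all the data, letting selection bias leak into the error estimate. All names are illustrative, not from the paper:

```python
def ols_fit(x, y):
    # simple least squares fit y = a + b*x
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b * xbar, b

def train_rss(x, y):
    a, b = ols_fit(x, y)
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def cv_mse(X, y, V, reselect):
    # X: list of candidate predictor columns; the 'model' is a simple
    # regression on one column, chosen by smallest training RSS.
    n = len(y)
    folds = [list(range(v, n, V)) for v in range(V)]
    if not reselect:
        # partial C-V: select once on the full data
        j_fixed = min(range(len(X)), key=lambda j: train_rss(X[j], y))
    sse = 0.0
    for test in folds:
        train = [i for i in range(n) if i not in test]
        yt = [y[i] for i in train]
        if reselect:
            # complete C-V: repeat the selection step within every fold
            j = min(range(len(X)),
                    key=lambda j: train_rss([X[j][i] for i in train], yt))
        else:
            j = j_fixed
        a, b = ols_fit([X[j][i] for i in train], yt)
        sse += sum((y[i] - (a + b * X[j][i])) ** 2 for i in test)
    return sse / n

import random
random.seed(1)
y = [random.gauss(0, 1) for _ in range(30)]
X = [[random.gauss(0, 1) for _ in range(30)] for _ in range(5)]
full_cv = cv_mse(X, y, 5, reselect=True)       # honest error estimate
partial_cv = cv_mse(X, y, 5, reselect=False)   # selection done outside C-V
```

With pure-noise predictors, the partial version typically reports a smaller (over-optimistic) error, which is the failure mode the paper documents.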

Ronchetti and Staudte (1994) [320] Proposes a robust version of Cp.

Weiss (1995) [382] Variable selection for GLMs (linear models in particular). Considers "influence measures" for assessing how much the posterior of some parameter under the unrestricted model changes when some other parameters are restricted (e.g. set to zero), i.e. essentially compares marginal and conditional priors.


Jouni Kuha 2003-07-16