Davidov et al. (2011): An edited volume with a range of articles on the analysis of cross-national data. The focus is on multigroup latent variable models (including chapters on factor analysis, IRT and latent class models), with much discussion of questions of measurement equivalence. Many of the chapters are summarised separately elsewhere in this bibliography.
Meredith (1964a): Shows, in essence, that if a factor analysis model holds in a population, it also holds (with the same measurement model but a different structural model) in subpopulations of it, as long as selection into the subpopulation does not depend on the observed items. This provides a justification for expecting that cross-group measurement equivalence can exist in general.
Meredith (1964b): Proposes a method for estimating the factor loadings under an assumption of cross-group equivalence, where (i) a factor analysis model (with the same number of factors) is fitted in each of several groups, (ii) the estimated loadings are scaled by observed item standard deviations, and (iii) the scaled loading matrices are rotated to a single loading matrix (chosen by a least-squares-type criterion).
Joreskog (1971): Proposes the standard multi-group factor analysis model (“simultaneous factor analysis in several populations”), and discusses identification and estimation for it. In the model, factor means are assumed to be 0 in each group (and item means vary freely between groups), so only variances and covariances are compared between groups. Proposes the sequence of hypotheses of cross-group equality (observed covariance matrices, number of factors, factor loadings, error variances, covariance matrices of factors) which has subsequently become conventional.
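The nested hypotheses in this conventional sequence are typically compared with likelihood ratio (chi-square difference) tests. A minimal sketch of the arithmetic, using invented fit statistics for a hypothetical configural model and an equal-loadings model:

```python
from scipy.stats import chi2

def chisq_diff_test(chisq_restricted, df_restricted, chisq_free, df_free):
    """Likelihood ratio test comparing two nested multigroup models.

    The more restricted model (e.g. equal loadings across groups) has the
    larger chi-square and degrees of freedom; under the restricted model,
    the difference is itself asymptotically chi-square distributed.
    """
    diff_stat = chisq_restricted - chisq_free
    diff_df = df_restricted - df_free
    p_value = chi2.sf(diff_stat, diff_df)
    return diff_stat, diff_df, p_value

# Hypothetical fit statistics: configural model vs. equal-loadings model.
stat, df, p = chisq_diff_test(chisq_restricted=58.3, df_restricted=40,
                              chisq_free=45.1, df_free=34)
```

A small p-value is then taken as evidence against the additional equality constraints, although in practice fit indices are often consulted alongside the test.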
Sorbom (1974): Multi-group CFA where the factor means are also estimated as model parameters, simultaneously with the other parameters. The model assumes measurement equivalence in the item means, but allows for the possibility of non-equivalence in the other measurement parameters.
Werts et al. (1976): Points out how the multi-group CFA model can also be used to test hypotheses about cross-group equalities of variances, covariances etc. of observed variables, or of ones with known measurement error variances (by setting residual variances to 0 or a fixed value, and so on).
Sorbom (1978): Multi-group structural equation model (with a regression between two latent variables) motivated as an extension of analysis of covariance to allow for measurement error in the covariate. Non-equivalence of measurement is mentioned as a possibility although not included in the two examples (except in the error variances).
Knight (1978): A review of the state of the art of factor analysis as of 1978, including a brief summary of multiple-group analysis.
Alwin and Jackson (1980): A review of the state of the art of factor analysis for multiple indicators in surveys as of 1980. Includes an overview of multiple-group modelling, as in Joreskog (1971). An example with 8 items on occupational aspirations, 2 factors and 2 groups (male vs. female). In the example, all invariance models (from the conventional sequence of tests) are rejected.
Bentler (1980): A broad and informative didactic review of factor analysis (in a broad sense). Very little on multiple-group analysis, but points out in passing that, in that and other contexts, models for mean structures had not often been used by this time.
Wilson (1981): Considers factor analysis models fitted separately to different groups. Points out that structural models are not directly comparable across groups because of arbitrariness of the latent scale, unless sufficiently strong assumptions are made. Suggests a particular scaling method which aims to achieve such comparability (this does not make use of joint modelling of the groups).
Lee and Tsui (1982): Provides asymptotic properties of parameter estimates and chi-squared tests for a multigroup covariance structure (SEM) model, estimated (possibly with across-group constraints) using generalised least squares estimation (with reference to close similarity to ML estimates).
Marsh and Hocevar (1985): An example of multi-group analysis and invariance testing in a higher-order factor analysis model (i.e. one where first-order factors are in turn regarded as measures of a second-order factor), applied to measures of self-concept. A discussion of what invariance means in this context.
Bieber (1986): A procedure, building on the target rotation approach of Meredith (1964b), which aims to identify a hierarchical sequence of factors which are invariant (exist and have the same measurement parameters) in a sequence of subsets of groups (e.g. one which holds for all groups 1, 2, 3, one which is shared only by 2 and 3, etc.).
Byrne et al. (1989): Discusses the case of testing group differences in means and covariances of latent variables, given partial measurement equivalence (i.e. equivalence for some items but not all). Extended example of a four-factor model of academic self-concept, comparing two student groups. Also serves as an authoritative summary of developments in multi-group factor analysis up to this point. Points out that there were previously no examples of examining partial non-invariance: all earlier applied and theoretical papers considered a sequence of tests of invariance in all parameters of one type at a time (e.g. the loadings of all items), and estimation and comparison of mean structures across groups had been rare.
Muthen (1989): Psychometric Society Presidential address. The discussion is formulated in terms of linear factor analysis models, but points out their use also for categorical items through polychoric correlations (with examples). Two main parts, corresponding to two ways of allowing for between-group heterogeneity: (1) MIMIC models where factor means and item intercepts may vary between groups (contrasted to multiple-group analysis, arguing that the latter requires larger samples to be useful); (2) random-effects factor analysis models, with random effects possibly for both factor means and item intercepts.
Watkins (1989): Overview discussion of the advantages of confirmatory vs. exploratory factor analysis. Briefly mentions examination of measurement invariance.
Meredith and Millsap (1992): Discuss the case where a manifest variable (Z) is used as a proxy or substitute for a latent variable (W) when testing measurement invariance or item bias for a variable (Y). Procedures are developed for testing measurement invariance using only observed measurements. They conclude that bias detection procedures which rely strictly on observed variables cannot in general detect measurement bias, or the lack of it.
Steenkamp and Baumgartner (1998): A good concise summary of the conventional terminology and sequence of modelling when examining measurement invariance in factor analysis. Three-country example from consumer research, where a partial invariance model is selected.
Cheung et al. (1999): Discusses methods of identifying non-invariant items in CFA. Begins with a summary of the conventional multigroup CFA procedures. Considers models without means, i.e. for items centred within each group. Notes that testing for invariance in an item requires another item (the referent) to be specified as invariant. This is problematic if the referent is itself actually noninvariant. The paper proposes carrying out the tests with each item in turn as the referent, to identify sets of items which are non-invariant.
Vandenberg and Lance (2000): A review paper that summarises recommended practices for establishing measurement invariance in multi-group factor analysis type models with continuous responses. All the various stages of measurement invariance testing are discussed.
Millsap (2007): Discusses the differences and the relationship between measurement invariance and predictive invariance. No latent variables are involved in testing for predictive invariance, so any link between the two would simplify testing for measurement invariance.
Rijmen, von Davier, and Yamamoto: Addresses item invariance by treating the item parameters as random effects coming from a common distribution.
Carle (2010): Summary of the use of latent variable models for assessing measurement invariance (called DIF), focusing on survey data. Considers linear factor analysis models, including their use for categorical response variables through the latent-response formulation. Standard discussion of types of non-invariance and model assessment through fit indices. An example with survey data, with 2 groups and 6 binary items.
Clogg and Goodman (1984): Clear, and I believe the first, exposition of multiple-group latent class (in the authors’ terms, “latent structure”) models. Fitting as a restricted model for the items+group contingency table (less relevant these days) and discussion of identifiability. Typology of different types of models, distinguishing between models which involve constraints within groups (restricted vs. unrestricted models) and/or between groups (in the authors’ terms, homogenous and fully or partially heterogenous models; completely homogenous is defined to require also equality of latent class probabilities across groups). Model selection using likelihood ratio tests of nested models. Two examples, including a two-group analysis of attitudes towards science.
Clogg and Goodman (1985): A version of Clogg and Goodman (1984) for a sociological audience. Comments on the analogies with linear factor analysis. Extended example with a two-group analysis where a homogenous model is selected.
Clogg and Goodman (1986): The ideas of Clogg and Goodman (1984) applied to certain measurement models which are all probabilistic extensions of the deterministic Guttman model.
Dayton and Macready (1988): Latent class model with general covariates, i.e. generalising the multiple-group model of Clogg and Goodman (1984) to allow also continuous covariates. Assumes measurement equivalence throughout. Proposes a simplex algorithm for estimation, and uses standard tests for assessment of (absolute and relative) model fit. An example with a two-class model for binary test items on mathematics, with sex and an ability test score as covariates.
McCutcheon (1996): Binary or ordinal items, latent class and a group variable, where the items and the latent class variable are treated as ordinal and fixed scores are assigned to their levels. The group variable may be treated as ordinal in the same way, or as nominal. The measurement models are Goodman’s linear-by-linear association models; these are thus equivalent to adjacent-categories ordinal logistic models for the items, given latent-class scores (treated as interval-level variables) and the group variable (treated as categorical) (this part is similar to the models in Clogg (1988), with the addition of the group variable). Cases with and without measurement equivalence (paper uses terms homogenous/heterogenous [measurement models]) are considered, with nice comments on the (lack of) meaning of the results if these models are too heterogenous. The models for latent class given group are Goodman’s association (row-, column- or row-and-column-effect) models. Example of US attitudes to abortion (4 binary items), with time as the group variable. A homogenous measurement model is selected (using L2 and BIC).
Bandeen-Roche et al. (1997): Latent class model with covariates which may also be continuous, i.e. the case considered by Dayton and Macready (1988). Proposes an EM algorithm for estimation. Explicitly assumes measurement equivalence (in their terms, “nondifferential measurement”). Notes that when it holds, the model marginalised over the distribution of the covariates is also a latent class model with the same number of classes, so selection of the number of classes can be done ignoring the covariates. Examples with binary items of self-reported health, including one where the measurement equivalence assumption fails. Several methods of model assessment, including one where (i) a latent class is assigned to each unit at random, based on the posterior class probabilities from a fitted model, and (ii) conditional independence of the items from each other and from the covariates is examined, treating the assigned class as known. This is thus essentially a version of the conditional item bias methods discussed in Section 3, but using an initial latent class model to create the conditioning variable (a proxy for the true value of the latent class).
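The EM iteration for latent class models alternates between computing posterior class probabilities (E-step) and updating the class and item-response probabilities (M-step). A minimal sketch for a two-class model with binary items and no covariates, using invented data (this simplifies away the covariate part of the model above):

```python
import numpy as np

def lca_em(X, n_classes=2, n_iter=200, seed=0):
    """Fit a latent class model to binary items X (n units x p items) by EM.

    Returns class probabilities, item-response probabilities P(x=1 | class),
    and the log-likelihood trace (which should be non-decreasing).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)          # class probabilities
    theta = rng.uniform(0.25, 0.75, (n_classes, p))   # item probabilities
    loglik_trace = []
    for _ in range(n_iter):
        # E-step: log joint density of each unit's pattern under each class
        logf = (X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
                + np.log(pi))
        m = logf.max(axis=1, keepdims=True)
        lik = np.exp(logf - m)
        post = lik / lik.sum(axis=1, keepdims=True)   # posterior class probs
        loglik_trace.append(float((m.ravel() + np.log(lik.sum(axis=1))).sum()))
        # M-step: posterior-weighted updates of class and item probabilities
        nk = post.sum(axis=0)
        pi = nk / n
        theta = np.clip((post.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, loglik_trace

# Invented data: two latent classes with high vs. low endorsement rates.
rng = np.random.default_rng(1)
z = rng.integers(0, 2, 500)
probs = np.where(z[:, None] == 0, 0.8, 0.2)
X = (rng.random((500, 4)) < probs).astype(float)
pi, theta, trace = lca_em(X)
```

As usual with EM, the log-likelihood is guaranteed not to decrease across iterations, which is a useful check on the implementation.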
Kankaras et al. (2011): Overview of multigroup latent class models and their use for assessing measurement invariance. Considers also models with ordinal items or latent classes. Two illustrative examples using cross-national survey data.
Andersen (1980): Multiple-group latent trait model in the simplest possible case: (1) binary items, (2) Rasch-type model with a normally distributed latent trait, (3) item parameters equal across groups and assumed known. Estimation of latent means and variances, and (likelihood ratio) testing of their equality across groups. Also considers the longitudinal case with 2 occasions, which includes a correlation of the values of the trait between the two occasions.
Muthen and Christoffersson (1981): Binary items, multivariate normal latent traits, and a group variable. The measurement model is formulated using a continuous latent variable (normally distributed, hence probit measurement models) underlying the binary items, so it involves separate intercepts, loadings and residual variances (with identifiability issues discussed). Models with various levels of cross-group invariance in these, and cross-group comparisons of the distributions of the latent traits. Estimation with GLS, using one- and two-way margins of the observed contingency table (i.e. through estimation of polychoric correlations). Three examples of increasing complexity, involving 1 or 2 latent traits (selection using significance tests only). In the last example, measurement equivalence holds for one trait but not the other; the conclusion is not to compare the distributions of the latter between groups (i.e. it is effectively treated as a different variable in different groups).
Muthen and Lehman (1985): The models of Muthen and Christoffersson (1981) in an IRT context of item bias. Considers the special case of that model where the residual variances of the underlying variables may differ between groups, but intercepts and loadings do not (comments on analogous factor analysis models for continuous items). This corresponds to a (probit) measurement model for a binary item which differs between two groups in that both intercept and loading are multiplied by the same factor (i.e. the item characteristic curve is flattened or steepened by this factor). Then compares means and variances of the latent trait between groups, after allowing for this difference in measurement models (and implicitly argues that comparability is then unproblematic).
Kelderman (1989): Group comparisons and differential item functioning (item bias) in an educational testing context. Binary items, for which a Rasch model is assumed (i.e. a model with equal loadings for all items). In this case the observed total score (T) is a sufficient statistic for the latent variable, and the model can be fitted as a quasi-independence loglinear model for the contingency table involving T and the items (i.e. this avoids the explicit use of latent variables, a computational trick which is less important nowadays). The group variable (G) is then added, and group differences in ability and/or item functioning are specified as interactions involving G in the log-linear model. Model assessment using global and nested likelihood ratio tests. Also considers the case where the group variable may be latent, which leads to a latent class model [whereas we consider only cases where it is observed; allowing it to be latent seems to me to lead to rather overgeneral models and possibly problems with identifiability and estimation].
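The sufficiency of the total score under the Rasch model can be checked numerically: the conditional probability of a response pattern given its total score does not depend on the latent trait value, because the factor exp(T*theta) cancels between numerator and denominator. A small sketch with invented item difficulties:

```python
import numpy as np
from itertools import product

def pattern_prob(pattern, theta, b):
    """Rasch probability of a binary response pattern at trait value theta."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return float(np.prod(np.where(np.array(pattern) == 1, p, 1 - p)))

def conditional_prob(pattern, theta, b):
    """P(pattern | total score): ratio to all patterns with the same total."""
    t = sum(pattern)
    denom = sum(pattern_prob(q, theta, b)
                for q in product([0, 1], repeat=len(b)) if sum(q) == t)
    return pattern_prob(pattern, theta, b) / denom

# Invented difficulties; the conditional probability is the same at any
# theta, which is exactly why the total score is sufficient for the trait.
b = np.array([-1.0, 0.0, 0.5, 1.5])
pat = (1, 1, 0, 0)
c0 = conditional_prob(pat, theta=0.0, b=b)
c1 = conditional_prob(pat, theta=2.0, b=b)
```

This conditional distribution, free of the latent variable, is what the loglinear formulation exploits.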
Mellenbergh (1989): A good overview of methods related to item bias [DIF] in educational testing at the time of writing. Definition of item bias. Contrasts unconditional and conditional (IRT) methods of detection, and discusses mostly the latter. Methods mostly somewhat old, in particular joint modelling combined with testing mentioned only briefly. Instead, (i) methods which fit models separately for groups but with estimated difficulty parameters standardised within each group, followed by cross-group equality testing of parameters; (ii) indices of fit based on fitted parameters and item characteristic curves. Comments on approaches to explanation of item bias. Recommends removing biased items to create “pure” measures of ability.
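The conditional approach can be illustrated in an observed-score form (a sketch with invented data, not any specific procedure from the paper): regress an item on a matching score, and test with a likelihood ratio statistic whether adding the group variable improves the fit, which would indicate uniform DIF.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def logit_negloglik(beta, X, y):
    """Negative log-likelihood of a logistic regression (stable form)."""
    eta = X @ beta
    return float(np.sum(np.logaddexp(0.0, eta) - y * eta))

def fit_logit(X, y):
    """Maximum likelihood fit; returns coefficients and log-likelihood."""
    res = minimize(logit_negloglik, np.zeros(X.shape[1]), args=(X, y),
                   method="BFGS")
    return res.x, -res.fun

# Invented data: an item with uniform DIF (the group shifts the intercept).
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)
score = rng.normal(0, 1, n)                 # stand-in for the matching score
eta = -0.2 + 1.0 * score - 0.8 * group      # the group effect is the DIF
y = (rng.random(n) < 1 / (1 + np.exp(-eta))).astype(float)

X0 = np.column_stack([np.ones(n), score])          # no-DIF model
X1 = np.column_stack([np.ones(n), score, group])   # uniform-DIF model
_, ll0 = fit_logit(X0, y)
_, ll1 = fit_logit(X1, y)
lr_stat = 2 * (ll1 - ll0)
p_value = chi2.sf(lr_stat, df=1)
```

With these invented parameters the group term is clearly needed, so the test should reject the no-DIF model; a non-uniform-DIF version would add a score-by-group interaction term in the same way.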
Kelderman and Macready (1990): Group comparisons and differential item functioning in an educational testing context. Considers various possible cases. First, when the latent variable is continuous, the approach is as in Kelderman (1989). Second, when it is categorical, a latent class model is used (analogously to Clogg and Goodman 1984). The group variable may be observed or latent (the latter case involving two categorical latent variables, if ability is also categorical).
Millsap and Yun-Tein (2004): The paper sets out the conditions of measurement invariance for the case of ordinal observed variables. The modelling of the ordinal responses is based on the underlying-variable approach, and issues of identification and model specification are discussed in detail. The analysis is done with LISREL and Mplus, and their parametrisations of the multiple-group underlying-variable model are compared.
Drasgow and Probst (2005): Describes the use of IRT models to assess measurement equivalence in the context of adapting educational and psychological tests into multiple languages and cultures. (A chapter in Hambleton et al. (2005).)
De Jong et al. (2007): Overview of multigroup CFA, with a focus on cross-national consumer research. Similarly, an overview of IRT models for ordinal items, and of assessing measurement invariance for them. The paper proposes a multilevel (random-effects) model for both the latent variable and the measurement model parameters. There are then no invariant items, but all are embedded in this model (there are still identifiability constraints, but they are less transparent than usual). Estimation with MCMC and model assessment using Bayes factors and other tools. Example with 11 countries and 8 items.
Soares et al. (2009): DIF in 3-parameter logit models (although DIF for guessing parameters is not considered) for binary items. Sets up models where magnitude of DIF in difficulty and/or discrimination parameters may itself depend on explanatory variables. Probability of the presence of DIF may also be an estimable parameter (even for all items, in which case informative prior distributions are required for identification). Bayesian model formulation and estimation through MCMC methods. Simulation and an example using educational testing data.
Woods (2009): Latent trait models for binary and ordinal items, in a psychological testing context. Considers testing for DIF between 2 groups with two approaches, referred to as MIMIC and two-group methods. These amount to models where, when DIF is present, only the intercept (difficulty) parameters (MIMIC approach) or both the intercept and slope (discrimination) parameters (two-group approach) vary between groups. These two cases are referred to as uniform and non-uniform DIF respectively. In both cases, comparison to non-DIF models is done using LR tests. Simulation studies compare the power of the tests and the bias in estimated item parameters between the two approaches.
Fox and Verhagen (2011): Multigroup IRT (probit) models for binary items. A linear multilevel (random effects) model is specified for the latent trait given covariates, and also random effects models for the parameters (difficulty and discrimination) of the measurement models. Estimation is with MCMC methods. A simulation study and an application to educational testing data (from the PISA programme) with 40 countries and 8 items.
Janssen (2011): A concise overview of the use of IRT modelling to investigate Differential Item Functioning, especially in educational and psychological testing.
Stegmueller (2011): Multigroup IRT model with ordinal logistic measurement models, individual-level covariates, and random-effects model for country differences, with a discrete random effect. Non-equivalence of measurement is represented by between-country variation in the intercepts (but not loadings) of the measurement models. 12-country example on preferences for social spending. Interpretation is based on the model with maximal amount of non-equivalence, with no discussion of possible disadvantages of this approach.
Bejar (1980): Effectively a sensitivity analysis of the effect of incorrectly assumed measurement equivalence on estimates of latent trait means in a one-trait model for binary items with a probit measurement model. The calculations are done under a simple model for two groups where within each group the measurement model is the same for every item, and the true latent means and variances are the same across groups. Closed-form results are then presented for the bias in the between-group mean difference when the true discrimination parameters (loadings) are not equivalent across the groups. The results are not quite correct (since they seem to assume (details are not given) that the estimates from the joint model would be equal to the parameters in one of the groups) but may be suggestive of the magnitudes of the biases.
De Beuckelaer and Swinnen (2011): Extensive simulation study of the effect of ignoring measurement noninvariance on conclusions about factor means in linear factor analysis. In the simulation design, always 2 groups, and non-invariance in (possibly) 1 of 3-4 indicators for 1 latent factor. Various other dimensions of the design are varied across the simulations. The focus is on the correctness of conclusions from a significance test of no between-group difference in latent means. Classification tree analyses are used to summarise the results, i.e. which design factors had the most effect on the sensitivity of the conclusions. The broad finding is that conclusions from the test are quite sensitive to incorrect assumptions about measurement invariance, under a wide range of design conditions.