Johnson (1969) [Not read]: Application of the target rotation method of Meredith (1964b). Data on educational abilities and aptitudes for subjects from Rhodesia and Zambia.
Rock and Freeberg (1969) [Not read]: Application of the target rotation method of Meredith (1964b). Data on a set of biographical information questions administered to students at three grade levels.
McGaw and Joreskog (1971): Multi-group CFA for 12 aptitude and achievement tests (with 4 factors), for groups defined by intelligence and socioeconomic status. The fitted model assumes measurement equivalence in item means and loadings (but not in residual variances), while factor means (as well as covariance matrices) vary across groups (so the specification is somewhat different from Joreskog (1971), which this paper draws on). The model fitted reasonably well. Even though factor means are included as parameters of the model, they are not estimated as part of the parameter estimation but in postestimation, as group means of factor scores. Discussion of comparisons of factor means and covariance matrices between the groups.
Bechtoldt (1974): Essentially a demonstration that multiple-group factor analysis (as in Joreskog (1971)) works in practice. Applies it to old data from Thurstone, which involves two distinct random samples from the same population, and shows that invariance of factor loadings can be concluded.
Drasgow and Kanfer (1985): Multi-group CFA used to assess measurement equivalence in a scale for job satisfaction (16 items, formed from 54 original items; 4-factor model). Two studies, both within the same country (US) and same language: (i) with five relatively homogeneous subpopulations within the same industry (5 different hospitals) and (ii) with samples from three different occupations. Non-equivalence was minimal in study (i) and small in (ii).
Hood et al. (2001): Two-group (US and Iran) CFA for an 8-item, 3-factor Mysticism scale. Includes conventional measurement equivalence analysis.
Ghorbani et al. (2002): 2-country comparison (US and Iran). 3-factor CFA model for 12 items (each sum scores of original items), related to emotional intelligence. Assessment of measurement invariance using standard stepwise factor-analysis procedure. The final model has partial measurement invariance.
Mannetti et al. (2002): Need for [Cognitive] Closure Scale, considered in 4 countries. Five dimensions, with 3 Likert items for each (reduced from an initial total of 42 items). Five-factor factor analysis models, and also second-order factor models with 2 second-order factors. Similarity of factor structures (configural equivalence) first assessed by fitting models separately in the countries. Measurement equivalence then examined using multiple-group models. Model selection relies mostly on standard fit indices, which mostly support equivalence (whereas tests of nested models indicate lack of equivalence).
Brown (2003): A 16-item worry questionnaire. A two-factor model typically fits, corresponding to the 11 positively and 5 negatively worded items. Research question is whether this represents two substantive factors or one factor plus a method effect. Author prefers the latter, and captures it using a 1-factor CFA model with error correlations among all pairs of the positively worded items (instead of including one method factor, surprisingly). Routine two-group (men vs. women) CFA, which concludes measurement equivalence but difference in factor means.
MacPherson and Myers (2004): Example of multigroup CFA. 3 groups (white, English-speaking and Spanish-speaking Mexican) x gender; 8-item tobacco-related beliefs scale, apparently in English to all groups. Underlying-variable factor analysis for binary items, with smoking behaviour as covariate. Non-equivalence between some groups and between genders within some groups.
Mathisen et al. (2006): Multilevel confirmatory factor analysis model implemented as separate models for between and within-group covariance matrices. Standard multigroup CFA for these two models, including assessment of measurement invariance between them.
Stein et al. (2006): Example of multigroup factor analysis. 13-item (in the end 10 used, as construct validity questionable for 3) Sense of coherence scale, 3 ethnic groups. For the 10 items, measurement equivalence is found to hold.
Davidov et al. (2008): 21 items capturing Schwartz’s 10 ‘basic human values’, in 20 countries, using European Social Survey data from 2002. Begins with 20 separate confirmatory factor analyses, followed by a joint model - in both stages lack of discriminant validity is found between certain pairs of factors, which are then merged. The interim joint model comprises seven factors, with (theoretically plausible, suggested by modification indices) cross loadings for five of the 21 items. Item loadings can be constrained to be equal across all countries whilst retaining adequate model fit. Constraining item intercepts to be invariant across all countries compromises model fit too much; thus the model does not permit comparisons of all seven values factors over all 20 countries. However, there is evidence that ‘partial scalar invariance’ may be found with some further digging, i.e. enabling comparisons of mean levels of sub-sets of values factors across sub-sets of countries.
Davidov et al. (2008): Testing the association between two of Schwartz’s values and attitudes towards immigration in nineteen countries, using ESS data. First test for metric invariance, i.e. constraining factor loadings to be equal across all countries, for values items. Find this model an acceptable fit, but go forward with larger model suggested by modification indices, which includes various cross loadings and inequivalences (‘partial metric invariance’), but which does not change the indices of close fit substantially. Secondly, find clusters of countries by effect types (directions) and sizes of values on attitudes, by constraining pairs of most similar effects (akin to a backward elimination strategy from the full model with separate effects country by country) to be equal and using a chi-squared test to decide whether this significantly worsens model fit.
Davidov (2009): Testing discriminant validity and measurement equivalence of two dimensions of national identity, using ISSP 2003 national identity module questions in 34 countries. The two factors are deemed to be empirically distinct from each other, but cross loadings are required in some countries. Configural and metric invariance (but not scalar invariance) concluded to hold across countries. For assessment, uses modification indices, chi-squared global fit, pclose, RMSEA, SRMR, CFI and correlation between factors.
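Several of the entries above (e.g. Mannetti et al. (2002), Davidov (2009)) weigh chi-squared difference tests of nested invariance models against descriptive fit indices such as RMSEA and CFI. The following is a minimal sketch of how these quantities follow from the chi-squared statistics of fitted models. It is not taken from any of the cited papers: all function names are hypothetical, all numbers are invented for illustration, and the RMSEA shown is the single-group point estimate (some software applies a multi-group correction).

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for X ~ chi-squared(df).

    Uses the power series for the regularized lower incomplete gamma
    function; adequate for moderate x, not for extremely small p-values.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    lower = math.exp(-z + a * math.log(z) - math.lgamma(a + 1.0)) * total
    return min(1.0, max(0.0, 1.0 - lower))

def lr_difference(chi2_restricted, df_restricted, chi2_free, df_free):
    """Chi-squared difference test of a restricted model nested in a freer one."""
    d_chi2 = chi2_restricted - chi2_free
    d_df = df_restricted - df_free
    return d_chi2, d_df, chi2_sf(d_chi2, d_df)

def rmsea(chi2, df, n_obs):
    """Single-group RMSEA point estimate."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n_obs - 1)))

def cfi(chi2, df, chi2_base, df_base):
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chi2 - df, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

# Invented (chi-squared, df) pairs for illustration only.
configural = (210.0, 120)   # same factor structure, all parameters free
metric = (236.0, 135)       # loadings constrained equal across groups
baseline = (4200.0, 165)    # independence model
n_obs = 3000

d_chi2, d_df, p_value = lr_difference(*metric, *configural)
delta_cfi = cfi(*metric, *baseline) - cfi(*configural, *baseline)
print(d_chi2, d_df, p_value)
print(rmsea(*metric, n_obs), delta_cfi)
```

In this made-up case the difference statistic (26 on 15 degrees of freedom) exceeds the 5% critical value of about 25.0, while CFI changes by less than 0.01. This is the kind of disagreement between nested-model tests and fit indices that the Mannetti et al. (2002) entry describes.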
Hogan et al. (1993): Patterns of exchanging social support between parents and adult children. US survey data, with 8 binary items. Latent class analysis, with 4 classes (selection using overall LR test, index of dissimilarity and BIC; plus in early analyses a fifth, known class for children with coresident parents). No differences in measurement probabilities between any groups were thus considered. Assigned (modal) classes were then used as response variables in multinomial logistic models with group variables (and others) as explanatory variables.
Liebler and Sandefur (2002): Patterns of exchanging social support with non-kin, comparing men and women. US survey data, with 6 binary items. First, latent class modelling was done jointly and differences in classes and measurement probabilities were tested and not found (details of this are not given). The rest of the analysis, however, is done separately for men and women. Models thus have different measurement models and even the interpretation of one of the classes is different in the two groups. Model assessment using overall LR test, BIC and index of dissimilarity. Finally, assigned (modal) classes derived from the latent class models are used as response variables in (multinomial logistic) regression models, which also include other group variables as predictors. The structure of the analysis is (apart from the separate analyses for men and women) very similar to Hogan et al. (1993).
Kankaras and Moors (2009): Data on solidarity attitudes from European Values Study, with 33 countries, and with 10 ordinal items (with 5 levels each) which are regarded as measures of 3 underlying latent variables. These are modelled using a multigroup version of the latent class factor model of Magidson and Vermunt (2001), where the latent variables are ordinal discrete variables with specified numbers of levels (here 2 for each factor). The models are compared using AIC and LR tests, considering equivalence models and models with non-equivalence in measurement intercepts (direct effects) or also in measurement loadings (interactions). The selected model has 7 out of the 10 direct effects. It is argued that this is the best model for interpretation. A sensitivity analysis compares estimated factor means under this and other models (with equivalence or interactions also). Also a sensitivity analysis of omitting one country outlier.
Kankaras and Moors (2011): Data on perceptions of the morality of others from European Values Study, with 33 countries, and with 8 items (with 4 levels each) which are treated as nominal in the analysis. These are modelled using a multigroup version of the latent class factor model of Magidson and Vermunt (2001), where the latent variables are ordinal discrete variables with specified numbers of levels. Here there are 2 such factors (with 2 levels each), one of which is interpreted as the content factor (the attitude) and one an extreme response pattern factor. The models are compared using LR tests and BIC. The selected model has non-equivalence in measurement intercepts. It is argued that this is the best model for interpretation. A sensitivity analysis compares estimated factor means under this and other models.
Siegers (2011): Multigroup latent class modelling of data on religious orientations (5 items, 11 countries, 5 classes in the final model). Model selection mainly with BIC. The final model has non-equivalence in some items in some classes, but in quite interpretable ways.
Kankaras and Moors (2012): European Values Study data on 4 different scales. Compares native residents of Luxembourg, immigrants in Luxembourg from 5 countries, and native residents of those 5 countries (controlling also for age and sex), with the aim of assessing relative importance of cultural background and national context of the country of residence. These are modelled using a multigroup version of the latent class factor model of Magidson and Vermunt (2001), where the latent variables are ordinal discrete variables with specified numbers of levels. Model selection mostly using BIC. Measurement models are mostly equivalent within Luxembourg but not between countries (the language(s) used for the within-Luxembourg survey are not mentioned).
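The latent class entries above repeatedly rely on three computations: BIC and the index of dissimilarity for model assessment, and modal class assignment for the follow-up regressions (Hogan et al. (1993), Liebler and Sandefur (2002)). A minimal sketch for binary items is given below. The function names and all parameter values are hypothetical and chosen only for illustration; the BIC shown is the log-likelihood form, whereas some latent class software reports a variant based on the LR statistic and its degrees of freedom.

```python
import math

def posterior_classes(y, class_probs, item_probs):
    """Posterior class membership probabilities for one binary response pattern.

    class_probs[c] is the prior size of class c; item_probs[c][j] is the
    probability of a positive response to item j within class c.
    Items are assumed conditionally independent given the class.
    """
    joint = []
    for pi_c, p_c in zip(class_probs, item_probs):
        lik = pi_c
        for y_j, p_cj in zip(y, p_c):
            lik *= p_cj if y_j == 1 else 1.0 - p_cj
        joint.append(lik)
    total = sum(joint)
    return [v / total for v in joint]

def modal_class(y, class_probs, item_probs):
    """Index of the class with the highest posterior probability."""
    post = posterior_classes(y, class_probs, item_probs)
    return max(range(len(post)), key=post.__getitem__)

def bic(loglik, n_params, n_obs):
    """Log-likelihood form of BIC; smaller values are preferred."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def dissimilarity_index(observed, expected):
    """Share of observations that would have to change cells for a perfect fit."""
    n = sum(observed)
    return sum(abs(o - e) for o, e in zip(observed, expected)) / (2.0 * n)

# Invented two-class, three-item measurement model.
class_probs = [0.6, 0.4]
item_probs = [[0.9, 0.8, 0.7],
              [0.2, 0.1, 0.3]]

print(modal_class([1, 1, 1], class_probs, item_probs))
print(modal_class([0, 0, 0], class_probs, item_probs))
print(dissimilarity_index([60, 40], [50, 50]))
```

Using modal classes as observed responses in a later regression ignores the classification uncertainty carried by the posterior probabilities, which is worth keeping in mind when reading the two-step analyses summarised above.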
Ellis et al. (1989): Begins with a general discussion of DIF, especially due to question translation, and IRT. Analysis of binary items on attitudes to mental health, comparing French and German samples. DIF analysis not using a joint model but with an older approach by Stocking and Lord (1982), which fits the model separately in the two groups and then attempts to link their latent scales. Discusses possible language difference to explain DIF in some items.
Miller (1998): This is the main summary paper on the measurement of scientific ‘literacy’. Pools data from Europe and the US, applies three-parameter latent trait models, and assesses country differences according to mean scores on the latent trait. No formal evaluation of measurement equivalence, which is assumed without comment. Item sets are not identical for the two samples; all items in each country are used to derive the factor scores, with the common items linking the samples.
Carle (2009b): Measures of alcohol dependence, 27 items, large US sample, comparing Hispanics vs. non-Hispanic whites. Items treated as ordinal, estimation through polychoric correlations (and model assessment using linear factor analysis fit indices). Model with partial equivalence, where 9 of the intercept (threshold) parameters vary between the groups. Comments on the between-group difference of latent means with this measurement model vs. when assuming full equivalence.
Carle (2009a, B): Very similar to Carle (2009b), except for different set of items and inclusion of data from a second period (with different items, so no between-period comparison).
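For the IRT entries above, the following sketch shows the three-parameter logistic item response function (the model family mentioned for Miller (1998)) and what DIF means in its terms: respondents with the same latent trait value have different response probabilities depending on group, because an item parameter differs between groups. The function names and parameter values are hypothetical and invented for illustration; none of them come from the cited papers.

```python
import math

def p_3pl(theta, a, b, c):
    """Three-parameter logistic model: probability of a positive response.

    a = discrimination, b = difficulty, c = lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def dif_gap(theta, ref_params, focal_params):
    """Focal-minus-reference response probability at the same trait level."""
    return p_3pl(theta, *focal_params) - p_3pl(theta, *ref_params)

# Invented item: same discrimination and guessing in both groups, but the
# item is harder (larger b) in the focal group, i.e. uniform DIF.
reference = (1.2, 0.0, 0.2)
focal = (1.2, 0.5, 0.2)

for theta in (-1.0, 0.0, 1.0):
    print(theta, dif_gap(theta, reference, focal))
```

If such an item is treated as equivalent, the difference in difficulty is absorbed into the estimated difference in latent means, which is the issue behind the comparisons of latent means under partial versus full equivalence in the Carle (2009b) entry.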
__________________________________________________________________