Cardall and Coffman (1964): [Abstract only] An early example of an unconditional method of assessing item bias in educational testing. Guessing from the abstract: an analysis of variance model with group and item as factors and (transformed) difficulty (proportion answering correctly?) as the response, testing for an item-group interaction (i.e. the between-group difference is not the same for all items), which would suggest item bias under a Rasch-type model.
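A plausible form of such a model (my reconstruction from the abstract, not taken from the paper): with $y_{gi}$ the transformed difficulty of item $i$ in group $g$,

\[
y_{gi} = \mu + \alpha_g + \beta_i + (\alpha\beta)_{gi} + \varepsilon_{gi},
\]

where no item bias corresponds to $(\alpha\beta)_{gi} = 0$ for all $g$ and $i$. Under a Rasch-type model the group and item effects should be additive on a suitable scale, so a nonzero interaction means the group difference varies from item to item.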
Angoff and Ford (1973): Assessing item bias in two groups (whites and blacks) by examining the between-group correlation of the observed (transformed) difficulties of a large number of test items. This is analogous to the unconditional ANOVA approach of looking for item-group interactions (e.g. Cardall and Coffman 1964). Here, however, there is also a step in the direction of approximate conditional methods, in that the groups are also matched on the distribution of another aptitude test (verbal when considering item bias in a mathematics test, and vice versa), which can be seen as partially controlling for ability.
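For concreteness, the transformation used in this line of work is Angoff's delta scale (a standard fact about the method rather than something stated above): the proportion correct $p_i$ for item $i$ is mapped to

\[
\Delta_i = 13 + 4\,\Phi^{-1}(1 - p_i),
\]

where $\Phi^{-1}$ is the standard normal quantile function, so easier items get smaller deltas. Each item is then plotted as a point ($\Delta_i$ in group 1, $\Delta_i$ in group 2), and items lying far from the major axis of the scatter are flagged as potentially biased.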
Scheuneman (1979): A conditional method of assessing item bias in binary items, for a two-group comparison. The total score for a scale (including the item being studied), grouped into intervals, is used as a proxy for ability. The test of item bias is then carried out using a statistic resembling a Pearson chi-squared test of independence of group and correctness of response, conditional on ability group (but not quite: it uses only the contributions of the correct responses to the chi-squared statistic; see below).
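A sketch of the statistic, on my reading of the description above: with the total score grouped into $J$ intervals, let $O_{gj}$ be the observed number of correct responses in group $g$ and interval $j$, and $E_{gj}$ the number expected if correctness were independent of group within the interval. The statistic is

\[
\chi^2_S = \sum_{j=1}^{J} \sum_{g=1}^{2} \frac{(O_{gj} - E_{gj})^2}{E_{gj}},
\]

i.e. the Pearson statistic restricted to the correct-response cells. Omitting the incorrect-response cells is what makes it only resemble a chi-squared test: the statistic then lacks an exact chi-squared null distribution.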
Mellenbergh (1982): Clear explanation of the models for total-group-item tables, where “total” is a grouped value of a total score (a proxy for latent ability, including or excluding the item under study) and group-item associations are used to study item bias. Formulates this as a log-linear and (equivalently) a logistic model. Distinguishes between uniform (a main effect of group on the item only) and nonuniform (a group-ability interaction as well) item bias.
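In the logistic formulation the models can be written as follows (a sketch consistent with the description above, treating the grouped score $X$ as a single covariate and $G$ as a group indicator; the grouped-score version would use categorical score levels instead):

\[
\begin{aligned}
\text{no bias:} \quad & \operatorname{logit} P(\text{correct}) = \beta_0 + \beta_1 X,\\
\text{uniform bias:} \quad & \operatorname{logit} P(\text{correct}) = \beta_0 + \beta_1 X + \beta_2 G,\\
\text{nonuniform bias:} \quad & \operatorname{logit} P(\text{correct}) = \beta_0 + \beta_1 X + \beta_2 G + \beta_3 X G,
\end{aligned}
\]

so uniform bias shifts the item response curve by the same amount at every ability level, while nonuniform bias lets the shift depend on ability.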
Drasgow (1982): Uses a numerical example to illustrate that the method of comparing, across groups, the correlation (or regression) between a total test score and a criterion (another measure of ability) may not be a very sensitive assessment of non-equivalence of measurement.
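In notation of my own choosing, the comparison in question asks whether the group-specific regressions of the criterion $Y$ on the total score $X$,

\[
E(Y \mid X, \text{group } g) = a_g + b_g X,
\]

have (approximately) equal intercepts and slopes across groups. Drasgow's illustration suggests that near-equality of these regressions can coexist with substantial item-level non-equivalence, which is why this check is a weak test of measurement equivalence.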
Sireci et al. (2005): Review of methods for assessing measurement equivalence when adapting educational and psychological tests into multiple languages and cultures. The focus is on methods which do not involve latent variable models, although those (in the form of IRT models) are also mentioned. (A chapter in Hambleton et al. (2005).)