Assessment in Cross-Cultural Psychology Research Paper

Some say that the world is shrinking. We know that worldwide cable news programs and networks, the Internet, and satellites are making communications across cultures and around the world much easier, less expensive, and remarkably faster. Cross-cultural psychology studies the psychological differences associated with cultural differences. (In a strictly experimental design sense, in much of cross-cultural research, cultures typically serve as independent variables and behaviors of interest as dependent variables.) At one time, such research implied crossing the borders of countries, and generally, it still may. However, as countries become more multicultural due to immigration (made much easier in Europe with the advent of the European Union, or EU), different cultures may exist within a country as well as in differing countries. Research in psychology, too, has been affected by these worldwide changes.

The world context has also shifted. The cultural makeup of the United States is certainly changing rapidly; recent U.S. Census Bureau analyses indicate the rapid increase in ethnic minorities, especially Hispanic Americans and Asian Americans, to the extent that the historical European American majority in America is likely to become a minority group within the next decade or so (see Geisinger, 2002, or Sandoval, 1998, for an elaboration of these data). While the United States is experiencing population changes with a considerable increase in groups traditionally identified as ethnic minorities, so have other parts of the world experienced these same population shifts. Many of these changes are occurring as “the direct consequence of cross-immigration and globalization of the economy” (Allen & Walsh, 2000, p. 63). Cultures change due to population changes caused by immigration and emigration, birth and death rates, and other factors, but they also change due to historical influences apart from population shifts. Countries that have suffered famine, aggression on the part of other nations, or other traumatic changes may experience significant cultural transformations as well as population changes. Most psychologists have studied human behavior within a single, broad culture, sometimes called Euro-American culture (Moreland, 1996; Padilla & Medina, 1996). In more recent years, psychologists have begun to recognize the importance of culture; in 1995 the American Psychological Association (APA) began publishing the journal Culture and Psychology. Such an event may be seen as an indication of the increased recognition of the importance of culture in psychology.

According to Berry (1980), cross-cultural psychology seeks to explore the relationships between cultural and behavioral variables. He included ecological and societal factors within the realm of cultural variables. Likewise, he included as part of behavioral variables those that must be inferred from behavior, such as personality, attitudes, interests, and so on. The variables that are studied in cross-cultural psychology, of course, must be measured and they have traditionally been examined using standardized tests, interviews, and a variety of formal and informal assessments. In fact, Triandis, Malpass, and Davidson (1971) describe cross-cultural psychology as follows: “Cross-cultural psychology includes studies of subjects from two or more cultures, using equivalent methods of measurement, to determine the limits within which general psychological theories do hold, and the kinds of modifications of these theories that are needed to make them universal” (p. 1; also cited in Berry, 1980, p. 4). The previous statement emphasizes the need for equivalent methods of measurement. If the findings of cross-cultural psychology are to have validity, then equivalent measurement is required. This point is the general theme of this research paper. It is argued that to note cross-cultural differences or similarities in terms of psychological variables and theories, one must have confidence that the measures used in research are equivalent measures in each culture.




When one is comparing two cultures with respect to a psychological or other variable, there are a number of factors that can invalidate the comparison. For example, if one selects well-educated individuals from one culture and less well-educated persons from the second culture, the comparison is likely to be flawed. (See van de Vijver & Leung, 1997, for a more complete listing and explanation of these confounding factors.) However, one of the primary sources of invalidity, one that is often not understood as easily as the previous sampling example, relates to measurement instruments. When the two cultures to be compared do not employ the same language to communicate and have other cultural differences, the measures that are used in the comparison must of necessity be somewhat different. A number of the possible options are discussed and evaluated in this research paper. Depending upon the use and other factors, cross-cultural psychologists and psychologists dealing with testing issues in applied use are provided with some strategies for solving the dilemmas that they face. Language is generally not the only disparity when making cross-cultural comparisons. Cultural differences in idioms, personal styles, experiences in test taking, and a plethora of other variables must also be considered in making cross-cultural or multicultural assessments. These factors are often more subtle than language differences.

Thus, there are theoretical reasons to examine testing within cross-cultural psychology. There are also applied reasons that testing is important in cross-cultural settings. Obviously, the applied uses of tests and assessment across cultures must rely on the theoretical findings from crosscultural psychology. Tests and assessment devices have been found to have substantial validity in aiding in various kinds of decision making in some cultures. Thus, other cultures may wish to employ these measures, or adaptations of them, in what appear to be similar applied settings in cultures other than those where they were first developed and used.

Test development, test use, and other psychometric issues have long held an important role in cross-cultural psychology. Berry (1980) differentiated cross-cultural psychology from many other areas of psychology, and aligned it closely to measurement and methodology by reflecting that “most areas of psychological enquiry are defined by their content; however, cross-cultural psychology is defined primarily by its method” (p. 1; emphasis in the original). The words testing and assessment are used interchangeably by some psychologists, but differentially by others. When they are distinguished, testing involves the administration and scoring of a measurement instrument; assessment, on the other hand, is a broader term that includes score and behavioral interpretation in the context of the culture or the individual being evaluated.

Some Basic Distinctions

Before beginning a formal discussion of testing in cross-cultural psychology, a few fundamental features must be differentiated. When one uses measures in two cultures, one engages in cross-cultural work. On the other hand, when one studies the various subcultures within a country, such as the United States of America, then one performs multicultural analyses. The distinction is somewhat more complex, however, and this demarcation is described in the next section. Next, one of the fundamental distinctions in cross-cultural psychology—the concepts of etic and emic—is described; these terms in some ways parallel the distinction between cross-cultural and multicultural analyses. In brief, etic studies compare a variable across cultures whereas emic studies are performed within a single culture. Finally, a distinction between the uses of tests in relatively pure research as opposed to the testing of individuals in real-life, often high-stakes decisions is described.

Cross-Cultural and Multicultural Psychology

The differences between cross-cultural and multicultural are not entirely clear. Allen and Walsh (2000), for example, make the distinction that use of tests across cultural groups (“often internationally”) is a cross-cultural application of tests whereas the use of tests with individuals of differing ethnic minority (or perhaps cultural) status within a nation is a multicultural application of tests (p. 64). They note that there is often overlap between culture and minority group status. Clearly, the distinction blurs.

Thus, the techniques used in cross-cultural psychology have great applicability to multicultural psychological issues. The questions and concerns involved in adapting a test to make a comparison between U.S. and Mexican cultures, for example, are certainly applicable to the testing of Spanish-speaking Chicanos in the United States.

The Concepts of Etic and Emic

Linguists often use two important words in their work: phonetic and phonemic. According to Domino (2000), “Phonetic refers to the universal rules for all languages, while phonemic refers to the sounds of a particular language. From these terms, the words ‘etic’ and ‘emic’ were derived and used in the cross-cultural literature. Etic studies compare the same variable across cultures. . . . Emic studies focus on only one culture and do not attempt to compare cultures” (p. 296). These terms were originally coined by Pike (1967). The words etic and emic also refer to universal and local qualities, respectively. Thus, the terms have been used to describe both behaviors and investigations.

Emics are behaviors that apply only in a single society or culture and etics are those that are seen as universal, or without the restrictions of culture. A complaint made about traditional psychology is that it has presumed that certain findings in the field are etics, even though they have not been investigated in non-Western arenas (Berry, 1980; Moreland, 1996). Thus, some findings considered etics are only so-called pseudo etics (Triandis et al., 1971). The emic-etic distinction is one that has broad applicability to the adaptation of tests developed in America to other cultural spheres.

The emic-etic distinction also applies to the goals of and approaches to cross-cultural research: The first goal is to document valid principles that describe behavior in any one culture by using constructs that the people themselves conceive as meaningful and important; this is an emic analysis. The second goal of cross-cultural research is to make generalizations across cultures that take into account all human behavior. The goal, then, is theory building; that would be an etic analysis (Brislin, 1980, p. 391). In searching for etic findings, we are attempting to establish behavioral systems (or rules) that appear to hold across cultures. That is, we are endeavoring to verify that certain behavioral patterns exist universally. Emic studies look at the importance of a given behavior within a specific culture.

The Use of Tests and Assessments for Research and Applied Use

The goal of most cross-cultural psychologists and other researchers is the development of knowledge and the correlated development, expansion, and evaluation of theories of human behavior. Many applied psychologists, however, are more concerned with the use of tests with specific individuals, whether in clinical practice, school settings, industrial applications, or other environments in which tests are effectively used. The difference in the use of tests in these settings is significant; differences in the type or nature of the tests that they need for their work, however, may well be trivial. If we assume that the psychological variable or construct to be measured is the same, then differences required for such varied uses are likely to be minor. Both uses of tests, whether for research or application, require that the measure be assessed accurately and validly. Part of validity, it is argued, is that the measure is free from bias, including those biases that emerge from cultural and language differences. Some writers (e.g., Padilla & Medina, 1996) have accentuated the need for valid and fair assessments when the nature of the assessment is for high-stakes purposes such as admissions in higher education, placement into special education, employment, licensure, or psychodiagnosis.

The Nature of Equivalence

The very nature of cross-cultural psychology places a heavy emphasis upon assessment. In particular, measures that are used to make comparisons across cultural groups need to measure the characteristic unvaryingly in two or more cultural groups. Of course, in some settings, this procedure may be rather simple; a comparison of British and American participants with regard to a variable such as depression or intelligence may not produce unusual concerns. The language, English, is, of course, the same for both groups. Minor adjustments in the spelling of words (e.g., behavioral becomes behavioural) would first be needed. Some more careful editing of the items composing scales would also be needed, however, to assure that none of the items include content that has differing cultural connotations in the two countries. A question about baseball, for example, could affect resultant comparisons. These examples are provided simply to present the nature of the issue. Cross-cultural psychologists have focused upon the nature of equivalence and, in particular, have established qualitative levels of equivalence.

Many writers have considered the notion of equivalence in cross-cultural testing. Lonner (1979) is acknowledged often for systematizing our conception of equivalence in testing in cross-cultural psychology. He described four kinds of equivalence: linguistic equivalence, conceptual equivalence, functional equivalence, and metric equivalence (Nichols, Padilla, & Gomez-Maqueo, 2000). Brislin (1993) provided a similar nomenclature with three levels of equivalence: translation, conceptual, and metric, leaving out functional equivalence, an important kind of equivalence, as noted by Berry (1980), Butcher and Han (1998), and Helms (1992). van de Vijver and Leung (1997) operationalized four hierarchical levels of equivalence as well, encompassing construct inequivalence, construct equivalence, measurement unit equivalence, and scalar or full-score comparability. It should be noted, however, that like the concepts of test reliability and validity, equivalence is not a property resident in a particular test or assessment device (van de Vijver & Leung). Rather, the construct is tied to a particular instrument and the cultures involved. Equivalence is also time dependent, given the changes that may occur in cultures. Lonner’s approach, which would appear to be most highly accepted in the literature, is described in the next section, followed by an attempt to integrate some other approaches to equivalence.

Linguistic Equivalence

When a cross-cultural study involves two or more settings in which different languages are employed for communication, the quality and fidelity of the translation of tests, testing materials, questionnaires, interview questions, open-ended responses from test-takers, and the like are critical to the validity of the study. Differences in the wording of questions on a test, for example, can have a significant impact on both the validity of research results and the applicability of a measure in a practice setting. If items include idioms from the home language in the original form, the translation of those idioms is typically unlikely to convey the same meaning in the target language. The translation of tests from host language to target language has been a topic of major concern to cross-cultural psychologists and psychometricians involved in this work. A discussion of issues and approaches to the translation of testing materials appears later in this research paper.

Most of the emphasis on this topic has concerned the translation of tests and testing materials. Moreland (1996) called attention to the translation of test-taker responses and of testing materials. Using objective personality inventories, for example, in a new culture when they were developed in another requires substantial revisions in terms of language and cultural differences. It is relatively easy to use a projective device such as the Rorschach inkblots in a variety of languages. That is because such measures normally consist of nonverbal stimuli, which upon first glance do not need translating in terms of language. However, pictures and stimuli that are often found in such measures may need to be changed to be consistent with the culture in which they are to be used. (The images of stereotypic people may differ across cultures, as may other aspects of the stimuli that appear in the projective techniques.) Furthermore, it is critical in such a case that the scoring systems, including rubrics when available, be carefully translated as well. The same processes that are used to insure that test items are acceptable in both languages must be followed if the responses are to be evaluated in equivalent manners.

Conceptual Equivalence

The question asked in regard to conceptual equivalence may be seen as whether the test measures the same construct in both (or all) cultures, or whether the construct underlying the measure has the same meaning in all languages (Allen & Walsh, 2000). Conceptual equivalence therefore relates to test validity, especially construct validity.

Cronbach and Meehl (1955) established the conceptual structure for construct validity with the model of a nomological network. The nomological network concept is based upon our understanding of psychological constructs (hypothetical psychological variables, characteristics, or traits) through their relationships with other such variables. What psychologists understand about in vivo constructs emerges from how those constructs relate empirically to other constructs. In naturalistic settings, psychologists tend to measure two or more constructs for all participants in the investigation and to correlate scores among variables. Over time and multiple studies, evidence is amassed so that the relationships among variables appear known. From their relationships, the structure of these constructs becomes known and a nomological network can be imagined and charted; variables that tend to be highly related are closely aligned in the nomological network and those that are not related have no connective structure between them. The construct validity of a particular test, then, is the extent to which it appears to measure the theoretical construct or trait that it is intended to measure. This construct validity is assessed by determining the extent to which the test correlates with variables in the patterns predicted by the nomological network. When the test correlates with other variables with which it is expected to correlate, evidence of construct validation, called convergent validation, is found (Campbell & Fiske, 1959; Geisinger, 1992). Conversely, when a test does not correlate with measures from which the theory of the psychological construct suggests it should be independent, positive evidence of construct validation, called discriminant validation (Campbell & Fiske, 1959), is also found.

Consider the following simple example. Intelligence and school performance are both constructs measured by the Wechsler Intelligence Scale for Children–III (WISC-III) and grade point average (GPA)—in this instance, in the fourth grade. Numerous investigations in the United States provide data showing the two constructs to correlate moderately. The WISC-III is translated into French and a similar study is performed with fourth-graders in schools in Quebec, where a GPA measure similar to that in U.S. schools is available. If the correlation is similar between the two variables (intelligence and school performance), then some degree of conceptual equivalence between the English and French versions of the WISC-III is demonstrated. If a comparable result is not found, however, it is unclear whether (a) the WISC-III was not properly translated and adapted to French; (b) the GPA in the Quebec study is different somehow from that in the American studies; (c) one or both of the measured constructs (intelligence and school performance) does not exist in the same fashion in Quebec as they do in the United States; or (d) the constructs simply do not relate to each other the way they do in the United States. Additional research would be needed to establish the truth in this situation. This illustration is also an example of what van de Vijver and Leung (1997) have termed construct inequivalence, which occurs when an assessment instrument measures different constructs in different languages. No etic comparisons can be made in such a situation, because the comparison would be a classic apples-and-oranges contrast.
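
To make the logic of this comparison concrete, the following is a minimal sketch (in Python, using invented correlations and sample sizes rather than actual WISC-III or GPA data) of how the two culture-specific correlations could be compared with Fisher's r-to-z transformation; a nonsignificant difference would be consistent with, though not proof of, conceptual equivalence.

```python
import numpy as np
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    """Compare two independent correlations via Fisher's r-to-z transformation.

    Returns the z statistic and two-tailed p-value for the difference.
    """
    z1, z2 = np.arctanh(r1), np.arctanh(r2)        # Fisher z transforms of each correlation
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))      # standard error of the difference
    z = (z1 - z2) / se
    p = 2 * (1 - stats.norm.cdf(abs(z)))
    return z, p

# Illustrative values: correlation of test score with GPA in each cultural sample
z, p = compare_correlations(r1=0.55, n1=220, r2=0.48, n2=180)
print(f"z = {z:.2f}, p = {p:.3f}")  # a nonsignificant difference is consistent with conceptual equivalence
```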

Ultimately and theoretically, conceptual equivalence is achieved when a test that has considerable evidence of construct validity in the original or host language and culture is adapted for use in a second language and culture, and the target-language nomological network is identical to the original one. When such a nomological network has been replicated, it might be said that the construct validity of the test generalizes from the original test and culture to the target one. Factor analysis has long been used as a technique of choice for this equivalence evaluation (e.g., Ben-Porath, 1990). Techniques such as structural equation modeling are even more useful for such analyses (e.g., Byrne, 1989, 1994; Loehlin, 1992), in which the statistical model representing the nomological network in the host culture can be applied and tested in the target culture. Additional information on these approaches is provided later in this research paper. (Note that, throughout this research paper, the terms conceptual equivalence and construct equivalence are used synonymously.)
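
As one illustration of how replication of factor structure might be checked, the sketch below computes Tucker's coefficient of congruence between factor loading matrices estimated separately in the host and target cultures. The loading values are invented for illustration, and the approach assumes the factors have already been matched across groups (e.g., by target rotation); it is a simpler stand-in for the confirmatory, structural-equation-based tests cited above, with values near 1.0 commonly read as evidence that a factor replicates.

```python
import numpy as np

def tucker_congruence(load_a, load_b):
    """Tucker's coefficient of congruence for matched factor loading columns.

    load_a, load_b: (items x factors) loading matrices from two cultural groups,
    with columns already matched to one another.
    Returns one congruence value per factor.
    """
    load_a, load_b = np.asarray(load_a), np.asarray(load_b)
    numerator = (load_a * load_b).sum(axis=0)
    denominator = np.sqrt((load_a ** 2).sum(axis=0) * (load_b ** 2).sum(axis=0))
    return numerator / denominator

# Illustrative loadings for a two-factor solution in each culture (not real data)
host_loadings = np.array([[.70, .05], [.65, .10], [.08, .72], [.02, .68]])
target_loadings = np.array([[.66, .12], [.61, .03], [.11, .70], [.05, .64]])
print(tucker_congruence(host_loadings, target_loadings))  # values near 1.0 suggest the factors replicate
```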

Functional Equivalence

Functional equivalence is achieved when the domain of behaviors sampled on a test has the same purpose and meaning in both cultures in question. “For example, in the United States the handshake is functionally equivalent to the head bow with hands held together in India” (Nichols et al., 2000, p. 260). When applied to testing issues, functional equivalence is generally dealt with during the translation phase. The individuals who translate the test must actually perform a more difficult task than a simple translation. They frequently must adapt questions as well. That is, direct, literal translation of questions may not convey meaning because the behaviors mentioned in some or all of the items might not generalize across cultures. Therefore, those involved in adapting the original test to a new language and culture must remove or change those items that deal with behavior that does not generalize equivalently in the target culture. When translators find functionally equivalent behaviors to use to replace those that do not generalize across cultures, they are adapting, rather than translating, the test; for this reason, it is preferable to state that the test is adapted rather than translated (Geisinger, 1994; Hambleton, 1994). Some researchers appear to believe that functional equivalence has been subsumed by conceptual equivalence (e.g., Brislin, 1993; Moreland, 1996).

Metric Equivalence

Nichols et al. (2000) have defined metric equivalence as “the extent to which the instrument manifests similar psychometric properties (distributions, ranges, etc.) across cultures” (p. 256). According to Moreland (1996), the standards for meeting metric equivalence are higher than reported by Nichols et al. First, metric equivalence presumes conceptual equivalence. The measures must quantify the same variable in the same way across the cultures. Specifically, scores on the scale must convey the same meaning, regardless of which form was administered. There are some confusing elements to this concept. On one hand, such a standard does not require that the arithmetic means of the tests be the same in both cultures (Clark, 1987), but does require that individual scores be indicative of the same psychological meaning. Thus, it is implied that scores must be criterion referenced. An individual with a given score on a measure of psychopathology would need treatment, regardless of which language version of the form was taken. Similarly, an individual with the same low score on an intelligence measure should require special education, whether that score is at the 5th or 15th percentile of his or her cultural population. Some of the statistical techniques for establishing metric equivalence are described later in this research paper.

Part of metric equivalence is the establishment of comparable reliability and validity across cultures. Sundberg and Gonzales (1981) have reported that the reliability of measures is unlikely to be influenced by translation and adaptation. This writer finds such a generalization difficult to accept as a conclusion. If the range of scores is higher or lower in one culture, for example, then the reliability as traditionally defined will also be changed in that testing. The quality of the test adaptation, too, would have an impact. Moreland (1996) suggests that investigators would be wise to ascertain test stability (i.e., test-retest reliability) and internal consistency in any adapted measure, and this writer concurs with this recommendation.
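
As a simple illustration of the internal-consistency check recommended above, the following sketch computes coefficient alpha separately for the original and adapted forms. The score matrices here are simulated placeholders (one common factor plus noise), not data from any actual adaptation study; in practice, real item-level responses from each cultural sample would take their place.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's (coefficient) alpha for an (examinees x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()  # sum of individual item variances
    total_variance = scores.sum(axis=1).var(ddof=1)    # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Simulated placeholder data: ten items per form, one common factor plus noise
rng = np.random.default_rng(0)
latent_orig, latent_adapt = rng.normal(size=(250, 1)), rng.normal(size=(200, 1))
scores_original = latent_orig + rng.normal(size=(250, 10))
scores_adapted = latent_adapt + rng.normal(size=(200, 10))

print("alpha, original form:", round(cronbach_alpha(scores_original), 2))
print("alpha, adapted form: ", round(cronbach_alpha(scores_adapted), 2))
```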

Geisinger (1992) has considered the validation of tests for two populations or in two languages. There are many ways to establish validity evidence: content-related approaches, criterion-related approaches, and construct-related approaches. Construct approaches were already discussed with respect to conceptual equivalence.

In regard to establishing content validity, the adequacy of sampling of the domain is critical. To establish comparable content validity in two forms of a measure, each in a different language for a different cultural group, one must gauge first whether the content domain is the same or different in each case. In addition, one must establish that the domain is sampled with equivalent representativeness in all cases. Both of these determinations may be problematic. For example, imagine the translation of a fourth-grade mathematics test that, in its original country, is given to students who attend school 10 months of the year, whereas in the target country, the students attend for only 8 months. In such an instance, the two domains of mathematics taught in the 4th year of schooling are likely to be overlapping, but not identical. Because the students from the original country have already attended three longer school years prior to this year, they are likely to begin at a more advanced level. Furthermore, given that the year is longer, they are likely to cover more material during the academic year. In short, the domains are not likely to be identical. Finally, the representativeness of the domains must be considered.

Other Forms of Equivalence

van de Vijver and Leung (1997) have discussed two additional types of equivalence, both of which can probably be considered as subtypes of metric equivalence: measurement unit equivalence and scalar or full-score equivalence. Both of these concepts are worthy of brief consideration because they are important for both theoretical and applied cross-cultural uses of psychological tests.

Measurement Unit Equivalence

This level of equivalence indicates that a measure that has been adapted for use in a target culture continues to have the same units of measurement in the two culture-specific forms. That is, both forms of the measure must continue to yield assessments that follow an interval scale, and in addition, it must be the same interval scale. If a translated form of a test were studied using a sample in the target culture comparable to the original norm group and the new test form was found to have the same raw-score standard deviation as the original, this finding would be strong evidence of measurement unit equivalence. If the norms for the target population were extremely similar to that in the original population, these data would also be extremely strong substantiation of measurement unit equivalence.
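
The following sketch illustrates one way such evidence might be examined: comparing the raw-score standard deviations of the original norm group and a comparable target-culture sample, here with Levene's test of equal variances. The data are simulated for illustration only, and similar dispersion is of course necessary but not sufficient evidence of measurement unit equivalence.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
original_norms = rng.normal(loc=50, scale=10, size=500)  # illustrative raw scores, original norm group
target_sample = rng.normal(loc=47, scale=10, size=300)   # illustrative raw scores, target-culture sample

print("SD, original norm group:", round(original_norms.std(ddof=1), 2))
print("SD, target sample:      ", round(target_sample.std(ddof=1), 2))

# Levene's test: a nonsignificant result is consistent with (though it does not prove)
# equal units of measurement across the two culture-specific forms
statistic, p_value = stats.levene(original_norms, target_sample)
print(f"Levene W = {statistic:.2f}, p = {p_value:.3f}")
```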

Scalar (Full-Score) Equivalence

Scalar equivalence assumes measurement unit equivalence and requires one additional finding: Not only must the units be equivalent, the zero-points of the scales must also be identical. Thus, the units must both fall along the same ratio scale. It is unlikely that many psychological variables will achieve this level of equivalence, although some physiological variables, such as birth weight, certainly do.

The Nature of Bias

There have been many definitions of test bias (Flaugher, 1978). Messick’s (1980, 1989, 1995) conceptions of test bias are perhaps the most widely accepted and emerge from the perspective of construct validity. Messick portrayed bias as a specific source of test variance other than the valid variance associated with the desired construct. Bias is associated with some irrelevant variable, such as race, gender, or in the case of cross-cultural testing, culture (or perhaps country of origin). van de Vijver and his associates (van de Vijver, 2000; van de Vijver & Leung, 1997; van de Vijver & Poortinga, 1997) have perhaps best characterized bias within the context of cross-cultural assessment. “Test bias exists, from this viewpoint, when an existing test does not measure the equivalent underlying psychological construct in a new group or culture, as was measured within the original group in which it was standardized” (Allen & Walsh, 2000, p. 67). van de Vijver  describes bias as “a generic term for all nuisance factors threatening the validity of cross-cultural comparisons. Poor item translations, inappropriate item content, and lack of standardization in administration procedures are just a few examples” (van de Vijver & Leung, p. 10). He also describes bias as “a lack of similarity of psychological meaning of test scores across cultural groups” (van de Vijver, p. 88).

The term bias, as can be seen, is very closely related to conceptual equivalence. van de Vijver describes the distinction as follows: “The two concepts are strongly related, but have a somewhat different connotation. Bias refers to the presence or absence of validity-threatening factors in such assessment, whereas equivalence involves the consequences of these factors on the comparability of test scores” (van de Vijver, 2000, p. 89). In his various publications, van de Vijver identifies three groupings of bias: construct, method, and item bias. Each is described in the following sections.

Construct Bias

Measures that do not examine the same construct across cultural groups exhibit construct bias, which clearly is highly related to the notion of conceptual equivalence previously described. Construct bias would be evident in a measure that has been (a) factor analyzed in its original culture with the repeated finding of a four-factor solution, and that is then (b) translated to another language and administered to a sample from the culture in which that language is spoken, with a factor analysis of the test results indicating a two-factor and therefore different solution. When the constructs underlying the measures vary across cultures, with culture-specific components of the construct present in some cultures, such evidence is not only likely to result, it should result if both measures are to measure the construct validly in their respective cultures. Construct bias can occur when constructs only partially overlap across cultures, when there is a differential appropriateness of behaviors comprising the construct in different cultures, when there is a poor sampling of relevant behaviors constituting the construct in one or more cultures, or when there is incomplete coverage of all relevant aspects of the construct (van de Vijver, 2000).

An example of construct bias can be seen in the following. In many instances, inventories (such as personality inventories) are largely translated from the original language to a second or target language. If the culture that uses the target language has culture-specific aspects of personality that either do not exist or are not as prevalent as in the original culture, then these aspects will certainly not be translated into the target-language form of the assessment instrument.

The concept of construct bias has implications for both cross-cultural research and cross-cultural psychological practice. Cross-cultural or etic comparisons are unlikely to be very meaningful if the construct means something different in the two or more cultures, or if it is a reasonably valid representation of the construct in one culture but less so in the other. The practice implications in the target language emerge from the fact that the measure may not be valid as a measure of culturally relevant constructs that may be of consequence for diagnosis and treatment.

Method Bias

van de Vijver (2000) has identified a number of types of method bias, including sample, instrument, and administration bias. The different issues composing this type of bias were given the name method bias because they relate to the kinds of topics covered in methods sections of journal articles (van de Vijver). Method biases often affect performance on the assessment instrument as a whole (rather than affecting only components of the measure). Some of the types of method bias are described in the following.

Sample Bias

In studies comparing two or more cultures, the samples from each culture may differ on variables related to test-relevant background characteristics. These differences may affect the comparison. Examples of such characteristics would include fluency in the language in which testing occurs, general education levels, and underlying motivational levels (van de Vijver, 2000). Imagine an essay test that is administered in a single language. Two groups are compared: Both have facility with the language, but one has substantially more ability. Regardless of the knowledge involved in the answers, it is likely that the group that is more facile in the language will provide better answers on average because of their ability to employ the language better in answering the question.

Instrument Bias

This type of bias is much like sample bias, but the groups being tested tend to differ in less generic ways that are more specific to the testing method itself, as when a test subject has some familiarity with the general format of the testing or some other form of test sophistication. van de Vijver (2000) states that the most common forms of this bias exist when groups differ by response styles in answering questions, or by their familiarity with the materials on the test. As is described later in this research paper, attempts to develop culture-fair or culture-free intelligence tests (e.g., Bracken, Naglieri, & Baardos, 1999; Cattell, 1940; Cattell & Cattell, 1963) have often used geometric figures rather than language in the effort to avoid dependence upon language. Groups that differ in educational experience or by culture also may have differential exposure to geometric figures. This differential contact with the stimuli composing the test may bias the comparison in a manner that is difficult to disentangle from the construct of intelligence measured by the instrument.

Alternatively, different cultures vary in the tendency of their members to disclose personal issues about themselves. When two cultures are compared, depending upon who is making the comparison, it is possible that one group could look overly self-revelatory or the other too private.

Imagine the use of a measure such as the Thematic Apperception Test (TAT) in a cross-cultural comparison. Not only do the people pictured on many of the TAT cards not look like persons from some cultures, but the scenes themselves have a decidedly Western orientation. Respondents from some cultures would obviously find such pictures more foreign and strange.

Geisinger (1994) recommended the use of enhanced test-practice exercises to attempt to reduce differences in test-format familiarity. Such exercises could be performed at the testing site or in advance of the testing, depending upon the time needed to gain familiarity.

Administration Bias

The final type of method bias emerges from the interactions between test-taker or respondent and the individual administering the test, whether the test, questionnaire, or interview is individually administered or is completed in more of a large-group situation. Such biases could come from language problems on the part of an interviewer, who may be conducting the interview in a language in which he or she is less adept than might be ideal (van de Vijver & Leung, 1997). Communications problems may result from other language problems—for example, the common mistakes individuals often make in second languages in regard to the use of the familiar second person.

Another example of administration bias may be seen in the multicultural testing literature in the United States. The theory of stereotype threat (Steele, 1997; Steele & Aronson, 1995) suggests that African Americans, when taking an individualized intellectual assessment or other similar measure that is administered by someone whom the African American test-takers believe holds negative stereotypes about them, will perform at the level expected by the test administrator. Steele’s theory holds that negative stereotypes can have a powerful influence on the results of important assessments—stereotypes that can influence test-takers’ performance and, ultimately, their lives. Of course, in a world where there are many cultural tensions and cultural preconceptions, it is possible that this threat may apply to groups other than Whites and Blacks in the American culture. van de Vijver (2000) concludes his chapter, however, with the statement that with notable exceptions, responses to either interviews or most cognitive tests do not seem to be strongly affected by the cultural, racial, or ethnic status of administrators.

Item Bias

In the late 1970s, those involved in large-scale testing, primarily psychometricians, began studying the possibility that items could be biased—that is, that the format or content of items could influence responses to individual items on tests in unintended ways. Because of the connotations of the term biased, the topic has more recently been termed differential item functioning, or DIF (e.g., Holland & Wainer, 1993). van de Vijver and his associates prefer continuing to use the term item bias, however, to accentuate the notion that these factors are measurement issues that, if not handled appropriately, may bias the results of measurement. Essentially, on a cognitive test, an item is biased for a particular group if it is more difficult for individuals of a certain level of overall ability than it is for members of the other group who have that same level of overall ability. Items may appear biased in this way because they deal with content that is not uniformly available to members of all cultures involved. They may also be identified as biased because translations have not been adequately performed.
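
One widely used statistical approach that operationalizes this conditioning-on-ability definition is the Mantel-Haenszel procedure. The sketch below is a minimal illustration for a single dichotomous item, matching reference-group and focal-group examinees on total test score; the responses are simulated, and the flagging threshold mentioned in the comments is the conventional ETS guideline rather than a rule stated in this research paper.

```python
import numpy as np
import pandas as pd

def mantel_haenszel_dif(item, total, group):
    """Mantel-Haenszel common odds ratio for one dichotomous (0/1) item.

    item:  0/1 responses to the studied item
    total: matching variable, e.g., total test score
    group: 'ref' (reference) or 'focal' label for each examinee
    Examinees are stratified on the matching variable; an odds ratio near 1
    (delta near 0) indicates little differential item functioning.
    """
    df = pd.DataFrame({"item": item, "total": total, "group": group})
    numerator = denominator = 0.0
    for _, stratum in df.groupby("total"):
        n = len(stratum)
        a = ((stratum["group"] == "ref") & (stratum["item"] == 1)).sum()    # reference correct
        b = ((stratum["group"] == "ref") & (stratum["item"] == 0)).sum()    # reference incorrect
        c = ((stratum["group"] == "focal") & (stratum["item"] == 1)).sum()  # focal correct
        d = ((stratum["group"] == "focal") & (stratum["item"] == 0)).sum()  # focal incorrect
        numerator += a * d / n
        denominator += b * c / n
    alpha_mh = numerator / denominator
    delta_mh = -2.35 * np.log(alpha_mh)  # ETS delta metric; |delta| >= 1.5 is commonly flagged as large DIF
    return alpha_mh, delta_mh

# Illustrative usage with simulated (not real) responses
rng = np.random.default_rng(0)
n_examinees = 400
group = np.where(rng.random(n_examinees) < 0.5, "ref", "focal")
total = rng.integers(0, 21, size=n_examinees)
item = (rng.random(n_examinees) < 0.5).astype(int)
print(mantel_haenszel_dif(item, total, group))
```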

In addition, there may be factors such as words describing concepts that are differentially more difficult in one language than the other. A number of studies are beginning to appear in the literature describing the kinds of item problems that lead to differential levels of difficulty (e.g., Allalouf, Hambleton, & Sireci, 1999; Budgell, Raju, & Quartetti, 1995; Hulin, 1987; Tanzer, Gittler, & Sim, 1994). Some of these findings may prove very useful for future test-translation projects, and may even influence the construction of tests that are likely to be translated. For example, in an early study of Hispanics and Anglos taking the Scholastic Aptitude Test, Schmitt (1988) found that verbal items that used roots common to English and Spanish appeared to help Hispanics. Limited evidence suggested that words that differed in cognates (words that appear to have the same roots but have different meanings in both languages) and homographs (words spelled alike in both languages but with different meanings in the two) caused difficulties for Hispanic test-takers. Allalouf et al. found that on a verbal test for college admissions that had been translated from Hebrew to Russian, analogy items presented the most problems. Most of these difficulties (items identified as differentially more difficult) emerged from word difficulty, especially in analogy items. Interestingly, many of the analogies were easier in Russian, the target language. Apparently, the translators chose easier words, thus making items less difficult. Sentence completion items had been difficult to translate because of different sentence structures in the two languages. Reading comprehension items also led to some problems, mostly related to the specific content of the reading passages in question. In some cases, the content was seen as culturally specific. Allalouf et al. also concluded that differences in item difficulty can emerge from differences in wording, differences in content, differences in format, or differences in cultural relevance.

If the responses to questions on a test are not objective in format, differences in scoring rubrics can also lead to item bias. Budgell et al. (1995) used an expert committee to review the results of statistical analyses that identified items as biased; in many cases, the committee could not provide logic as to why an item translated from English to French was differentially more difficult for one group or the other.

Item bias has been studied primarily in cognitive measures: ability and achievement tests. van de Vijver (2000) correctly notes that measures such as the Rorschach should also be evaluated for item bias. It is possible that members of different cultures could differentially interpret the cards on measures such as the Rorschach or the TAT.

The Translation and Adaptation of Tests

During the 1990s, considerable attention was provided to the translation and adaptation of tests and assessment devices in the disciplines of testing and psychometrics and in the literature associated with these fields. The International Test Commission developed guidelines that were shared in various draft stages (e.g., Hambleton, 1994, 1999; van de Vijver & Leung, 1997); the resulting guidelines were a major accomplishment for testing and for cross-cultural psychology, and they are provided as the appendix to this research paper. Seminal papers on test translation (Berry, 1980; Bracken & Barona, 1991; Brislin, 1970, 1980, 1986; Butcher & Pancheri, 1976; Geisinger, 1994; Hambleton, 1994; Lonner, 1979; Werner & Campbell, 1970) appeared and helped individuals faced with the conversion of an instrument from one language and culture to another. The following sections note some of the issues to be faced regarding language and culture, and then provide a brief description of the approaches to test translation.

The Role of Language

One of the primary ways that cultures differ is through language. In fact, in considering the level of acculturation of individuals, their language skills are often given dominant (and sometimes mistaken) importance. Even within countries and regions of the world where ostensibly the same language is spoken, accents can make oral communication difficult. Language skill is, ultimately, the ability to communicate. There are different types of language skills. Whereas some language scholars consider competencies in areas such as grammar, discourse, language strategy, and sociolinguistic facility (see Durán, 1988), the focus of many language scholars has been on more holistic approaches to language. Generally, language skills may be considered along two dimensions, one being the oral and written dimension, and the other being the understanding and expression dimension.

Depending upon the nature of the assessment to be performed, different kinds and qualities of language skills may be needed. Academically oriented, largely written language skills may require 6 to 8 years of instruction and use to develop, whereas the development of the spoken word for everyday situations is much faster. These issues are of critical importance for the assessment of immigrants and their children (Geisinger, 2002; Sandoval, 1998). Some cross-cultural comparisons are made using one language for both groups, even though the language may be the second language for one of the groups. In such cases, language skills may be a confounding variable. In the United States, the issue of language often obscures comparisons of Anglos and Hispanics. Pennock-Roman (1992) demonstrated that English-language tests for admissions to higher education may not be valid for language minorities when their English-language skills are not strong.

Culture and language may be very closely wedded. However, not all individuals who speak the same language come from the same culture or are able to take the same test validly. Also within the United States, the heterogeneous nature of individuals who might be classified as Hispanic Americans is underscored by the need for multiple Spanish-language translations and adaptations of tests (Handel & Ben-Porath, 2000). For example, the same Spanish-language measure may not be appropriate for Mexicans, individuals from the Caribbean, and individuals from the Iberian Peninsula. Individuals from different Latin American countries may also need different instruments.

Measurement of Language Proficiency

Before making many assessments, we need to establish whether the individuals being tested have the requisite levels of language skills that will be used on the examination. We also need to develop better measures of language skills (Durán, 1988). In American schools, English-language skills should be evaluated early in the schooling of a child whose home language is not English to determine whether that child can profit from English-language instruction (Durán). Handel and Ben-Porath (2000) argue that a similar assessment should be made prior to administration of the Minnesota Multiphasic Personality Inventory–2 (MMPI-2) because studies have shown that it does not work as effectively with individuals whose language skills are not robust, and that the validity of at least some of the scales is compromised if the respondent does not have adequate English reading ability. Many school systems and other agencies have developed tests of language skills, sometimes equating tests of English- and Spanish-language skills so that the scores are comparable (e.g., O’Brien, 1992; Yansen & Shulman, 1996).

The Role of Culture

One of the primary and most evident ways that cultures often differ is by spoken and written language. They also differ, of course, in many other ways. Individuals from two different cultures who take part in a cross-cultural investigation are likely to differ according to other variables that can influence testing and the study. Among these are speed of responding, the amount of education that they have received, the nature and content of that schooling, and their levels of motivation. All of these factors may influence the findings of cross-cultural studies.

Culture may be envisioned as an antecedent of behavior. Culture is defined as a set of contextual variables that have either a direct or an indirect effect on thought and behavior (Cuéllar, 2000). Culture provides a context for all human behavior, thought, and other mediating variables, and it is so pervasive that it is difficult to imagine human behavior that might occur without reference to culture. As noted at the beginning of this research paper, one of the goals of cross-cultural psychology is to investigate and differentiate those behaviors, behavior patterns, personalities, attitudes, worldviews, and so on that are universal from those that are culture specific (see van de Vijver & Poortinga, 1982). “The current field of cultural psychology represents a recognition of the cultural specificity of all human behavior, whereby basic psychological processes may result in highly diverse performance, attitude, self-concepts, and world views in members of different cultural populations” (Anastasi & Urbina, 1997, p. 341). Personality has sometimes been defined as an all-inclusive characteristic describing one’s patterns of responding to others and to the world. It is not surprising that so many anthropologists and cross-cultural psychologists have studied the influence of culture and personality—the effects of the pervasive environment on the superordinate organization that mediates behavior and one’s interaction with the world.

For much of the history of Western psychology, investigators and theorists believed that many basic psychological processes were universal, that is, that they transcended individual cultures (e.g., Moreland, 1996; Padilla & Medina, 1996). More psychologists now recognize that culture has an omnipresent impact. For example, the American Psychiatric Association’s current Diagnostic and Statistical Manual of Mental Disorders (fourth edition, or DSM-IV) was improved over its predecessor, DSM-III-R, by “including an appendix that gives instructions on how to understand culture and descriptions of culture-bound syndromes” (Keitel, Kopala, & Adamson, 1996, p. 35).

Malgady (1996) has extended this argument. He has stated that we should actually assume cultural nonequivalence rather than cultural equivalence. Of course, much of the work of cross-cultural psychologists and other researchers is determining whether our measures are equivalent and appropriate (not biased). Clearly, if we are unable to make parallel tests that are usable in more than one culture and whose scores are comparable in the varying cultures, then we do not have an ability to compare cultures. Helms (1992), too, has asked this question with a particular reference to intelligence testing, arguing that intelligence tests are oriented to middle-class, White Americans.

What about working with individuals with similar-appearing psychological concerns in different cultures, especially where a useful measure has been identified in one culture? Can we use comparable, but not identical, psychological measures in different cultures? Indeed, we should probably use culture-specific measures, even though these measures cannot be used for cross-cultural comparisons (van de Vijver, 2000). If we use measures that have been translated or adapted from other cultures, we need to revalidate them in the new culture. In some circumstances, we may also need to use assessments of acculturation as well as tests of language proficiency before we use tests with clients requiring assessment. (See Geisinger, 2002, for a description of such an assessment paradigm.)

There are problems inherent in even the best translations of tests. For example, even when professional translators, content or psychological specialists, and test designers are involved, “direct translations of the tests are often not possible as psychological constructs may have relevance in one culture and not in another. . . . Just because the content of the items is preserved does not automatically insure that the item taps the same ability within the cultural context of the individual being tested” (Suzuki, Vraniak, & Kugler, 1996). This lack of parallelism is, of course, what has already been seen as construct bias and a lack of conceptual equivalence. Brislin (1980) has also referred to this issue as translatability. A test is poorly translated when salient characteristics in the construct to be assessed cannot be translated. The translation and adaptation of tests is considered later in this section.

Intelligence tests are among the most commonly administered types of tests. Kamin (1974) demonstrated in dramatic form how tests of intelligence were once used in U.S. governmental decision making concerning the status of immigrants. In general, immigration policies were affected by analyses of the intelligence of immigrants from varying countries and regions of the world. Average IQs were computed from country and region of origin; these data were shared widely and generally were believed to be the results of innate ability, even though one of the strongest findings of the time was that the longer immigrants lived in the United States, the higher their levels of intelligence were (Kamin). The tests used for many of these analyses were the famous Army Alpha and Army Beta, which were early verbal and nonverbal tests of intelligence, respectively. It was obvious even then that language could cause validity problems in the intelligence testing of those whose English was not proficient. Leaders in the intelligence-testing community have also attempted to develop tests that could be used cross-culturally without translation. These measures have often been called culture-free or culture-fair tests of intelligence.

Culture-Free and Culture-Fair Assessments of Intelligence

Some psychologists initially attempted to develop so-called culture-free measures of intelligence. In the 1940s, for example, Cattell (1940) attempted to use very simple geometric forms that were not reliant upon language to construct what he termed culture-free tests. These tests were based on a notion that Cattell (1963) conceptualized and later developed into a theory that intelligence could be decomposed into two types of ability: fluid and crystallized mental abilities. Fluid abilities were nonverbal and involved in adaptation and learning capabilities. Crystallized abilities were developed as a result of the use of fluid abilities and were based upon cultural assimilation (Sattler, 1992). These tests, then, were intended to measure only fluid abilities and, according to this theory, would hence be culture-free: that is, implicitly conceptually equivalent.

It was soon realized that it was not possible to eliminate the effects of culture from even these geometric-stimulus-based, nonverbal tests. “Even those designated as ‘culture-free’ do not eliminate the effects of previous cultural experiences, both of impoverishment and enrichment. Language factors greatly affect performance, and some of the tasks used to measure intelligence have little or no relevance for cultures very different from the Anglo-European” (Ritzler, 1996, p. 125). Nonlanguage tests may even be more culturally loaded than language-based tests. Larger group differences with nonverbal tests than with verbal ones have often been found. “Nonverbal, spatial-perceptual tests frequently require relatively abstract thinking processes and analytic cognitive styles characteristic of middle-class Western cultures” (Anastasi & Urbina, 1997, p. 344). In retrospect, “cultural influences will and should be reflected in test performance. It is therefore futile to try to devise a test that is free from cultural influences” (Anastasi & Urbina, p. 342).

Noting these and other reactions, Cattell (Cattell & Cattell, 1963) tried to balance cultural influences and build what he termed culture-fair tests. These tests also tend to use geometric forms of various types. The items frequently involve completing complex patterns, classification tasks, or solving printed mazes; and although the tests can be paper-and-pencil, they can also be based on performance tasks and thus avoid language-based verbal questions. They may involve pictures rather than verbal stimuli. Even such tests were not seen as fully viable:

It is unlikely, moreover, that any test can be equally “fair” to more than one cultural group, especially if the cultures are quite dissimilar. While reducing cultural differentials in test performance, cross-cultural tests cannot completely eliminate such differentials. Every test tends to favor persons from the culture in which it was developed. (Anastasi & Urbina, 1997, p. 342)

Some cultures place greater or lesser emphases upon abstractions, and some cultures value the understanding of contexts and situations more than Western cultures (Cole & Bruner, 1971).

On the other hand, there is a substantial literature that suggests culture-fair tests like the Cattell fulfill not only theoretical and social concerns but practical needs as well. . . . Smith, Hays, and Solway (1977) compared the Cattell Culture-Fair Test and the WISC-R in a sample of juvenile delinquents, 53% of whom were black or Mexican-Americans. . . . The authors concluded that the Cattell is a better measure of intelligence for minority groups than the WISC-R, as it lessens the effect of cultural bias and presents a “more accurate” picture of their intellectual capacity. (Domino, 2000, p. 300)

Some of our top test developers continue to develop tests intended to be culture-fair (Bracken et al., 1999). Although such measures may not be so culture-fair that they would permit cross-cultural comparisons free of cultural biases, they nevertheless have been used effectively in a variety of cultures and may be transported from culture to culture without many of the translation issues that burden most tests of ability used in more than one culture. Such tests should, however, be evaluated carefully for what some have seen as their middle-class, Anglo-European orientation.

Acculturation

Cuéllar (2000) has described acculturation as a moderator variable between personality and behavior, and culture as “learned behavior transmitted from one generation to the next” (p. 115). When an individual leaves one culture and joins a second, a transition is generally needed. This transition is, at least in part, acculturation. “Most psychological research defines the construct of acculturation as the product of learning due to contacts between the members of two or more groups” (Marín, 1992, p. 345). Learning the language of the new culture is only one of the more obvious aspects of acculturation. It also involves learning the history and traditions of the new culture, changing personally meaningful behaviors in one’s life (including the use of one’s language), and changing norms, values, worldview, and interaction patterns (Marín).

In practice settings, when considering the test performance of an individual who is not from the culture in which the assessment instrument was developed, one needs to consider the individual’s level of acculturation. Many variables have been shown to be influenced by the level of acculturation in the individual being assessed. Sue, Keefe, Enomoto, Durvasula, and Chao (1996), for example, found that acculturation affected scales on the MMPI-2. It has also been shown that one’s level of acculturation affects personality scales to the extent that these differences could lead to different diagnoses and, perhaps, hospitalization decisions (Cuéllar, 2000).

Keitel et al. (1996) have provided guidelines for conducting ethical multicultural assessments. Included among these guidelines are assessing acculturation level, selecting tests appropriate for the culture of the test taker, administering tests in an unbiased fashion, and interpreting results appropriately and in a manner that a client can understand. Dana (1993) and Moreland (1996) concur that acculturation should be assessed as a part of an in-depth evaluation. They suggest, as well, that the psychologist first assess an individual’s acculturation and then use instruments appropriate for the individual’s dominant culture. Too often, they fear, psychologists use instruments from the dominant culture, with which the psychologist is more likely to be familiar. They also propose that a psychologist dealing with a client who is not fully acculturated should consider test results with respect to the individual’s test sophistication, motivation, and other psychological factors that may be influenced by the level of his or her acculturation. Because of the importance of learning to deal with clients who are not from dominant cultures, it has been argued that in training psychologists and other human-service professionals, practicums should provide students with access to clients from different cultures (Geisinger & Carlson, 1998; Keitel et al., 1996).

There are many measures of acculturation. Measurement is complex, in part because it is not a unidimensional characteristic (even though many measures treat it as such). Discussion of this topic is beyond the scope of the present research paper; however, the interested reader is referred to Cuéllar (2000), Marín (1992), or Olmeda (1979).

Approaches to Test Adaptation and Translation

The translation and adaptation of tests was one of the most discussed testing issues in the 1990s. The decade ended with a major conference held in Washington, DC, in 1999 called the “International Conference on Test Adaptation: Adapting Tests for Use in Multiple Languages and Cultures.” The conference brought together many of the leaders in this area of study for an exchange of ideas. In a decade during which many tests had been translated and adapted, and some examples of poor testing practice had been noted, one of the significant developments was the publication of the International Test Commission guidelines on the adapting of tests. These guidelines, which appear as the appendix to this research paper, summarize some of the best thinking on test adaptation. They may also be found in annotated form in Hambleton (1999) and van de Vijver and Leung (1997). The term test adaptation also took prominence during the last decade of the twentieth century; previously, the term test translation had been dominant. This change was based on the more widespread recognition that changes to tests were needed to reflect both cultural differences and language differences. These issues have probably long been known in the cross-cultural psychology profession, but less so in the field of testing. (For excellent treatments on the translation of research materials, see Brislin, 1980, 1986.)

There are a variety of qualitatively different approaches to test adaptation. Of course, for some cross-cultural testing projects, one might develop a new measure altogether to meet one’s needs. Such an approach is not test adaptation per se, but nonetheless would need to follow many of the principles of this process. Before building a test for use in more than one culture, one would need to ask how universal the constructs to be tested are (Anastasi & Urbina, 1997; van de Vijver & Poortinga, 1982). One would also have to decide how to validate the measure in the varying cultures. If one imagines a simple approach to validation (e.g., the criterion-related approach), one would need equivalent criteria in each culture. This requirement is often formidable. A more common model is to take an existing and generally much-used measure from one culture and language and attempt to translate it to a second culture and language.

van de Vijver and Leung (1997) have identified three general approaches to adapting tests: back-translation, decentering, and the committee approach. Each of these is described in turn in the following sections. Prior to the development of any of these general approaches, however, some researchers and test developers simply translated tests from one language to a second. For purposes of this discussion, this unadulterated technique is called direct translation; it has sometimes been called forward translation, but this writer does not prefer that name because the process is not the opposite of the back-translation procedure. The techniques embodied by these three general approaches serve as improvements over the direct translation of tests.

Back-Translation

This technique is sometimes called the translation/back-translation technique and was an initial attempt to advance the test adaptation process beyond a direct test translation (Brislin, 1970; Werner & Campbell, 1970). In this approach, an initial translator or team of translators alters the test materials from the original language to the target language. Then a second translator or team, without having seen the original test, begins with the target-language translation and renders this form back to the original language. At this point, the original test developer (or the individuals who plan to use the translated test, or their representatives) compares the original test with the back-translated version, both of which are in the original language. The quality of the translation is evaluated in terms of how closely the back-translated version agrees with the original test. This technique was widely cited as the procedure of choice (e.g., Butcher & Pancheri, 1976) for several decades, and it has been very useful in remedying certain translation problems (van de Vijver & Leung, 1997). It may be especially useful if the test user or developer lacks facility in the target language. It also provides an attempt to evaluate the quality of the translation. The technique has disadvantages, however. The orientation is on a language-only translation; there is no possibility of changes in the test to accommodate cultural differences. Thus, if there are culture-specific aspects of the test, this technique should generally not be used. In fact, this technique can lead to special problems if the translators know that their work will be evaluated through a back-translation procedure. In such an instance, they may use stilted language or wording to ensure an accurate back-translation rather than a properly worded translation that would be understood best by test takers in the target language. In short, “a translation-back translation procedure pays more attention to the semantics and less to connotations, naturalness, and comprehensibility” (van de Vijver & Leung, 1997, p. 39).

Decentering

The process of culturally decentering test materials is somewhat more complex than either the direct translation or translation/back-translation processes (Werner & Campbell, 1970). Cultural decentering does involve translation of an instrument from an original language to a target language. However, unlike direct translation, the original measure is changed prior to being adapted (or translated) to improve its translatability; those components of the test that are likely to be specific to the original culture are removed or altered. Thus, the cultural biases, both construct and method, are reduced. In addition, the wording of the original measure may be changed in a way that will enhance its translatability. The process is usually performed by a team composed of multilingual, multicultural individuals who have knowledge of the construct to be measured and, perhaps, of the original measure (van de Vijver & Leung, 1997). This team then changes the original measure so that “there will be a smooth, natural-sounding version in the second language” (Brislin, 1980, p. 433). If decentering is successful, the two assessment instruments that result, one in each language, are both generally free of culture-specific language and content. “Tanzer, Gittler, and Ellis (1995) developed a test of spatial ability that was used in Austria and the United States. The instructions and stimuli were simultaneously in German and English” (van de Vijver & Leung, 1997, pp. 39–40).

There are several reasons that cultural decentering is not frequently performed, however. First, of course, is that the process is time-consuming and expensive. Second, data collected using the original instrument in the first language cannot be used as part of cross-cultural comparisons; only data from the two decentered measures may be used. This condition means that the rich history of validation and normative data that may be available for the original measure are likely to have little use, and the decentered measure in the original language must be used in regathering such information for comparative purposes. For this reason, this process is most likely to be used in comparative cross-cultural research when there is not plentiful supportive data on the original measure. When the technique is used, it essentially entails two test-construction processes.

The Committee Approach

This approach was probably first described by Brislin (1980), has been summarized by van de Vijver and Leung (1997), and is explained in some detail by Geisinger (1994). In this method, a group of bilingual individuals translates the test from the original language to the target language. The members of the committee need to be not only bilingual, but also thoroughly familiar with both cultures, with the construct(s) measured on the test, and with general testing principles. Like most committee processes, this procedure has advantages and disadvantages. A committee will be more expensive than a single translator. A committee may not work well together, or may be dominated by one or more persons. Some members of the committee may not contribute fully or may be reticent to participate for personal reasons. On the other hand, members of the committee are likely to catch mistakes of others on the committee (Brislin, 1980). It is also possible that the committee members can cooperate and help each other, especially if their expertise is complementary (van de Vijver & Leung, 1997). This method, however, like the decentering method, does not include an independent evaluation of its effectiveness. Therefore, it is useful to couple the work of a committee with a back-translation.

Rules for Adapting Test and Assessment Materials

Brislin (1980, p. 432) provided a listing of general rules for developing research documents and instruments that are to be translated. These are rules that generate documents written in English that are likely to be successfully translated or adapted, similar to decentering. Most appear as rules for good writing and effective communication, and they have considerable applicability. These 12 rules have been edited slightly for use here.

  1. Use short, simple sentences of fewer than 16 words.
  2. Employ active rather than passive words.
  3. Repeat nouns instead of using pronouns.
  4. Avoid metaphors and colloquialisms. Such phrases are least likely to have equivalents in the target language.
  5. Avoid the subjunctive mood (e.g., verb forms with could or would).
  6. Add sentences that provide context for key ideas. Reword key phrases to provide redundancy. This rule suggests that longer items and questions be used only in single-country research.
  7. Avoid adverbs and prepositions telling where or when (e.g., frequently, beyond, around).
  8. Avoid possessive forms where possible.
  9. Use specific rather than general terms (e.g., the specific animal name, such as cows, chickens, or pigs, rather than the general term livestock).
  10. Avoid words indicating vagueness regarding some event or thing (e.g., probably, frequently).
  11. Use wording familiar to the translators where possible.
  12. Avoid sentences with two different verbs if the verbs suggest two different actions.

Steps in the Translation and Adaptation Process

Geisinger (1994) elaborated 10 steps that should be involved in any test-adaptation process. In general, these steps are themselves an adaptation of the steps in any test-development project. Other writers have altered these procedural steps to some extent, but most modifications are quite minor. Each step is listed and annotated briefly below.

  1. Translate and adapt the measure. “Sometimes an instrument can be translated or adapted on a question-by-question basis. At other times, it must be adapted and translated only in concept” (Geisinger, 1994, p. 306). This decision turns on whether the content and constructs measured by the test are free from construct bias. The selection of translators is a major factor in the success of this stage, and Hambleton (1999) provides good suggestions in this regard. Translators must be knowledgeable about the content covered on the test, completely bilingual, expert about both cultures, and often able to work as part of a team.
  2. Review the translated or adapted version of the instrument. Once the measure has been adapted, the quality of the new document must be judged. Back-translation can be employed at this stage, but it may be more effective to empanel individual or group reviews of the changed document. Geisinger (1994) suggested that members of the panel review the test individually in writing, share their comments with one another, and then meet to resolve differences of opinion and, perhaps, to rewrite portions of the draft test. The individual participants in this process must meet a number of criteria. They must be fluent in both languages and knowledgeable about both cultures. They must also understand the characteristics measured with the instrument and the likely uses to which the test is to be put. If they do not meet any one of these criteria, their assessment may be flawed.
  3. Adapt the draft instrument on the basis of the comments of the reviewers. The individuals involved in the translation or adaptation process need to receive the feedback that arose in Step 2 and consider the comments. There may be reasons not to follow some of the suggestions of the review panel (e.g., reasons related to the validity of the instrument), and the original test author, test users, and the translator should consider these comments.
  4. Pilot-test the instrument. It is frequently useful to have a very small number of individuals who can take the test and share concerns and reactions that they may have. They should be as similar as possible to the eventual test takers, and they should be interviewed (or should complete a questionnaire) after taking the test. They may be able to identify problems or ambiguities in wording, in the instructions, in timing, and so on. Any changes that are needed after the pilot test should be made, and if these alterations are extensive, the test may need to be pilot-tested once again.

  5. Field-test the instrument. This step differs from the pilot test in that it involves a large and representative sample. If the population taking the test in the target language is diverse, all elements of that diversity should be represented and perhaps overrepresented. After collection of these data, the reliability of the test should be assessed and item analyses performed. Included in the item analyses should be analyses for item bias (both as compared to the original-language version and, perhaps, across elements of diversity within the target field-testing sample). van de Vijver and Poortinga (1991) describe some of the analyses that should be performed on an item-analysis basis.
  6. Standardize the scores. If desirable and appropriate, equate them with scores on the original version. If the sample size is large enough, it would be useful (and necessary for tests to be used in practice) to establish norms. If the field-test sample is not large enough, and the test is to be used for more than cross-cultural research in the target language, then collection of norm data is necessary. Scores may be equated back to the score scale of the original instrument, just as may be performed for any new form of a test (a minimal illustration appears in the sketch following this list). These procedures are beyond the scope of the present research paper, but may be found in Angoff (1971), Holland and Rubin (1982), or Kolen and Brennan (1995).
  7. Perform validation research as needed. The validation research that is needed includes at least research to establish the equivalence to the original measure. However, as noted earlier, the concepts of construct validation represent the ideal to be sought (Embretson, 1983). Some forms of appropriate revalidation are needed before the test can be used with clients in the target language. It is appropriate to perform validation research before the test is used in research projects, as well.
  8. Develop a manual and other documents for users of the assessment device. Users of this newly adapted instrument are going to need information so that they may employ it effectively. A manual that describes administration, scoring, and interpretation should be provided. To provide information that relates to interpretation, summarization of norms, equating (if any), reliability analyses, validity analyses, and investigations of bias should all be provided. Statements regarding the process of adaptation should be also included.
  9. Train users. New users of any instrument need instruction so that they may use it effectively. There may be special problems associated with adapted instruments because users may tend to use materials and to employ knowledge that they have of the original measure. Although transfer of training is often positive, if there are differences between the language versions, negative consequences may result.
  10. Collect reactions from users. Whether the instrument is to be used for cross-cultural research or with actual clients, it behooves the test adaptation team to collect the thoughts of users (and perhaps of test takers as well) and to do so on a continuing basis. As more individuals take the test, the different experiential backgrounds present may identify concerns. Such comments may lead to changes in future versions of the target-language form.
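The equating literature cited in step 6 covers a range of designs and models; as a minimal, hypothetical illustration of the simplest case, the sketch below places target-language raw scores on the original-language score scale by matching means and standard deviations (linear equating). The score arrays are invented for the example, and the approach assumes the two field-test groups are equivalent in ability; an operational study would instead use one of the data-collection designs described later in this paper.

```python
import numpy as np

def linear_equate(target_scores, original_scores):
    """Map target-language raw scores onto the original-language scale
    by matching means and standard deviations (linear equating)."""
    x = np.asarray(target_scores, dtype=float)
    y = np.asarray(original_scores, dtype=float)
    slope = y.std(ddof=1) / x.std(ddof=1)
    intercept = y.mean() - slope * x.mean()
    return slope * x + intercept

# Hypothetical field-test scores for the two language versions
adapted_scores = np.random.default_rng(1).normal(loc=48, scale=9, size=500)
original_scores = np.random.default_rng(2).normal(loc=52, scale=10, size=500)
equated = linear_equate(adapted_scores, original_scores)
print(equated.mean(), equated.std(ddof=1))   # roughly 52 and 10, the original scale
```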

Methods of Evaluating Test Equivalence

Once a test has been adapted into a target language, it is necessary to establish that the test has the kinds of equivalence that are needed for proper test interpretation and use. Methodologists and psychometricians have worked for several decades on this concern, and a number of research designs and statistical methods are available to provide data for this analysis, which ultimately informs the test-development team’s judgment regarding test equivalence. Such research is essential for tests that are to be used with clients in settings that differ from that in which the test was originally developed and validated.

Methods to Establish Equivalence of Scores

Historically, a number of statistical methods have been used to establish the equivalence of scores emerging from a translated test. Four techniques are noted in this section: exploratory factor analysis, structural equation modeling (including confirmatory factor analysis), regression analysis, and item-response theory. Cook, Schmitt, and Brown (1999) provide a far more detailed description of these techniques. Individual items that are translated or adapted from one language to another should also be subjected to item bias (differential item functioning, or DIF) analyses. Holland and Wainer (1993) have provided an excellent resource on DIF techniques, and van de Vijver and Leung (1997) devote the better part of an outstanding chapter (pp. 62–88) specifically to the use of item bias techniques. Allalouf et al. (1999) and Budgell et al. (1995) are other fine examples of this methodology in the literature.
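The item bias (DIF) analyses mentioned above can be carried out in several ways; one widely used procedure for dichotomously scored items is the Mantel-Haenszel method, in which examinees from the original-language (reference) and target-language (focal) groups are matched on total test score and the odds of answering the studied item correctly are compared within each score stratum. The sketch below is a minimal illustration of that computation; the function and data names are hypothetical, and operational analyses would rely on a dedicated psychometric package and handle sparse strata more carefully.

```python
import numpy as np

def mantel_haenszel_dif(item_correct, group, total_score):
    """Mantel-Haenszel DIF statistics for one dichotomous item.

    item_correct: 0/1 responses to the studied item
    group:        0 = reference (original-language), 1 = focal (target-language)
    total_score:  matching variable, typically the total test score
    Returns the common odds ratio and the ETS delta-scale index (MH D-DIF).
    """
    item_correct = np.asarray(item_correct)
    group = np.asarray(group)
    total_score = np.asarray(total_score)

    num, den = 0.0, 0.0
    for k in np.unique(total_score):           # stratify on the matching score
        stratum = total_score == k
        ref = stratum & (group == 0)
        foc = stratum & (group == 1)
        n_k = stratum.sum()
        if ref.sum() == 0 or foc.sum() == 0:
            continue                            # stratum carries no information
        a = item_correct[ref].sum()             # reference right
        b = ref.sum() - a                       # reference wrong
        c = item_correct[foc].sum()             # focal right
        d = foc.sum() - c                       # focal wrong
        num += a * d / n_k
        den += b * c / n_k
    alpha_mh = num / den                        # common odds ratio across strata
    mh_d_dif = -2.35 * np.log(alpha_mh)         # ETS delta metric; large absolute values flag DIF
    return alpha_mh, mh_d_dif
```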

Exploratory, Replicatory Factor Analysis

Many psychological tests, especially personality measures, have been subjected to factor analysis, a technique that has often been used in psychology in an exploratory fashion to identify dimensions or consistencies among the items composing a measure (Anastasi & Urbina, 1997). To establish that the internal relationships of items or test components hold across different language versions of a test, a factor analysis of the translated version is performed. A factor analysis normally begins with the correlation matrix of all the items composing the measure. The factor analysis looks for patterns of consistency, or factors, among the items. There are many forms of factor analysis (e.g., Gorsuch, 1983), and techniques differ in many conceptual ways. Among the important decisions made in any factor analysis are determining the number of factors, deciding whether these factors are permitted to be correlated (oblique) or forced to be uncorrelated (orthogonal), and interpreting the resultant factors. A component of the factor analysis called rotation changes the dimensions mathematically to increase interpretability. The exploratory factor analysis that bears upon the construct equivalence of two measures has been called replicatory factor analysis (RFA; Ben-Porath, 1990) and is a form of cross-validation. In this instance, the number of factors and the choice of orthogonal or oblique factors are constrained to match the solution found for the original test. In addition, a rotation of the factors is made to attempt to replicate the original solution as closely as possible; this technique is called target rotation. Once these procedures have been performed, the analysts can estimate how similar the factors are across solutions. van de Vijver and Leung (1997) provide indices that may be used for this judgment (e.g., the coefficient of proportionality). Although RFA has probably been the most widely used technique for estimating congruence (van de Vijver & Leung), it does suffer from a number of problems. One of these is simply that newer techniques, especially confirmatory factor analysis, can now perform a similar analysis while also testing whether the similarity is statistically significant through hypothesis testing. A second problem is that different researchers have not employed standard procedures and do not always rotate their factors to a target solution (van de Vijver & Leung). Finally, many studies do not compute indices of factor similarity across the two solutions and make this determination only judgmentally (van de Vijver & Leung). Nevertheless, a number of outstanding researchers (e.g., Ben-Porath, 1990; Butcher, 1996) have recommended the use of RFA to establish equivalence, and the technique has been widely used, especially in validation efforts for various adaptations of the frequently translated MMPI and the Eysenck Personality Questionnaire.
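To make the notions of target rotation and factor congruence concrete, the following sketch rotates the loading matrix from the adapted-language sample toward the original-language loadings (an orthogonal Procrustes, or target, rotation) and then computes Tucker's congruence coefficient for each factor, one proportionality-type index of the sort van de Vijver and Leung describe. The loading matrices are assumed to come from prior exploratory factor analyses, and the variable names are illustrative only.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def factor_congruence(loadings_original, loadings_adapted):
    """Rotate the adapted-version loadings toward the original solution
    (target rotation) and return Tucker's congruence coefficient per factor."""
    B = np.asarray(loadings_original)   # items x factors, original-language solution
    A = np.asarray(loadings_adapted)    # items x factors, adapted-language solution
    R, _ = orthogonal_procrustes(A, B)  # orthogonal matrix minimizing ||A @ R - B||
    A_rot = A @ R
    # Tucker's phi: column cross-products divided by the product of column norms
    phi = (A_rot * B).sum(axis=0) / np.sqrt(
        (A_rot ** 2).sum(axis=0) * (B ** 2).sum(axis=0)
    )
    return phi   # values in the mid-.90s or above are commonly read as factorial similarity
```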

Regression

Regression approaches are generally used to establish the relationships between the newly translated measure and measures with which it has traditionally correlated in the original culture. The new test can be correlated statistically with other measures, and the correlation coefficients that result may be compared statistically with similar correlation coefficients found in the original population. There may be one or more such correlated variables. When there is more than one independent variable, the technique is called multiple regression. In this case, the adapted test serves as the dependent variable, and the other measures as the independent variables. When multiple regression is used, the independent variables are used to predict the adapted test scores. Multiple regression weights the independent variables mathematically to optimally predict the dependent variable. The regression equation for the original test in the original culture may be compared with that for the adapted test; where there are differences between the two regression lines, whether in the slope or the intercept, or in some other manner, bias in the testing is often presumed.

If the scoring of the original- and target-language measures is the same, it is also possible to include cultural group membership in a multiple regression equation. Such a nominal variable is added as what is called a dummy-coded variable. In such an instance, if the dummy-coded variable receives a significant weight in the multiple regression equation, indicating that group membership predicts test scores, evidence of cultural differences across either the two measures or the two cultures may be presumed (van de Vijver & Leung, 1997).
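A minimal, hypothetical illustration of the dummy-coding strategy follows: the adapted test score is regressed on a correlated criterion measure, on a 0/1 code for group membership, and on their product (which tests whether the regression slopes differ across groups). Reliably nonzero weights for the group code or the interaction term would suggest intercept or slope differences of the kind described above. The data are simulated and the variable names are invented for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
criterion = rng.normal(size=200)                          # score on a related, established measure
group = np.repeat([0, 1], 100)                            # 0 = original culture, 1 = target culture
test_score = 10 + 4 * criterion + rng.normal(size=200)    # adapted test (no true bias built in here)

# Predictors: criterion, group dummy, and their interaction (tests slope differences)
X = np.column_stack([criterion, group, criterion * group])
X = sm.add_constant(X)                                    # adds the intercept term
fit = sm.OLS(test_score, X).fit()
print(fit.summary())                                      # inspect the group and interaction weights
```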

Structural Equation Modeling, Including Confirmatory Factor Analysis

Structural equation modeling (SEM; Byrne, 1994; Loehlin, 1992) is a more general and statistically sophisticated procedure that encompasses both factor analysis and regression analysis, and does so in a manner that permits elegant hypothesis testing. When SEM is used to perform factor analysis, it is typically called a confirmatory factor analysis, which is defined by van de Vijver and Leung (1997) as “an extension of classical exploratory factor analysis. Specific to confirmatory factor analysis is the testing of a priori specified hypotheses about the underlying structure, such as the number of factors, loadings of variables on factors, and factor correlations” (p. 99). Essentially, the factor structure found for the original-language measure is imposed as a set of constraints, data from the adapted measure are analyzed under those constraints, and a goodness-of-fit statistical test is performed.
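In notation, the logic of constraining the original-language structure and testing its fit can be sketched as follows (assuming the common maximum-likelihood formulation; the notation follows standard SEM treatments rather than any equation given in the sources cited here):

\[
\mathbf{x} = \boldsymbol{\Lambda}\boldsymbol{\xi} + \boldsymbol{\delta},
\qquad
\boldsymbol{\Sigma}(\boldsymbol{\theta}) = \boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}^{\prime} + \boldsymbol{\Theta}_{\delta},
\]

\[
F_{\mathrm{ML}} = \ln\bigl|\boldsymbol{\Sigma}(\boldsymbol{\theta})\bigr|
+ \operatorname{tr}\bigl[\mathbf{S}\,\boldsymbol{\Sigma}(\boldsymbol{\theta})^{-1}\bigr]
- \ln\bigl|\mathbf{S}\bigr| - p,
\qquad
(N-1)\,F_{\mathrm{ML}} \;\approx\; \chi^{2}.
\]

Here x is the vector of p observed item or subtest scores from the target-language sample, Λ is the loading matrix with its pattern of zero and free loadings fixed to the original-language solution, Φ is the factor covariance matrix, Θδ holds the unique variances, S is the observed covariance matrix, and N is the sample size. A nonsignificant chi-square, together with acceptable descriptive fit indices, is taken as evidence that the original structure holds in the adapted version.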

Regression approaches to relationships among a number of tests can also be studied with SEM. Elaborate models of relationships among other tests, measuring variables hypothesized and found through previous research to be related to the construct measured by the adapted test, also may be tested using SEM. In such an analysis, it is possible for a researcher to approximate the kind of nomological net conceptualized by Cronbach and Meehl (1955), and test whether the structure holds in the target culture as it does in the original culture. Such a test should be the ideal to be sought in establishing the construct equivalence of tests across languages and cultures.

Item-Response Theory

Item-response theory (IRT) is an alternative to classical psychometric true-score theory as a method for analyzing test data. Allen and Walsh (2000) and van de Vijver and Leung (1997) provide descriptions of the way that IRT may be used to compare items across two forms of a measure that differ by language. Although a detailed description of IRT is beyond the scope of this research paper, the briefest of explanations may provide a conceptual understanding of how the procedure is used, especially for cognitive tests. An item characteristic curve (ICC) is computed for each item. This curve has as the x axis the overall ability level of test takers, and as the y axis, the probability of answering the question correctly. Different IRT models have different numbers of parameters, with one-, two-, and three-parameter models most common. These parameters correspond to difficulty, discrimination, and the probability of getting the answer correct by chance, respectively. The ICCs are plotted as normal ogive curves. When a test is adapted, each translated item may be compared across languages graphically, by overlaying the two ICCs, as well as mathematically, by comparing the item parameters. If there are differences, these may be considered conceptually. This method, too, may be considered as one technique for identifying item bias.
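As an illustration of how a translated item's ICC might be compared with the original, the sketch below evaluates the three-parameter logistic model (the logistic form commonly used in place of the normal ogive) for both language versions across a range of ability values and summarizes the discrepancy between the two curves. The item parameters shown are invented for the example; in practice they would come from separate calibrations placed on a common scale.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC:
    P(correct | theta) = c + (1 - c) / (1 + exp(-1.7 * a * (theta - b)))."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

theta = np.linspace(-3, 3, 121)                       # range of ability levels

# Hypothetical parameters: discrimination a, difficulty b, pseudo-guessing c
p_original = icc_3pl(theta, a=1.2, b=0.0, c=0.20)     # original-language calibration
p_adapted = icc_3pl(theta, a=1.2, b=0.6, c=0.20)      # translated item appears harder (b shifted)

# Average absolute gap between the curves; large values flag items whose
# functioning changed in translation and deserve conceptual review.
print(round(float(np.mean(np.abs(p_original - p_adapted))), 3))
```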

Methods to Establish Linkage of Scores

Once the conceptual equivalence of an adapted measure has been met, researchers and test developers often wish to provide measurement-unit and metric equivalence, as well. For most measures, this requirement is met through the process of test equating. As noted throughout this research paper, merely translating a test from one language to another, even if cultural biases have been eliminated, does not insure that the two different-language forms of a measure are equivalent. Conceptual or construct equivalence needs to be established first. Once such a step has been taken, then one can consider higher levels of equivalence. The mathematics of equating may be found in a variety of sources (e.g., Holland & Rubin, 1982; Kolen & Brennan, 1995), and Cook et al. (1999) provide an excellent integration of research designs and analysis for test adaptation; research designs for such studies are abstracted in the following paragraphs.

Sireci (1997) clarified three experimental designs that can be used to equate adapted forms to their original-language scoring systems and, perhaps, norms. He refers to them as (a) the separate-monolingual-groups design, (b) the bilingualgroup design, and (c) the matched-monolingual-groups design. A brief description of each follows.

Separate-Monolingual-Groups Design

In the separate-monolingual-groups design, two different groups of test takers are involved, one from each language or cultural group. Although some items may simply be assumed to be equivalent across both tests, data can be used to support this assumption. These items serve as what is known in equating as anchor items. IRT methods are then generally used to calibrate the two tests to a common scale, most typically the one used by the original-language test (Angoff & Cook, 1988; O’Brien, 1992; Sireci, 1997). Translated items must then be evaluated for invariance across the two different-language test forms; that is, they are assessed to determine whether their difficulty differs across forms. This design does not work effectively if the two groups actually differ, on average, on the characteristic that is assessed (Sireci); in fact, in such a situation, one cannot disentangle differences in the ability measured from differences in the two measures. The method also assumes that the construct measured is based on a single, unidimensional factor. Measures of complex constructs, then, are not good prospects for this method.
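One simple way to carry out the calibration just described is to estimate item parameters separately in each language and then use the anchor items to place the target-language parameters on the original-language scale. The mean-sigma method sketched below is a basic linking procedure offered only as an illustration; the anchor difficulties are hypothetical, and operational work typically prefers more robust characteristic-curve methods such as Stocking-Lord.

```python
import numpy as np

def mean_sigma_link(b_anchor_target, b_anchor_original):
    """Linking constants A and B from anchor-item difficulties (mean-sigma method)."""
    bt = np.asarray(b_anchor_target, dtype=float)
    bo = np.asarray(b_anchor_original, dtype=float)
    A = bo.std(ddof=1) / bt.std(ddof=1)
    B = bo.mean() - A * bt.mean()
    return A, B

# Hypothetical anchor-item difficulties estimated separately in each language
b_target = np.array([-1.1, -0.3, 0.2, 0.9, 1.4])
b_original = np.array([-1.0, -0.2, 0.4, 1.0, 1.6])
A, B = mean_sigma_link(b_target, b_original)

# Transform target-language parameters (and abilities) to the original scale:
b_linked = A * b_target + B          # difficulties:     b* = A * b + B
# a_linked = a_target / A            # discriminations:  a* = a / A
# theta_linked = A * theta + B       # person abilities: theta* = A * theta + B
print(A, B, b_linked)
```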

Bilingual-Group Design

In the bilingual-group design, a single group of bilingual individuals takes both forms of the test in counterbalanced order. An assumption of this method is that the individuals in the group are all equally bilingual, that is, equally proficient in each language. In Maldonado and Geisinger (in press), all participants were first tested for competence in both Spanish and English to gain entry into the study. Even under such restrictive circumstances, however, a ceiling effect made a true assessment of equality impossible. The problem of finding equally bilingual test takers is almost insurmountable. Also, if knowledge of what is on the test in one language affects performance on the other test, it is possible to use two randomly assigned groups of bilingual individuals (where their level of language skill is equated via randomization). In such an instance, it is possible either to give each group one of the tests or to give each group one-half of the items (counterbalanced) from each test in a nonoverlapping manner (Sireci, 1997). Finally, one must question how representative such equally bilingual individuals are of the target population; the external validity of the sample is therefore a concern.

Matched-Monolingual-Groups Design

This design is conceptually similar to the separate-monolingual-groups design, except that in this case the study participants are matched on the basis of some variable expected to correlate highly with the construct measured. By being matched in this way, the two groups are made more equal, which reduces error. “There are not many examples of the matched monolingual group linking design, probably due to the obvious problem of finding relevant and available matching criteria” (Sireci, 1997, p. 17). The design is nevertheless an extremely powerful one.

Conclusion

Psychology has been critiqued as having a Euro-American orientation (Moreland, 1996; Padilla & Medina, 1996). Moreland wrote,

Koch (1981) suggests that American psychologists . . . are trained in scientific attitudes that Kimble (1984) has characterized as emphasizing objectivity, data, elementism, concrete mechanisms, nomothesis, determinism, and scientific values. Dana (1993) holds that multicultural research and practice should emanate from a human science perspective characterized by the opposite of the foregoing terms: intuitive theory, holism, abstract concepts, idiography, indeterminism, and humanistic values. (p. 53)

Moreland believed that this dichotomy was a false one. Nevertheless, he argued that a balance of the two approaches was needed to understand cultural issues more completely. One of the advantages of cross-cultural psychology is that it challenges many of our preconceptions of psychology. It is often said that one learns much about one’s own language when learning a foreign tongue. The analogy for psychology is clear.

Assessment in cross-cultural psychology emphasizes an understanding of the context in which assessment occurs. The notion that traditional understandings of testing and assessment have focused solely on the individual can be tested in this discipline. Cross-cultural and multicultural testing help us focus upon the broader systems of which the individual is but a part.

Hambleton (1994) stated,

The common error is to be rather casual about the test adaptation process, and then interpret the score differences among the samples or populations as if they were real. This mindless disregard of test translation problems and the need to validate instruments in the cultures where they are used has seriously undermined the results from many cross cultural studies. (p. 242)

This research paper has shown that tests that are adapted for use in different languages and cultures need to be studied for equivalence. There are a variety of types of equivalence: linguistic equivalence, functional equivalence, conceptual or construct equivalence, and metric equivalence. Linguistic equivalence requires sophisticated translation techniques and an evaluation of the effectiveness of the translation. Functional equivalence requires that those translating the test be aware of cultural issues in the original test, in the construct, in the target culture, and in the resultant target test. Conceptual equivalence requires a relentless adherence to a construct-validation perspective and the conduct of research using data from both original and target tests. Metric equivalence, too, involves careful analyses of the test data. The requirements of metric equivalence may not be met in many situations regardless of how much we would like to use scoring scales from the original test with the target test.

If equivalence is one side of the coin, then bias is the other. Construct bias, method bias and item bias can all influence the usefulness of a test adaptation in detrimental ways. The need for construct-validation research on adapted measures is reiterated; there is no more critical point in this research paper. In addition, however, it is important to replicate the construct validation that had been found in the original culture with the original test. Factor analysis, multiple regression, and structural equation modeling permit researchers to assess whether conceptual equivalence is achieved.

The future holds much promise for cross-cultural psychology and for testing and assessment within that subdiscipline of psychology. There will be an increase in the use of different forms of tests in both the research and the practice of psychology. In a shrinking world, it is clearer that many psychological constructs are likely to hold for individuals around the world, or at least throughout much of it. Knowledge of research from foreign settings and in foreign languages is much more accessible than in the recent past. Thus, researchers may take advantage of theoretical understandings, constructs, and their measurement from leaders all over the world. In applied settings, companies such as Microsoft are already fostering a world in which tests (such as for software literacy) are available in dozens of languages. The costs of test development are so high that adapting and translating assessment materials can make professional assessment cost-effective even in developing nations, where the benefits of psychological testing are likely to be highest. Computer translations of language are advancing rapidly. As this sentence is being written, we are not yet there; human review for cultural and language appropriateness continues to be needed. Yet in the time it will take for these pages to be printed and read, these words may have already become an anachronism.

The search for psychological universals will continue, as will the search for cultural and language limitations on these characteristics. Psychological constructs, both of major import and of more minor significance, will continue to be found that do not generalize to different cultures. The fact that the world is shrinking because of advances in travel and communications does not mean we should assume it is necessarily becoming more Western, more American. To do so is, at best, pejorative.

These times are exciting, both historically and psychometrically. The costs in time and money to develop new tests in each culture are often prohibitive. Determination of those aspects of a construct that are universal and those that are culturally specific is critical. These are new concepts for many psychologists; we have not defined cultural and racial concepts carefully and effectively and we have not always incorporated these concepts into our theories (Betancourt & López, 1993; Helms, 1992). Good procedures for adapting tests are available and the results of these efforts can be evaluated. Testing can help society and there is no reason for any country to hoard good assessment devices. Through the adaptation procedures discussed in this research paper they can be shared.

Appendix

Guidelines of the International Test Commission for Adapting Tests (van de Vijver & Leung, 1997, and Hambleton, 1999)

The initial guidelines relate to the testing context, as follows.

  1. Effects of cultural differences that are not relevant or important to the main purposes of the study should be minimized to the extent possible.
  2. The amount of overlap in the constructs in the populations of interest should be assessed.

The following guidelines relate to test translation or test adaptation.

  1. Instrument developers/publishers should ensure that the translation/adaptation process takes full account of linguistic and cultural differences among the populations for whom the translated/adapted versions of the instrument are intended.
  2. Instrument developers/publishers should provide evidence that the language used in the directions, rubrics, and items themselves as well as in the handbook [is] appropriate for all cultural and language populations for whom the instrument is intended.
  3. Instrument developers/publishers should provide evidence that the testing techniques, item formats, test conventions, and procedures are familiar to all intended populations.
  4. Instrument developers/publishers should provide evidence that item content and stimulus materials are familiar to all intended populations.
  5. Instrument developers/publishers should implement systematic judgmental evidence, both linguistic and psychological, to improve the accuracy of the translation/ adaptation process and compile evidence on the equivalence of all language versions.
  6. Instrument developers/publishers should ensure that the data collection design permits the use of appropriate statistical techniques to establish item equivalence between the different language versions of the instrument.
  7. Instrument developers/publishers should apply appropriate statistical techniques to (a) establish the equivalence of the different versions of the instrument and (b) identify problematic components or aspects of the instrument which may be inadequate to one or more of the intended populations.
  8. Instrument developers/publishers should provide information on the evaluation of validity in all target populations for whom the translated/adapted versions are intended.
  9. Instrument developers/publishers should provide statistical evidence of the equivalence of questions for all intended populations.
  10. Nonequivalent questions between versions intended for different populations should not be used in preparing a common scale or in comparing these populations.

However, they may be useful in enhancing content validity of scores reported for each population separately. [emphasis in original]

The following guidelines relate to test administration.

  1. Instrument developers and administrators should try to anticipate the types of problems that can be expected and take appropriate actions to remedy these problems through the preparation of appropriate materials and instructions.
  2. Instrument administrators should be sensitive to a number of factors related to the stimulus materials, administration procedures, and response modes that can moderate the validity of the inferences drawn from the scores.
  3. Those aspects of the environment that influence the administration of an instrument should be made as similar as possible across populations for whom the instrument is intended.
  4. Instrument administration instructions should be in the source and target languages to minimize the influence of unwanted sources of variation across populations.
  5. The instrument manual should specify all aspects of the instrument and its administration that require scrutiny in the application of the instrument in a new cultural context.
  6. The administration should be unobtrusive, and the examiner-examinee interaction should be minimized. Explicit rules that are described in the manual for the instrument should be followed.

The final grouping of guidelines relates to documentation that is suggested or required of the test publisher or user.

  1. When an instrument is translated/adapted for use in another population, documentation of the changes should be provided, along with evidence of the equivalence.
  2. Score differences among samples of populations administered the instrument should not be taken at face value. The researcher has the responsibility to substantiate the differences with other empirical evidence. [emphasis in original]
  3. Comparisons across populations can only be made at the level of invariance that has been established for the scale on which scores are reported.
  4. The instrument developer should provide specific information on the ways in which the sociocultural and ecological contexts of the populations might affect performance on the instrument and should suggest procedures to account for these effects in the interpretation of results.

Bibliography:

  1. Allalouf, A., Hambleton, R. K., & Sireci, S. G. (1999). Identifying the causes of DIF in translated verbal items. Journal of Educational Measurement, 36, 185–198.
  2. Allen, J., & Walsh, J. A. (2000). A construct-based approach to equivalence: Methodologies to cross-cultural/multicultural personality assessment research. In R. H. Dana (Ed.), Handbook of cross-cultural and multicultural personality assessment (pp. 63– 85). Mahwah, NJ: Erlbaum.
  3. Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
  4. Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508– 600). Washington, DC: American Council on Education.
  5. Angoff, W. H., & Cook, L. L. (1988). Equating the scores of the “Prueba de Aptitud Academica” and the “Scholastic Aptitude Test” (Report No. 88-2). New York: College Entrance Examination Board.
  6. Ben-Porath, Y. S. (1990). Cross-cultural assessment of personality: The case of replicatory factor analysis. In J. N. Butcher & C. D. Spielberger (Eds.), Advances in personality assessment (Vol. 8, pp. 27–48). Hillsdale, NJ: Erlbaum.
  7. Berry, J. W. (1980). Introduction to methodology. In H. C. Triandis & J. W. Berry (Eds.), Handbook of cross-cultural psychology (Vol. 2, pp. 1–28). Boston: Allyn and Bacon.
  8. Betancourt, H., & López, S. R. (1993). The study of culture, ethnicity, and race in American psychology. American Psychologist, 48, 629–637.
  9. Bracken, B. A., & Barona, A. (1991). State of the art procedures for translating, validating, and using psychoeducational tests in cross-cultural assessment. School Psychology International, 12, 119–132.
  10. Bracken, B. A., Naglieri, J., & Bardos, A. (1999, May). Nonverbal assessment of intelligence: An alternative to test translation and adaptation. Paper presented at the International Conference on Test Adaptation, Washington, DC.
  11. Brislin, R. W. (1970). Back translation for cross-cultural research. Journal of Cross-Cultural Psychology, 1, 185–216.
  12. Brislin, R. W. (1980). Translation and content analysis of oral and written material. In H. C. Triandis & J. W. Berry (Eds.), Handbook of cross-cultural psychology, Vol. 2: Methodology (pp. 389–444). Needham Heights, MA: Allyn and Bacon.
  13. Brislin, R. W. (1986). The wording and translation of research instruments. In W. J. Lonner & J. W. Berry (Eds.), Field methods in cross-cultural research (pp. 137–164). Newbury Park, CA: Sage.
  14. Brislin, R. W. (1993). Understanding culture’s influence on behavior. New York: Harcourt Brace.
  15. Budgell, G. R., Raju, N. S., & Quartetti, D. A. (1995). Analysis of differential item functioning in translated assessment instruments. Applied Psychological Measurement, 19, 309–321.
  16. Butcher, J. N. (1996). Translation and adaptation of the MMPI-2 for international use. In J. N. Butcher (Ed.), International adaptations of the MMPI-2 (pp. 393–411). Minneapolis: University of Minnesota Press.
  17. Butcher, J. N., & Han, K. (1998). Methods of establishing cross-cultural equivalence. In J. N. Butcher (Ed.), International adaptations of the MMPI-2: Research and clinical applications (pp. 44–63). Minneapolis: University of Minnesota Press.
  18. Butcher, J. N., & Pancheri, P. (1976). A handbook of cross-cultural MMPI research. Minneapolis: University of Minnesota Press.
  19. Byrne, B. M. (1989). A primer of LISREL: Basic applications and programming for confirmatory factor analytic models. New York: Springer.
  20. Byrne, B. M. (1994). Structural equation modeling with EQS and EQS/Windows: Basic concepts, applications and programming. Thousand Oaks, CA: Sage.
  21. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validity by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
  22. Cattell, R. B. (1940). A culture-free intelligence test. Journal of Educational Psychology, 31, 176–199.
  23. Cattell, R. B. (1963). Theory of fluid and crystallized intelligence: A critical experiment. Journal of Educational Psychology, 54, 1–22.
  24. Cattell, R. B., & Cattell, A. K. S. (1963). A culture fair intelligence test. Champaign, IL: Institute for Personality and Ability Testing.
  25. Clark, L. A. (1987). Mutual relevance of mainstream and crosscultural psychology. Journal of Consulting and Clinical Psychology, 55, 461–470.
  26. Cole, M., & Bruner, J. S. (1971). Cultural differences and inferences about psychological processes. American Psychologist, 26, 867– 876.
  27. Cook, L., Schmitt, A., & Brown, C. (1999, May). Adapting achievement and aptitude tests: A review of methodological issues. Paper presented at the International Conference on Test Adaptation, Washington, DC.
  28. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
  29. Cuéllar, I. (2000). Acculturation as a moderator of personality and psychological assessment. In R. H. Dana (Ed.), Handbook of cross-cultural and multicultural personality assessment (pp. 113–130). Mahwah, NJ: Erlbaum.
  30. Dana, R. H. (1993). Multicultural assessment perspectives for professional psychology. Boston: Allyn and Bacon.
  31. Domino, G. (2000). Psychological testing: An introduction. Upper Saddle River, NJ: Prentice Hall.
  32. Durán, R. P. (1988). Validity and language skills assessment: Non-English background students. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 105–127). Hillsdale, NJ: Erlbaum.
  33. Embretson, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–193.
  34. Flaugher, R. J. (1978). The many definitions of test bias. American Psychologist, 33, 671–679.
  35. Geisinger, K. F. (1992). Fairness and selected psychometric issues in the psychological assessment of Hispanics. In K. F. Geisinger (Ed.), Psychological testing of Hispanics (pp. 17–42). Washington, DC: American Psychological Association.
  36. Geisinger, K. F. (1994). Cross-cultural normative assessment: Translation and adaptation issues influencing the normative interpretation of assessment instruments. Psychological Assessment, 6, 304–312.
  37. Geisinger, K. F. (2002). Testing the members of an increasingly diverse society. In J. F. Carlson & B. B. Waterman (Eds.), Social and personal assessment of school-aged children: Developing interventions for educational and clinical use (pp. 349–364). Boston: Allyn and Bacon.
  38. Geisinger, K. F., & Carlson, J. F. (1998). Training psychologists to assess members of a diverse society. In J. Sandoval, C. L. Frisby, K. F. Geisinger, J. D. Schueneman, & J. R. Grenier (Eds.), Test interpretation and diversity: Achieving equity in assessment (pp. 375–386). Washington, DC: American Psychological Association.
  39. Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Erlbaum.
  40. Hambleton, R. K. (1994). Guidelines for adapting educational and psychological tests: A progress report. European Journal of Psychological Assessment, 10, 229–244.
  41. Hambleton, R. K. (1999). Issues, designs, and technical guidelines for adapting tests in multiple languages and cultures (Laboratory of Psychometric and Evaluative Research Report No. 353). Amherst: University of Massachusetts, School of Education.
  42. Handel, R. W., & Ben-Porath, Y. S. (2000). Multicultural assessment with the MMPI-2: Issues for research and practice. In R. H. Dana (Ed.), Handbook of cross-cultural and multicultural personality assessment (pp. 229–245). Mahwah, NJ: Erlbaum.
  43. Helms, J. E. (1992). Why is there no study of cultural equivalence in standardized cognitive ability testing? American Psychologist, 47, 1083–1101.
  44. Holland, P. W., & Rubin, D. B. (Eds.). (1982). Test equating. New York: Academic Press.
  45. Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.
  46. Hulin, C. L. (1987). A psychometric theory of evaluations of item and scale translations: Fidelity across languages. Journal of Cross-Cultural Psychology, 18, 721–735.
  47. Kamin, L. J. (1974). The science and politics of I.Q. Potomac, MD: Erlbaum.
  48. Keitel, M. A., Kopala, M., & Adamson, W. S. (1996). Ethical issues in multicultural assessment. In L. A. Suzuki, P. J. Meller, & J. G. Ponterotto (Eds.), Handbook of multicultural assessment: Clinical, psychological and educational applications (pp. 28– 48). San Francisco: Jossey-Bass.
  49. Kolen, M. J., & Brennan, D. B. (1995). Test equating: Methods and practices. New York: Springer.
  50. Loehlin, J. C. (1992). Latent variable models: An introduction to factor, path, and structural equations analysis. Hillsdale, NJ: Erlbaum.
  51. Lonner, W. J. (1979). Issues in cross-cultural psychology. In A. J. Marsella, Tharp, & T. Ciborowski (Eds.), Perspectives on cross-cultural psychology (pp. 17–45). New York: Academic Press.
  52. Maldonado, C. Y., & Geisinger, K. F. (in press). Conversion of the Wechsler Adult Intelligence Scale into Spanish: An early test adaptation effort of considerable consequence. In R. K. Hambleton, P. F. Merenda, & C. D. Spielberger (Eds.), Adapting educational and psychological tests for cross-cultural assessment. Hillsdale, NJ: Erlbaum.
  53. Malgady, R. G. (1996). The question of cultural bias in assessment and diagnosis of ethnic minority clients: Let’s reject the null hypothesis. Professional Psychology: Research and Practice, 27, 33–73.
  54. Marín, G. (1992). Issues in the measurement of acculturation among Hispanics. In K. F. Geisinger (Ed.), The psychological testing of Hispanics (pp. 235–272). Washington, DC: American Psychological Association.
  55. Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012–1027.
  56. Messick, S. (1989) Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–104). New York: American Council on Education/Macmillan.
  57. Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
  58. Moreland, K. L. (1996). Persistent issues in multicultural assessment of social and emotional functioning. In L. A. Suzuki, P. J. Meller, & J. G. Ponterotto (Eds.), Handbook of multicultural assessment: Clinical, psychological and educational applications (pp. 51–76). San Francisco: Jossey-Bass.
  59. Nichols, D. S., Padilla, J., & Gomez-Maqueo, E. L. (2000). Issues in the cross-cultural adaptation and use of the MMPI-2. In R. H. Dana (Ed.), Handbook of cross-cultural and multicultural personality assessment (pp. 247–266). Mahwah, NJ: Erlbaum.
  60. O’Brien, M. L. (1992). A Rasch approach to scaling issues in testing Hispanics. In K. F. Geisinger (Ed.), Psychological testing of Hispanics (pp. 43–54). Washington, DC: American Psychological Association.
  61. Olmedo, E. L. (1979). Acculturation: A psychometric perspective. American Psychologist, 34, 1061–1070.
  62. Padilla, A. M. (1992). Reflections on testing: Emerging trends and new possibilities. In K. F. Geisinger (Ed.), The psychological testing of Hispanics (pp. 271–284). Washington, DC: American Psychological Association.
  63. Padilla, A. M., & Medina, A. (1996). Cross-cultural sensitivity in assessment: Using tests in culturally appropriate ways. In L. A. Suzuki, P. J. Meller, & J. G. Ponterotto (Eds.), Handbook of multicultural assessment: Clinical, psychological and educational applications (pp. 3–28). San Francisco: Jossey-Bass.
  64. Pennock-Roman, M. (1992). Interpreting test performance in selective admissions for Hispanic students. In K. F. Geisinger (Ed.), The psychological testing of Hispanics (pp. 99–136). Washington, DC: American Psychological Association.
  65. Pike, K. L. (1967). Language in relation to a unified theory of the structure of human behavior. The Hague, The Netherlands: Mouton.
  66. Ritzler, B. A. (1996). Projective techniques for multicultural personality assessment: Rorschach, TEMAS and the Early Memories Procedure. In L. A. Suzuki, P. J. Meller, & J. G. Ponterotto (Eds.), Handbook of multicultural assessment: Clinical, psychological and educational applications (pp. 115–136). San Francisco: Jossey-Bass.
  67. Sandoval, J. (1998). Test interpretation in a diverse future. In J. Sandoval, C. L. Frisby, K. F. Geisinger, J. D. Scheuneman, & J. R. Grenier (Eds.), Test interpretation and diversity: Achieving equity in assessment (pp. 387–402). Washington, DC: American Psychological Association.
  68. Sattler, J. M. (1992). Assessment of children (Revised and updated 3rd ed.). San Diego, CA: Author.
  69. Schmitt, A. P. (1988). Language and cultural characteristics that explain differential item functioning for Hispanic examinees on the Scholastic Aptitude Test. Journal of Educational Measurement, 25, 1–13.
  70. Sireci, S. G. (1997). Problems and issues in linking assessments across languages. Educational Measurement: Issues and Practice, 16, 12–17.
  71. Steele, C. M. (1997). A threat in the air: How stereotypes shape intellectual identity and performance. American Psychologist, 52, 613–629.
  72. Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69, 797–811.
  73. Sue, S., Keefe, K., Enomoto, K., Durvasula, R., & Chao, R. (1996). Asian American and White college students’ performance on the MMPI-2. In J. N. Butcher (Ed.), International adaptations of the MMPI-2: Research and clinical applications (pp. 206–220). Minneapolis: University of Minnesota Press.
  74. Sundberg, N. D., & Gonzales, L. R. (1981). Cross-cultural and cross-ethnic assessment: Overview and issues. In P. McReynolds (Ed.), Advances in psychological assessment (Vol. 5, pp. 460–541). San Francisco: Jossey-Bass.
  75. Suzuki, L. A., Vraniak, D. A., & Kugler, J. F. (1996). Intellectual assessment across cultures. In L. A. Suzuki, P. J. Meller, & J. G. Ponterotto (Eds.), Handbook of multicultural assessment: Clinical, psychological and educational applications (pp. 141–178). San Francisco: Jossey-Bass.
  76. Tanzer, N. K., Gittler, G., & Sim, C. Q. E. (1994). Cross-cultural validation of item complexity in an LLTM-calibrated spatial ability test. European Journal of Psychological Assessment, 11, 170–183.
  77. Triandis, H. C., Malpass, R. S., & Davidson, A. (1971). Cross-cultural psychology. In B. J. Siegel (Ed.), Biennial review of anthropology (pp. 1–84). Palo Alto, CA: Annual Reviews.
  78. van de Vijver, F. (2000). The nature of bias. In R. H. Dana (Ed.), Handbook of cross-cultural and multicultural personality assessment (pp. 87–106). Mahwah, NJ: Erlbaum.
  79. van de Vijver, F., & Leung, K. (1997). Methods and data analysis for cross-cultural research. Thousand Oaks, CA: Sage.
  80. van de Vijver, F., & Poortinga, Y. H. (1982). Cross-cultural generalization and universality. Journal of Cross-Cultural Psychology, 13, 387–408.
  81. van de Vijver, F., & Poortinga, Y. H. (1991). Testing across cultures. In R. K. Hambleton & J. N. Zaal (Eds.), Advances in educational and psychological testing (pp. 277–308). Boston: Kluwer Academic.
  82. van de Vijver, F. J. R., & Poortinga, Y. H. (1997). Towards an integrated analysis in cross-cultural assessment. European Journal of Psychological Assessment, 13, 29–37.
  83. Werner, O., & Campbell, D. T. (1970). Translating, working through interpreters, and the problem of decentering. In R. Naroll & R. Cohen (Eds.), A handbook of method in cultural anthropology (pp. 398–419). New York: American Museum of Natural History.
  84. Yansen, E. A., & Shulman, E. L. (1996). Language testing: Multicultural considerations. In L. A. Suzuki, P. J. Meller, & J. G. Ponterotto (Eds.), Handbook of multicultural assessment: Clinical, psychological and educational applications (pp. 353–394). San Francisco: Jossey-Bass.