This sample education research paper on standardized tests features: 6500 words (approx. 22 pages) and a bibliography with 17 sources.
What is a standardized test? People often think about #2 pencils, stuffy classrooms, and high-stakes tests when they think about standardized testing. We see in the media that standardized tests are a hallmark of the No Child Left Behind (2001) era. As such, much of the coverage of standardized tests involves accusations of bias or arguments that tests take time away from teaching. We refer the reader to Section XVI of Volume 2 (Federal, State, and Community Policies) for arguments for and against these views. Our purpose is to focus on the nature of standardized tests. We hope that in learning about standardized testing, readers can become critical consumers of testing-based statistics and arguments. For further insight into common misperceptions about tests and data interpretation, see Bracey (2006).
What does it actually mean for a test to be standardized? Cronbach (1960) argued that standardized tests were those in which the conditions and content were equal for all examinees. He defined a standardized test as “one in which the procedure, apparatus, and scoring have been fixed so that precisely the same test can be given at different times and places” (p. 22). Standardizing testing conditions and content is meant to increase the reliability of examinees’ scores by reducing sources of error extraneous to the abilities or skills being measured. For example, if examinees were given different directions for completing the test (e.g., to guess versus to leave a question blank when the correct answer is unknown), some differences in scores could be the result of directions rather than ability. Standardization attempts to reduce this possibility by holding as many factors as possible constant in testing.
Nearly 40 years later, the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999) reflects a shift away from a focus on equal content, but a continued emphasis on equal conditions. According to the Standards, standardization is a form of test administration designed to maintain “a constant testing environment” such that the test is conducted “according to detailed rules and specifications” (p. 182).
The testing community has also seen shifts in the level at which conditions are held equal. Sometimes it is necessary to provide accommodations to particular examinees. A new definition of standardization reflecting these changes exists today. What has remained constant across the changing definitions of standardization, however, is a focus on the purpose of standardization: to ensure fairness.
Placing Standardization on a Continuum
The strict definition of standardization proposed by Cronbach was never completely realized. Standardized test conditions suggest fixed administration procedures, but as Brennan (2006) argues, “It is particularly important to understand that psychometrics is silent with respect to which conditions of measurement, if any, should be fixed” (p. 9). Standardized administration conditions may or may not include time allotted, materials used (e.g., calculators, #2 pencils), and instructions given. To keep all such conditions equal would be unlikely in the real world. Most testing occurs in classrooms, where teachers likely respond to the different needs or questions of their students. Further, the early definition of standardization suggested that content, like conditions, should also be identical for all examinees. Equality of content may mean that the same items are given or it may mean that the same content domain is covered for all examinees. Test designers need to specify which conditions and content should be equal.
Thus the definition of standardization seems straightforward, but the application can be fairly complicated. Imagine a continuum of standardization with Cronbach’s definition (identical content and conditions) at one end and variations in content or conditions at the other. Many standardized tests today fall closer to the second description.
Identifying Variations in Content
Content can vary when individually administered tests are adapted to the examinee. For example, early intelligence tests often used basals and ceilings. Basals and ceilings are generally used in tests meant to span a developmental age range or a range of material or ability. The practice continues today in tests like the Test of Early Mathematics Ability, 3rd edition (TEMA-III). A test administrator begins the test at the item expected to be appropriate for an examinee of a particular age or grade. A ceiling is reached when the examinee answers a set number of consecutive items (5 on the TEMA) incorrectly. A basal is established when the examinee answers a set number of consecutive items (again 5 on the TEMA) correctly. If a basal has not been determined by the time the examinee reaches a ceiling, the test administrator will move backwards through the test beginning at the entry point until a basal is established (or until the lowest item has been given). The examinee does not complete items below the basal or above the ceiling, and the total score reflects the assumption that he or she would have answered questions below the basal correctly and questions above the ceiling incorrectly.
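The basal and ceiling procedure can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the scoring routine of the TEMA or any published test: the function names, the 30-item test length, and the deterministic examinees are hypothetical, and items are assumed to be numbered from easiest (1) to hardest.

```python
def has_run(given, run, value):
    """True if `given` contains `run` consecutive items all scored `value`."""
    streak = 0
    for item in sorted(given):
        streak = streak + 1 if given[item] == value else 0
        if streak >= run:
            return True
    return False


def administer(would_answer_correctly, n_items, entry, run=5):
    """Simulate one administration; return (items given, raw score).

    Assumes 1 <= entry <= n_items. `would_answer_correctly` is a
    hypothetical stand-in for the examinee: item number -> True/False.
    """
    given = {}
    # Move forward from the entry point until a ceiling or the last item.
    item = entry
    while item <= n_items and not has_run(given, run, False):
        given[item] = would_answer_correctly(item)
        item += 1
    # Move backward from the entry point until a basal or item 1 is reached.
    item = entry - 1
    while item >= 1 and not has_run(given, run, True):
        given[item] = would_answer_correctly(item)
        item -= 1
    # Items below the lowest item given are credited as correct;
    # items above the ceiling are never given, so they add nothing.
    credited = min(given) - 1
    return sorted(given), credited + sum(given.values())


# Hypothetical examinee who can answer items 1-12 and nothing harder.
steady = lambda item: item <= 12
# Hypothetical examinee with the same skills except for a gap at item 3.
uneven = lambda item: item <= 12 and item != 3

print(administer(steady, n_items=30, entry=10))
print(administer(uneven, n_items=30, entry=10))
```

In this sketch both examinees take only items 8 through 17 and receive the same raw score, although the second examinee would have missed item 3 had it been given. The example shows how the score depends on the assumption that everything below the basal would have been answered correctly.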
Test designers incorporate basals and ceilings to shorten the length of the test and minimize examinees’ frustration when asked to answer questions that are too difficult or boredom when asked to answer questions that are too easy. Additionally, items at either end of the difficulty spectrum do not provide useful information about an examinee’s ability level. Consider the implications of the use of basals and ceilings for the definition of standardization. When basals and ceilings are used, there is no guarantee that examinees are given the same items. Even within a relatively small age range, there may be examinees who do not take all the same items. To the extent that content differs from item to item, content is not standardized.
Furthermore, the interpretation of examinees’ total scores reflects assumptions about success or failure. Assumptions about student scores below basals and above ceilings are only as good as the sequence of test items. If the test is constructed to be used with basals and ceilings, the items must progress from easy to difficult. If items were calibrated on a representative sample, the test designers had the opportunity to put items in the necessary sequence—for that sample of students. Ordering items puts restrictions on the ability of test developers and content theorists to vary material (e.g., imagine a math test in which all addition problems were in the beginning so that students whose basals were higher than this level were not tested at all on addition). The correct sequence for one group may vary widely from another. Consider these issues: What if states or instructors teach material in a different order? What if the test has been translated? What if students from different socioeconomic or cultural backgrounds have differential exposure to particular content or themes? These questions point to the importance of understanding how similar the current sample of examinees is to the sample who took the items for sequencing, because the difficulty order of items may not be comparable across all groups.
Content may differ for individual examinees for other reasons. In computerized adaptive testing, groups of examinees can be given a test via computers, but the items they receive can be individually adapted so that examinees take a larger proportion of items with difficulties around their own ability levels. We distinguish computerized adaptive testing (CAT) from computer-based testing (CBT), a test given via a computer. CBT can be highly standardized because the test is administered the same way; all examinees are provided the same instructions and materials in every administration. Our interest lies in CATs, which attempt to estimate the examinee’s ability more efficiently (with fewer items) than traditional tests. CAT developers create large item banks with items calibrated to span the ability range of the population to be tested. CATs use items ordered from easy to difficult on a scale, and examinees can be ordered from less proficient to more proficient on the same scale depending on their performance on the calibrated items. For item-level CATs, one can assume that an examinee who answers correctly is at least as proficient as the item was difficult. One can also assume that an examinee who answers incorrectly is not as proficient as the item was difficult. The computer adapts, responding with more difficult items when examinees answer correctly and easier items when examinees answer incorrectly. One benefit of this process is that it generally takes fewer items to establish an examinee’s level of proficiency. Traditional tests must have a range of item difficulties that covers the ability range of the population for which the test is designed; examinees are given all of these items, even those well outside their ability range.
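The adaptive logic described above can be illustrated with a deliberately simplified sketch. We assume a hypothetical bank of 21 items with difficulties from 0 to 100 and an examinee who answers correctly whenever his or her proficiency is at least the item difficulty. Operational CATs rely on probabilistic item response models and content constraints instead; the halving rule used here is our own simplification and is not the algorithm of any particular testing program.

```python
def run_cat(bank, answers_correctly):
    """Give items until proficiency is bracketed between two adjacent items.

    `bank` is a list of item difficulties on a common scale (hypothetical).
    `answers_correctly` is a stand-in for the examinee: difficulty -> bool.
    Returns the administered (difficulty, correct) pairs and the bounds
    (hardest item passed, easiest item failed).
    """
    low, high = float("-inf"), float("inf")
    administered = []
    while True:
        # Only items between the current bounds can still be informative.
        candidates = sorted(d for d in bank if low < d < high)
        if not candidates:
            break
        item = candidates[len(candidates) // 2]
        correct = answers_correctly(item)
        administered.append((item, correct))
        if correct:
            low = item   # examinee is at least this proficient: go harder
        else:
            high = item  # examinee is below this difficulty: go easier
    return administered, (low, high)


bank = list(range(0, 105, 5))      # 21 hypothetical items: 0, 5, ..., 100
examinee = lambda difficulty: difficulty <= 62   # hypothetical proficiency of 62

administered, bounds = run_cat(bank, examinee)
print(administered)  # [(50, True), (80, False), (65, False), (60, True)]
print(bounds)        # (60, 65)
```

Under these assumptions four items place the examinee between difficulties 60 and 65, whereas a traditional fixed form would present all 21 items. Note also that a different examinee would see a different set of four items, which is the source of the standardization questions taken up next.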
There are a number of consequences associated with CAT efficiency. One consequence is that each item is scored before the next one is given, and the examinee cannot return to previous items should a later item trigger information helpful in solving a previous item. Another consequence is that since items are presented based on difficulty and the examinee’s performance, there is less control over content and order than test designers would normally have (Hendrickson, 2007). Examinees are not expected to be at the same level; the items needed to estimate their abilities are not expected to entirely overlap. Yet, designers of large-scale group tests still attempt to standardize content by creating item banks with items matched on content but varying in difficulty. Thus in CAT, as in the use of basals and ceilings, items taken by different examinees are not identical. Test developers must take steps to ensure that examinees have similar exposure to content. The more closely a test and curriculum are aligned, the more difficult it is for test developers to ensure similar content, which is why adaptive testing is more often seen with tests designed to measure general abilities or skills developed over an extended time.
Multistage adaptive tests are a response to perceived problems with item-level CATs. Like item-level CATs, multistage adaptive tests are more efficient than conventional tests because the items given are selected based on the examinee’s performance. The difference is that multistage adaptive tests adapt less often, after groups of items called testlets. Examinees can review previous items within a testlet and the content coverage of testlets can be more easily balanced (Hendrickson, 2007).
Now consider the implications of individual adaptations for standardization. We will use an example relevant for college students, the Graduate Record Examination (GRE). Students pursuing graduate education are often required to take the GRE General Test, which is composed of three sections: Verbal, Quantitative, and Analytical Writing. The first CAT versions of the GRE were given in 1993 (Zwick, 2006). Examinees who are administered the test by computer take a CAT version of the Verbal and Quantitative sections. The Analytical Writing section is composed of two essays that may be computer administered, but are not computer graded. An examinee’s score on the two CAT sections depends on performance on the items and also on whether all items were completed in the given time (http://www.ets.org). The GRE is thus a speeded test and is standardized in terms of time allotted.
There are ways in which the GRE is not standardized. For example, examinees take the GRE on computer, but where computer facilities are not available, examinees complete a paper test. In 1999, the paper test was discontinued for students in the United States (Zwick, 2006), but paper formats are still used in some countries (http://www.ets.org). The format is therefore not standardized, but the test remains speeded regardless of format.
Items are not the same for all examinees because the computer version of the GRE is a CAT. Which questions an examinee receives depends on performance on previous items and on considerations of the test designers. As noted with basals and ceilings, test designers must attempt to take into account content coverage. It would not be fair for a person to receive a high score on the Quantitative section by completing only algebra problems while another person must successfully complete algebra, geometry, and interpretation from graphs to receive the same score. Test developers must ensure that content is similar for all examinees.
On the paper GRE, an examinee’s raw score is the number of correct responses from the total number of items given. The paper raw scores and the CAT scores are converted to scaled scores via equating (http://www.ets.org), a process that takes into account the item difficulties to ensure that individuals of similar ability receive similar scores even if they took different items. (Equating technically produces what are known as tau-equivalent scores rather than equivalent scores, because CAT scores should be more reliable than paper scores.) Consider the implications of scoring: Imagine two high-performing students take the GRE. The first student takes the CAT version while the second student takes the paper test. Both perform well, but the consequences of missing an item for the second student are more severe than they are for the first student. Why would this be? The CAT responds to high performance by providing more difficult items. The second student’s test cannot adapt. In other words, it is likely given their performance that the first student receives a more difficult test than the second student. Missing more difficult items is less costly for the first student because scoring reflects item difficulty. At first glance, it may not seem fair; it may even seem like the first student receives more chances to do well on the test. But consider this: If the second student misses items on a medium-difficulty test designed for the majority of examinees, chances are he or she would not have been given as many high-difficulty items on a CAT. The test is fair, because scoring takes into account the different items given.
Identifying Variations in Administration
With Cronbach’s definition, standardization provides all examinees with identical opportunities to demonstrate their abilities on a test. We have already seen that standardization today no longer means identical items, but rather comparable content over different items. What about identical chance for success? One thing standardization does not (cannot) do is equalize the skills that examinees bring to exams. Typically, equalization is something that test designers do not wish to do. Examinees with more expertise or ability have a better chance of scoring well. Tests are designed to differentiate examinees for some purpose. If the purpose is to determine which examinees have particular knowledge or skills, it would not be helpful for all examinees to score the same; however, sometimes test users need to be concerned with imposing fairness on what examinees bring with them to a test. Legislation regarding testing has highlighted the importance of testing accommodations as a means to ensure fairness rather than equality of testing conditions.
In Breimhorst v. Educational Testing Service (2001), a student sued ETS after his score on the Graduate Management Admission Test (GMAT) was flagged with an asterisk because he used an accommodation for extended time on the test. He argued that the flag, which ETS used to signify that his score may not be comparable to other scores due to the use of nonstandard conditions, caused graduate programs to interpret his performance differently than they would the performance of other examinees. The flag, in his view, resulted in unfair bias in decision making. To deny the student accommodations would have been unfair. The Breimhorst settlement also indicated that identifying his score as different was unfair. Even though test conditions were not standardized, the score-based decisions should follow the same criteria as were used for other examinees. Presence of a flag reduced that possibility. The crux of the case was the content being tested. Unlike the GRE, the GMAT does not consider speed to be part of the content domain being measured. If the student took longer, but completed comparable content, the scores should be interpreted the same way.
A few cautionary statements should be made, however. First, if time is truly not meant to be considered part of the content domain being measured, then why set a time limit at all? The Breimhorst settlement is only valid if time really does not affect scores. If students who do not need accommodations would see improved performance with extended time, then time is part of the domain being measured and the issue of flagging must be revisited (or all students should be given as much time as they feel necessary to complete the exam). Unfortunately, research into the effects of accommodations is not regularly undertaken. Second, the individuals who set accommodations (e.g., the individuals responsible for crafting a student’s Individualized Education Plan) may not be well-versed in the psychometric considerations associated with standardization. Thus, it would be beneficial for both measurement researchers and accommodations experts to work together to better understand the role of accommodations in standardized testing.
Clearly the language of standardization has shifted from equality to fairness. The shift is paralleled in the Standards (AERA, APA, NCME, 1999), with 12 standards directly related to fairness in testing. Of these, “. . . six standards refer to interpretation, reporting, or use of test scores; three to differential measurement or prediction; and three to equity or sensitivity” (Camilli, 2006, p. 226). Of particular interest for understanding the shift from equality to fairness in testing conditions is Standard 7.12, which states, “The testing or assessment process should be carried out so that examinees receive comparable and equitable treatment during all phases of the testing or assessment process” (AERA, APA, NCME, 1999, p. 84). The conditions of a test must be made fair for all examinees through providing accommodations (such as extended time or use of Braille) to the extent possible without infringing on the content domain being measured. Steps should be taken to account for variance extraneous to the purpose of the assessment in ways that are fair (even when these steps require conditions that are not equal).
The new definition of standardization reflects a preference for fairness over equality. Items examinees receive may not be identical, but to be fair, all examinees should be tested on similar content. Likewise, conditions may differ, but only in ways that result in fair assessment of the relevant abilities or skills that examinees bring to tests.
Interpreting Scores on Standardized Tests
Understanding Norm-Referenced and Standards-Based Approaches
Two approaches to making an examinee’s performance meaningful are common: norm-referenced and standards-based comparisons. In the norm-referenced approach, the examinee’s score represents an indication of how his or her performance compares to the performances of other examinees in one or more comparison groups of interest (called norm groups). Depending on the purpose of the assessment, test users may wish to compare examinees to a single group or to several groups. For interpreting performance on an intelligence test, the group of interest is a national sample of individuals of the same age as the examinee. For interpreting the score on an achievement test, one may want comparisons with several groups. One of those groups may be a national group, and other groups may be local students or students who have had similar preparation for that particular achievement area. Comparisons are made through the use of a reported score, often a percentile rank (PR) representing the percentage of individuals in the particular group whose scores are below the examinee’s score. Mid-interval PRs, combining the percentage below with half of the percentage obtaining the same score as the examinee, are also common. On an intelligence test, the PR is usually transformed to an IQ, a standardized score scale with a mean of 100 and a standard deviation of 15.
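The two percentile rank definitions can be made concrete with a short sketch. The norm group scores below are invented for illustration, and the conversion from a PR to an IQ-type score assumes a normalizing transformation through the standard normal distribution, which is one common approach but not necessarily the one used by any specific test publisher.

```python
from statistics import NormalDist


def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring below `score`."""
    below = sum(s < score for s in norm_scores)
    return 100 * below / len(norm_scores)


def mid_interval_pr(score, norm_scores):
    """Percentage below, plus half of the percentage with the same score."""
    below = sum(s < score for s in norm_scores)
    same = sum(s == score for s in norm_scores)
    return 100 * (below + 0.5 * same) / len(norm_scores)


def pr_to_standard_score(pr, mean=100, sd=15):
    """Map a PR (strictly between 0 and 100) onto a mean-100, SD-15 scale.

    Assumes scores are normalized: the PR is converted to a z score with
    the inverse of the standard normal distribution function.
    """
    z = NormalDist().inv_cdf(pr / 100)
    return mean + sd * z


# Hypothetical norm group of 10 examinees.
norm_group = [10, 12, 12, 15, 15, 15, 18, 20, 20, 25]

print(percentile_rank(15, norm_group))   # 30.0
print(mid_interval_pr(15, norm_group))   # 45.0
print(round(pr_to_standard_score(50)))   # 100
```

The same raw score of 15 yields a PR of 30 under the first definition and 45 under the mid-interval definition, so a reader needs to know which definition a score report uses, especially when many examinees share a score.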
With the standards-based approach, the student’s score is compared with standards of performance that have been set before examination to indicate a subjective judgment of the quality of the performance. Rather than comparing a score with other individuals, this approach is based on comparing a score with desired levels of achievement established by a panel of experts and associated with verbal or numeric labels. Different panels of experts and different methods of standard setting result in different definitions of levels such as proficient or failing. For example, the set of levels used for the National Assessment of Educational Progress (NAEP) includes Basic, Proficient, and Advanced. Arizona has defined four levels to be used with their standards-based test: Falls Far Below the Standard, Approaches the Standard, Meets the Standard, and Exceeds the Standard. The levels and meanings used by states for their testing programs differ from each other and from the NAEP. The No Child Left Behind Act (NCLB) requires states to define a label indicating a student meets the state standard for proficiency.
The differences between norm-referenced and standards-based score interpretations should not be ignored, and many newspaper accounts of school performances are misunderstood because knowledge of these differences is lacking. For example, when comparing how well students achieved in one school district in Arizona, a newspaper informed readers that norm-referenced scores demonstrated that students were performing above national average in mathematics and below average in reading. When state standards-based levels of achievement were reported, the same newspaper informed readers that more students were meeting or exceeding standards in reading than in mathematics. As a result of these different interpretations, the judgment about which subject was more problematic for Arizona students depended on which account was cited. Such differences may result from different tests being used, but in this case it was due to using different approaches to reporting outcomes. The state’s standards-based levels were not aligned with the national norm comparisons. The state levels for “Meets the Standard” in reading were set much lower than for mathematics (Sabers & Powers, 2005). That is, students performing about average in both reading and mathematics according to national norms were assigned the label “Approaches the Standard” in mathematics and “Meets the Standard” in reading according to the state’s standards.
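A small sketch shows how the same norm-referenced standing can receive different standards-based labels. The four labels are Arizona's, as given above, but the cut scores are entirely hypothetical values that we express as national percentile ranks for convenience; they are not the actual Arizona cut scores.

```python
from bisect import bisect_right

LABELS = [
    "Falls Far Below the Standard",
    "Approaches the Standard",
    "Meets the Standard",
    "Exceeds the Standard",
]

# Hypothetical cut scores (minimum national PR needed to reach each of the
# three higher levels). Reading cuts are set lower than mathematics cuts.
CUTS = {
    "reading": [30, 45, 70],
    "mathematics": [35, 60, 85],
}


def performance_label(subject, national_pr):
    """Return the standards-based label for a given national PR."""
    # bisect_right counts how many cut scores the student has reached.
    return LABELS[bisect_right(CUTS[subject], national_pr)]


# A student exactly at the national average (PR of 50) in both subjects.
print(performance_label("reading", 50))      # Meets the Standard
print(performance_label("mathematics", 50))  # Approaches the Standard
```

With these invented cut scores, a student who is average on national norms in both subjects meets the standard in reading but only approaches it in mathematics. The labels reflect where the panels placed the cuts as much as they reflect what the student can do.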
Selecting Norms: National Sampling Projects and Test Users
The use of a national norm group is so common that some people have used the term norm-referenced test to refer to a standardized test where the results are reported as national percentile ranks. But it is not the test that is norm referenced; rather, it is the interpretation of the scores. It is important to note that it is difficult to obtain a sample that truly represents the national performance on any test. A large “anchor test study” (see Linn, 1975) comparing the norms for achievement tests demonstrated that there were substantial differences among the norms reported for some tests. Although each set of norms represented the results of a national sampling project, reported student achievement was different depending upon which groups were used for comparison. Baglin (1981) offered a possible explanation: self-selection bias, the result of schools accepting or declining an invitation to be part of the standardization group for a particular test, caused the samples to differ from each other and from a true national sample. For example, some schools appear more likely to participate in a norming study when the test company is also the company they rely on for textbooks. In contrast, the personnel of other schools believe they are already doing too much testing and do not wish to participate in additional testing for test companies (or for NAEP).
There have been other attacks on national test norms as well. Cannell (1988) suggested that the reports on student performance given by school systems and states tended to be overly optimistic, and coined the phrase Lake Wobegon effect to describe the similarity to that mythical location in Garrison Keillor’s stories where all of the students were above average. One explanation for how well students seemed to be performing when compared to national norms was that the national sample data included scores from unmotivated students who did not see a meaningful reason to participate. Examples of teachers telling students during standardization that this test won’t really count perhaps did more than reduce student anxiety. In other words, the administration conditions and the subsequent performance of students when the test is being given for norming purposes simply may not be comparable to the administration conditions and student performance observed for the same test when there are high stakes for students, schools, and states. During norming, students may see the test as practice and therefore not make genuine attempts to do their best. Students may be more motivated to perform well when high stakes are involved. The extent to which this explanation is legitimate was never subjected to rigorous study, but these types of considerations should be a part of score interpretation.
If national norms are not really national, what is the proper approach to obtain better norm-referenced interpretations? User norms are a popular alternative, as these norms represent students who have similar reasons for taking the test. With user norms, motivated students who know they will be evaluated based on their performance produce the scores that are collected for the norm group. Tests reported with national samples of user norms include the GRE, ACT, and SAT. Another example of user norms is the local norm, or the sample consisting of the examinees in a local setting, such as when the comparison group consists of all students in a particular grade in an entire school district or state.
Comparing Examinees Then and Now: Interpreting the GRE
The GRE is an excellent example of combining both approaches to reporting scores. The test is intended to be used by graduate programs, providing scores that can give additional information beyond undergraduate grade-point averages and other measures used in the selection of students for admission, fellowships, and scholarships. The test is taken by a subset of college students who intend to apply to graduate school; so comparing examinees to a national sample of college students would not be useful. Rather, the interpretation of scores is made more meaningful by comparison to user norms—the performances of other examinees who are also aspiring to graduate school admission.
A performance on the GRE General Test is given meaning by two different comparisons. The first comparison, for the Verbal and Quantitative sections, is with a score that ranges from 200 to 800 in 10-point intervals. This score has been used for half a century and is retained for purposes of continuity across decades. Originally, the score was described as centered on 500 with a standard deviation of 100 for each of the two sections of the test, but the current averages and standard deviations are different. The other section of the GRE has changed through the years and is currently called Analytical Writing, with scores reported on a standards-based scale ranging in half-point intervals from 0 to 6. The second comparison for all three sections is a PR describing the percentage of examinees over a recent 3-year period who scored lower than the specified score. The 3-year period provides a database of well over 1 million examinees and allows for a current comparison that is quite different from the reported score based on the original scaling group. Thus, one type of score allows for longitudinal comparisons across years while the other type is more useful for comparing the status of current applicants. One can see from the average scores that Verbal scores are now substantially lower and Quantitative scores much higher for current examinees than for the original test development sample. One can also see from the current PRs how each individual student compares with the other applicants for admission to a graduate program.
ETS (2006) provides information showing that the performance on the Verbal and Quantitative sections must be interpreted differently. Although originally the scores on these two sections were concordant, there is a great discrepancy in the performances based on the June 2002 to July 2005 sample (the most recent group). The top score (800) on Verbal surpasses over 99% of the users in the group, whereas the top score on Quantitative (800) surpasses only 94%. The average score on Verbal is 467; on Quantitative it is 591. Many people interpreting scores are misled by these shifts if they use the original 500 as their concept of average.
Most students take the GRE as a CAT, and the advantages of the computer version are easily seen with the GRE. As we noted, students are not expected to take items that do not make an important contribution to determining their score. But not every item presented to a student is scored as part of the test. New items may be included to calibrate scale values before being included in the regular item bank. The students taking the test do not know which set of items is experimental, and thus are motivated to provide realistic data on the item difficulty. With a large item bank, the content of the scored items is balanced so that the meaning of the scores does not differ greatly because of item selection. Thus GRE as a CAT appears very efficient for large samples of examinees, and the design allows meaningful comparison of scores both longitudinally and with the current examinees.
Changing Populations: Recentering the SAT
Due to changing populations of examinees over decades, the SAT faced similar problems with the scale scores as those seen with the GRE. Originally, the mean of the SAT scores equaled 500 for Verbal and for Math, but by 1990 the average Verbal scores were about 50 points lower than the Math scores. Why would this be? The populations of students taking the SAT (and the GRE) changed. Educators have explained the curriculum differences that may have contributed to the change in test scores—students taking more mathematics courses and placing less emphasis on reading and vocabulary. Other population changes may result from schools encouraging students who are less similar to the original examinee populations to take the tests.
The decision was made to recenter the SAT to make the scale more meaningful for current examinees. Recentering resulted in the new score scale again having an average of 500 on each portion of the SAT (Dorans, 2002). Before the recentering, many individuals believed they were lower in Verbal than in Math ability (even if they performed near the average on both sections) because they misunderstood how the averages had changed over time. Dorans reports that recentering the SAT scores improved test users’ interpretation of student performance, but shifts in the average scores on each section are already occurring. By 2006, the newly named SAT Reasoning Test scores averaged 518 for Math and 503 for Critical Reading (formerly Verbal). Recentering may be necessary again in the future. It may be important for the GRE to be recentered in the same way as the SAT, and test developers may wish to realign the GRE as well. Realignment would spread out the scores at the top of the scale; recall in 2006, 6% of students in the GRE Quantitative sample obtained the top score.
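The basic idea of recentering can be sketched as a rescaling that moves the current average back to 500. The linear formula below is our own simplification for illustration and should not be read as the procedure actually used for the SAT. The means of 467 and 591 are the GRE averages reported above; the standard deviations of 120 and 150 are hypothetical placeholders, because the actual values are not given here.

```python
def recenter(score, old_mean, old_sd, new_mean=500, new_sd=100,
             lowest=200, highest=800, step=10):
    """Linearly rescale a score so the current mean maps to `new_mean`.

    Simplifying assumption: a straight-line conversion, rounded to the
    reporting interval and clipped to the reporting range.
    """
    z = (score - old_mean) / old_sd          # standing relative to current examinees
    rescaled = new_mean + new_sd * z         # same standing on the recentered scale
    rounded = step * round(rescaled / step)  # report in 10-point intervals
    return max(lowest, min(highest, rounded))


# Average current examinees land at 500 on both recentered sections.
print(recenter(467, old_mean=467, old_sd=120))  # 500
print(recenter(591, old_mean=591, old_sd=150))  # 500
```

After such a rescaling, an examinee who is average on both sections sees the same number on both, which removes the temptation to read a 467 and a 591 as a real difference in ability. The original scores can no longer be compared directly with the recentered ones, however.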
A disadvantage of recentering is that the longitudinal comparison of test scores is no longer evident with new scores. The SAT Reasoning Test can be used to compare score changes over the past decade and a half, but the longer-term comparisons were lost with the recentering. The test developer must weigh the advantages and disadvantages of every change. An advantage of having tests on different scales or having different averages for sections of the tests is that people might be more likely to consider each test as being different from others. When scores on different tests are comparable, the tests might be considered to be replacements for each other. For example, many people interpret an IQ as a measure of intelligence without considering what test was taken, a problem because different intelligence tests measure different traits. The sophisticated reader of test information must understand what a test measures as well as how scores are to be interpreted.
Understanding the Labels in Standards-Based Approaches
Given the difficulty of interpreting score performance based on norm comparisons, one might wonder whether it would be simpler to report standards-based information. But what does a label like proficient mean? The National Center for Education Statistics (2007) has compared the performance of students across states by mapping state standards to the NAEP scale for reading and mathematics. The definition of proficient is established independently by each state, and the percentage of students classified as proficient depends on that definition. The more challenging the level established by the state, the lower the percentage of students who will be identified as proficient. Typically, the degree of achievement necessary to reach the level of proficient is lower by the state definition than by the NAEP standard. In the 2005 sample for comparing Grade 8 reading scores, North Carolina had the lowest standard, accompanied by an estimated 88% proficiency for its students. South Carolina had the highest standard, accompanied by an estimated 30% proficiency. Because of the differing definitions, one cannot determine which state has the better readers.
Furthermore, a state can change the percentage of students at the proficient level drastically by altering the difficulty of the standards. In the 2003 Grade 8 mathematics comparison, Arizona had a very high standard associated with the “Meets the Standard” label (actually, one point above the NAEP “Proficient” cut score) that was accompanied by a disappointing 21% proficiency. In the 2005 comparison, Arizona had dropped its standard almost to the level of the NAEP cut score for “Basic,” the lower category, and reported 61% proficiency. This change in the percentage of students reaching proficiency, because it accompanies a change in standards, cannot be interpreted meaningfully; one cannot determine whether the students tested in 2005 performed better or worse than the students tested 2 years earlier. Naive readers, however, may think the greater percentage of students reported as “Meets the Standard” indicates improvement in the state’s educational system.
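A small sketch may make the point concrete: when the cut score moves, the percentage labeled proficient moves with it even if student performance is exactly the same. The scores and the two cut scores below are hypothetical and are not the Arizona or NAEP values.

```python
def percent_proficient(scores, cut_score):
    """Percentage of scores at or above the cut score."""
    meeting = sum(1 for s in scores if s >= cut_score)
    return 100.0 * meeting / len(scores)

# Hypothetical scale scores for one unchanged group of ten students
scores = [250, 258, 263, 270, 274, 281, 285, 292, 300, 310]

high_cut = 299  # hypothetical demanding standard
low_cut = 265   # hypothetical relaxed standard

print(percent_proficient(scores, high_cut))  # 20.0
print(percent_proficient(scores, low_cut))   # 70.0
```

The jump from 20% to 70% here reflects nothing but the change in the standard, which is why percentages reported under different cut scores cannot be read as evidence of improvement or decline.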
Combining Approaches: The GRE and ACT
The ambiguity of the label proficient across states, grades, and subjects demonstrates the need for more than just standards-based reporting. The GRE Analytical Writing section is a good example of the combination of approaches to allow meaningful test interpretation. With the standards-based approach, the GRE has provided score level descriptions to help the reader understand what each score means on this section of the general test. For example, for scores 4 and 3.5, the description is: “Provides competent analysis of complex ideas; develops and supports main points with relevant reasons and/or examples; is adequately organized; conveys meaning with reasonable clarity; demonstrates satisfactory control of sentence structure and language usage but may have some errors that affect clarity” (ETS, 2006, p. 23). In addition, each score level (in half-point increments) is associated with a percentile rank to give a norm-referenced interpretation based on a current sample of over 1 million examinees. For the October 1, 2002, to June 30, 2005, sample, 32% of the examinees scored lower than 4.0; 17% scored lower than 3.5. The use of both standards-based and norm-based score information in this example provides users with a much clearer understanding of what examinees’ scores mean than would be possible with only one approach.
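The norm-referenced half of this combination rests on a simple calculation: the percentage of examinees in the comparison sample who scored lower than a given score. The following sketch assumes that definition, as used in the figures above; the 20 scores are hypothetical and far smaller than the actual GRE sample.

```python
def percentile_rank(score, all_scores):
    """Percentage of examinees who scored lower than the given score."""
    lower = sum(1 for s in all_scores if s < score)
    return 100.0 * lower / len(all_scores)

# Hypothetical Analytical Writing scores in half-point increments
sample = [2.0, 2.5, 3.0, 3.0, 3.5, 3.5, 3.5, 4.0, 4.0, 4.0,
          4.0, 4.5, 4.5, 4.5, 4.5, 5.0, 5.0, 5.0, 5.5, 6.0]

print(percentile_rank(4.0, sample))  # 35.0
print(percentile_rank(3.5, sample))  # 20.0
```

The score-level description stays the same whatever the sample, whereas the percentile rank for a score depends entirely on who else was tested.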
Another example of combining approaches to make score interpretation more meaningful is the use of benchmarks with the ACT. According to the 2006 ACT High School Profile Report, “A benchmark score is the minimum score needed on an ACT subject-area test to indicate a 50% chance of obtaining a B or higher or about a 75% chance of obtaining a C or higher in the corresponding credit-bearing college course” (p. 6). In 2006, the percentage of ACT-tested students ready for college-level coursework by this criterion ranged from 69% in English composition to 27% in biology. Such information can help counselors assist students in making decisions about their preparation for college.
States may also choose to combine approaches in their state assessments. These dual-purpose assessments embed items from a nationally standardized test in a state-developed standards-based test to provide both national PRs and standards-based scores for interpretation. The advantage of this combination is that it allows schools to spend less time testing students than would be required to administer both a nationally normed test and a state standards-based test. However, fewer nationally standardized items are given than standards-based items, the nationally standardized items are not given in their intended context, and the validity of this combination has not been documented.
Understanding the Unit of Comparison
When reporting performance, one should compare groups to groups and individuals to individuals. A misuse of norms is often encountered in newspaper reports when the average performance of all students in a school is compared with norms based on individual scores. Because individual scores have much more variability than do group means, the result of comparing a school average with individual students is that all schools appear nearer to average than they actually are. This issue is not related to the Lake Wobegon effect mentioned earlier, but is more a central tendency effect; the truly extreme-scoring schools (both those far above and far below the mean) are not identified as being as different from average as they really are.
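The size of this effect can be sketched by looking up the same school average in two sets of norms. The example assumes, purely for illustration, that both individual scores and school averages are normally distributed around 500, with standard deviations of 100 and 40 respectively; these values and the school average of 560 are hypothetical.

```python
from statistics import NormalDist

# Hypothetical norms: same mean, but school averages vary less than individuals
individual_norms = NormalDist(mu=500, sigma=100)
school_norms = NormalDist(mu=500, sigma=40)

school_average = 560  # hypothetical school mean

# Percentile rank of the school average under each set of norms
pr_vs_individuals = 100 * individual_norms.cdf(school_average)
pr_vs_schools = 100 * school_norms.cdf(school_average)

print(round(pr_vs_individuals))  # 73
print(round(pr_vs_schools))      # 93
```

Under these assumptions, a school that outperforms about 93% of schools would appear to be only at about the 73rd percentile if its average were looked up in norms built from individual students.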
Some tests are devised only for group interpretations. The NAEP does not report scores for individual students; however, NAEP data provide ample opportunity to compare subgroups. The NAEP sampling method allocates different samples of items to students within a school. By carefully selecting and recording demographic characteristics of each examinee, NAEP allows performance comparisons for racial/ethnic and gender groups on various subsets of items. If an adequate representative sample of schools within each state participated in the NAEP program, it would not be difficult to use the test as a national test for comparing states. These comparisons would be fairer to the states than comparisons made with the volunteer samples that take the ACT or SAT.
One needs to know a great deal about the tests and the samples before interpreting score differences, partly because the groups are not equally represented in these samples. For example, comparisons of gender groups with the ACT and SAT produce different results. In the 2006 ACT national sample, there were 646,688 females and only 517,563 males (3% of the students did not report gender). If the additional examinees are lower-scoring students, as is often the case, then the picture presented by the obtained scores is misleading. On the SAT, males score higher than females on both Critical Reading and Math, whereas on the ACT males score higher on Math, Science, and Total and females score higher on English and Reading. Females score higher on Writing on both tests. In general, different students take the ACT and the SAT. Differences in the populations and in the content of the two tests contribute to the score differences observed between males and females.
Issues of fairness in testing have resulted in a new definition of standardization. Content and administration do not need to be identical, but rather sufficiently comparable for all examinees. Fairness is also important in interpretation of test scores. Scores are interpreted with reference to comparison groups or predetermined standards for performance. Combining norm-referenced and standards-based approaches may lead to better understanding, but readers must understand what tests were designed to measure and how scores were intended to be interpreted. We believe simply gaining a better understanding of the ways of making performance meaningful would do much to aid individuals in accomplishing the primary goal of standardization: making fairer decisions based on test information.
Bibliography
- ACT. (n.d.). ACT high school profile report: The graduating class of 2006, national. Retrieved from http://www.act.org/content/dam/act/unsecured/documents/Natl-Scores-2006-National2006.pdf
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
- Baglin, R. F. (1981). Does “nationally” really mean nationally? Journal of Educational Measurement, 18, 97-107.
- Bracey, G. W. (2006). Reading educational research: How to avoid getting statistically snookered. Portsmouth, NH: Heinemann.
- Breimhorst v. Educational Testing Service, Settlement Agreement, Case No. 99-3387 (N.D. Cal. 2001).
- Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 1-16). Westport, CT: Praeger.
- Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 221-256). Westport, CT: Praeger.
- Cannell, J. J. (1988). Nationally normed elementary achievement testing in America’s public schools: How all 50 states are above the national average. Educational Measurement: Issues and Practice, 7, 5-9.
- Cronbach, L. J. (1960). Essentials of psychological testing (2nd ed.). New York: Harper & Row.
- Dorans, N. J. (2002). Recentering and realigning the SAT score distributions: How and why. Journal of Educational Measurement, 39, 59-84.
- Educational Testing Service. (2006). GRE Graduate Record Examinations 2006-2007. Retrieved from https://www.ets.org/Media/Tests/GRE/pdf/gre_0809_factors_2006-07.pdf
- Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement: Issues and Practice, 26, 44-52.
- Linn, R. L. (1975). Anchor test study: The long and the short of it. Journal of Educational Measurement, 12, 201-214.
- National Center for Education Statistics. (2007). Mapping 2005 state proficiency standards onto the NAEP scales (NCES 2007-482). Washington, DC: Author.
- No Child Left Behind Act of 2001, Pub. L. No. 107-110 (2002). Retrieved from https://www.congress.gov/107/plaws/publ110/PLAW-107publ110.htm
- Sabers, D., & Powers, S. (2005). The condition of assessment of student learning in Arizona: 2005. In D. R. Garcia & A. Molnar (Eds.), The condition of pre-K-12 education in Arizona: 2005 (pp. 9.1-9.15). Tempe, AZ: Educational Policy Studies Laboratory.
- Zwick, R. (2006). Higher education admissions testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 647-679). Westport, CT: Praeger.