Test Administration Research Paper

The meaning of an individual’s score on an educational or psychological test depends on the test being administered according to the same specified procedures, so that every test-taker faces the same situation. A test administered under specified conditions is said to be standardized. Without standardization, scores from the test will be less accurate, and comparability of test scores will be uncertain. Details of administration depend on the format and content of the test.

1. Standardization

Standardized directions help to ensure that all test takers understand how much time will be allowed to complete the test, whether it is permissible, or advisable, to omit responses to some test items, how to make the required responses, and how to correct inadvertent responses. Multiple choice items merely require the test taker to choose among a set of alternative answers. Constructed response items may require the test taker to write a short phrase or sentence as a response, or may require a more extended written response. Performance assessments will often require the test taker to perform some task, possibly using some equipment, under standardized conditions, with the performance evaluated by a test administrator, or recorded for later scoring. In general, a quiet environment, free of distractions, will promote appropriate test performance.

Instructions for paper-and-pencil tests are simple; multiple choice items may require the test taker to mark the answer on a separate answer sheet whereas short-answer items require writing a response on the test paper. Test takers are more likely to need practice in responding to tests that are administered by a desktop or similar computer. Items are displayed on the computer display screen, and the test taker responds by means of computer keys, a ‘mouse’ or similar indicating device. In some instances, the test-taker is asked to respond to constructed response items by typing a word or short phrase with a computer keyboard.

Test takers may need practice in navigating a computer-based test. Usually, the screen shows one item at a time, and does not proceed until some response has been made. One of the response options may be to skip or postpone the item, in which case the computer will offer the item again after the remaining items in a section have been answered. Sometimes the test taker has the option of flagging the item for later reconsideration. Item review, which cannot be controlled on paper-and-pencil tests, is more complex with computer administration and may not be permitted. Test takers prefer to have the option, but those who make use of it seldom change their previous response. When a change is made, however, it is more often a change from wrong to right than from right to wrong. The net effect of item review is an occasional score improvement, weighed against time spent in making the review. When the test is timed, and if further new items are available, the candidate would, in general, be better off attempting the new items rather than returning to items that have previously been answered.

Computer administration permits the use of items that are not feasible with paper-and-pencil tests. Most involve some aspect of temporal control or measurement. Short-term memory can be tested by presenting some material, and then asking questions about it after it is removed from the display. Some aspects of response time can also be measured. Also, test material can include moving displays, including short segments of televised scenes or activity.

1.1 Accommodations And Test Modifications

Persons with disabilities may be at a disadvantage in the testing situation. Persons with limited vision may have trouble seeing the test material; others with motor impairment may have difficulty making the required responses. Special test administrations may be offered to accommodate such persons (Willingham et al. 1988). Printed tests with large type, or large fonts on the computer display, may help persons with limited vision. A person or a computer system can read verbal items to the test taker, although items with diagrams and graphs pose more challenging problems of communication. Persons with certain motor disabilities may need help in making the appropriate responses. Persons with attention deficits, dyslexia, or similar disorders may easily be granted additional time on timed tests.

When persons are tested under nonstandard conditions, the validity of the score becomes suspect. Much validity evidence is empirical, and depends on standardized test administration. If only a few persons are offered a certain accommodation, it becomes difficult to determine whether their test score has the same meaning as a score from a standardized test. Most such evidence requires substantial numbers of cases in order to be evaluated. Opinions differ about the suitability of using test scores from accommodated tests. In large-scale operations, a score from a special accommodation may be marked, or ‘flagged’ to indicate its special status. It may be deemed appropriate to indicate the nature of the accommodation, but not the nature of the disability, although it will often be impossible to report one without revealing the other. In some venues, revealing the nature of a person’s disability may be seen as a violation of individual rights, and may be illegal.

Persons whose primary language is not the language of the test will probably be at a disadvantage in taking the test. In some circumstances, it may be appropriate to provide the test in their primary language, although test translation is not an easy matter, and requires extensive test development. An intelligence test or a questionnaire about health may elicit more valid answers if translated. On the other hand, an employment test for a position where a certain language is used as a part of the job would be appropriately given in the language of the job. (For more detail on this and other topics see American Educational Research Association et al. 1999.)

2. Test Security And Cheating

Test scores are often used to compare individuals with each other, or with established performance standards, with important consequences to the individual test taker. Some test takers may attempt to gain an advantage by obtaining knowledge of the test questions before the test is given. Thus, procedures for safeguarding the test materials before, during, and after the testing are an important aspect of test administration. Generally, test takers are forbidden from taking notes; any scratch paper or other material that they may have used during the test must be disposed of.

Obtaining advance knowledge of items is but one way of gaining an advantage over other test takers. Another is impersonation: test administrators may be asked to check identities through photographs, fingerprints, or other means of personal identification to preclude the possibility of a candidate arranging for someone else to take a test in their stead.

A variety of statistical checks may be made of test responses in an effort to detect various forms of cheating. If several persons are found to have exactly the same responses to many of the items, further investigation may be warranted. A person who answers many easy questions wrong, but who correctly answers many difficult items has, to say the least, an unusual pattern of responses, triggering further checking. If a test is offered on several occasions, and a person has taken the test more than once, large differences in the person’s scores from one time to the next would be suspicious.
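The first check mentioned above, identical responses across test takers, can be sketched in a few lines. This is a naive illustration only: the function names, the sample data, and the `min_matches` threshold are hypothetical, and an operational program would presumably use an index that allows for chance agreement rather than a raw count.

```python
from itertools import combinations


def matching_count(a, b):
    """Number of items on which two test takers gave the same response."""
    return sum(1 for x, y in zip(a, b) if x == y)


def flag_similar_pairs(responses, min_matches):
    """Return pairs of test-taker IDs whose responses agree on at least
    min_matches items. The threshold is an assumed, illustrative value."""
    flagged = []
    for (id_a, resp_a), (id_b, resp_b) in combinations(sorted(responses.items()), 2):
        if matching_count(resp_a, resp_b) >= min_matches:
            flagged.append((id_a, id_b))
    return flagged


# Hypothetical response strings for a 10-item multiple-choice test.
responses = {
    "T01": "ABCDABCDAB",
    "T02": "ABCDABCDAB",
    "T03": "BADCABCCBA",
}
print(flag_similar_pairs(responses, min_matches=9))
```

A flagged pair is only a trigger for further investigation, as the text notes, not evidence of cheating in itself.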

2.1 Test Preparation

Testing programs with high stakes have spawned a number of organizations offering special tutoring in how to take the test. Such short-term preparation (‘coaching’) may involve having clients take pseudotests that contain items like those that might be expected on the actual test. Methods are taught for coping with particular types of items. Sometimes, a variety of suggestions are offered that are not content-related, but that stress how to detect the correct alternative by special clues, such as ‘the longest alternative is probably the correct answer’. Studies of the effectiveness of these commercial test preparation programs have had mixed results, but the best estimate is that the advantage of a two- or three-week course is from about 0.1 to about 0.3 standard deviation units, which is roughly equivalent to one or two additional questions answered correctly. Longer preparation has a larger effect; multiweek courses amount to further instruction in the subject of the test, and may more properly be viewed as education.

Large testing programs, concerned lest commercial preparation courses give well-to-do candidates a special advantage, have made preparation materials available to all test candidates. This may simply be older, retired forms of the test, together with general test-taking advice. Many also offer, for small cost, computer software to provide the sort of instruction that the ‘coaching’ schools provide. Still, some candidates put themselves at a disadvantage by not using such material, being overly confident of their own proficiency.

3. Test Scoring, Norms And Performance Standards

Test standardization extends to test scoring. Responses to paper-and-pencil tests can readily be scored by computer. Constructed responses can sometimes be scored by computer, but these responses usually are scored by human scorers. In some instances, computer programs are being used to score extended written responses and essays. Preparing a detailed scoring rubric, which spells out the detailed criteria by which responses are to be evaluated, can facilitate consistent scoring, by human or computer. Human scorers can be trained on the use of the rubric, and are monitored through a variety of checks, including occasional rescoring of some responses, and statistical comparisons.

Scoring usually begins by simply counting the number of correct responses, or by summing the scores given to items when the responses are not scored simply 1 or 0. Sometimes, a penalty is imposed for incorrect answers, in an attempt to correct for the possibility that a test taker might mark the correct response to a multiple-choice item by chance. Despite voluminous evidence, no definitive statement can be made as to whether the penalty for wrong responses improves or harms the validity and reliability of the resulting scores. Omitted responses are generally given no credit, i.e., treated the same way as wrong responses.
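The two scoring rules just described can be compared in a short sketch. The penalty shown is the conventional correction for guessing, rights minus wrongs divided by one less than the number of options; the text does not say which penalty a given program uses, so this choice, the function names, and the sample data are assumptions. Under this penalty an omitted item (here `None`) earns nothing and loses nothing, so it is no longer equivalent to a wrong answer.

```python
def number_right(responses, key):
    """Count of responses that match the answer key."""
    return sum(1 for r, k in zip(responses, key) if r == k)


def formula_score(responses, key, n_options):
    """Rights minus wrongs / (n_options - 1).

    Omitted items (None) count as neither right nor wrong. This is the
    classical guessing correction, assumed here for illustration."""
    right = sum(1 for r, k in zip(responses, key) if r is not None and r == k)
    wrong = sum(1 for r, k in zip(responses, key) if r is not None and r != k)
    return right - wrong / (n_options - 1)


# Hypothetical 10-item test with five options per item.
key = list("ABCDABCDAB")
responses = ["A", "B", "C", "D", "A", "B", None, None, "C", "D"]

print(number_right(responses, key))      # 6 right, 2 omitted, 2 wrong
print(formula_score(responses, key, 5))  # 6 - 2/4
```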

Scores based on item response theory involve combining information from item responses by a more complex statistical method. In effect, more credit is earned for correct answers to more difficult questions, and more penalty is imposed for incorrect answers to easier questions (an exception is the scoring of fixed-length nonadaptive tests using the one-parameter, or Rasch, IRT model, in which case the number correct is a sufficient statistic for estimating the person’s proficiency).

Test scores have little meaning without additional information. Norms are statistical distributions of test scores made by defined groups of test takers, and provide one normative standard of comparison. External performance standards can be set with reference to test content. Some tests are designed to provide scores understandable only by professionals, with only broad categories of performance being revealed to the test taker (Angoff 1971).

4. Test Equating And Calibration

Well-established testing programs generally have several equivalent forms of the test. Each test form has different items, but all forms are constructed according to the same set of test specifications, or test blueprint, which indicates topics to be covered, item difficulty, item formats, and other aspects of item content. Still, test development procedures cannot ensure score comparability; some adjustments are necessary in the scores from a new form to make them comparable to scores from old forms. The statistical process of making scores comparable is called test equating (Feuer et al. 1999). Linear equating involves giving two forms of the test to equivalent groups of test takers, and then adjusting the score distributions to have the same mean and standard deviation. Equipercentile equating uses the same data but adjusts the score distributions to have the same shape, so that for any given test score, the same proportion of test takers earn scores that are below that score value in both tests. Anchor test equating involves administering a common segment of items along with both tests, to form the statistical link.
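Linear and equipercentile equating, as described above, can be illustrated with a small sketch. It assumes the two score lists come from equivalent groups, uses population standard deviations, and applies no smoothing or interpolation, which operational equating would normally add; the function names and the data are hypothetical.

```python
import statistics


def linear_equate(x, scores_x, scores_y):
    """Map score x on form X to the form-Y scale by matching mean and SD."""
    mean_x, mean_y = statistics.mean(scores_x), statistics.mean(scores_y)
    sd_x, sd_y = statistics.pstdev(scores_x), statistics.pstdev(scores_y)
    return mean_y + (sd_y / sd_x) * (x - mean_x)


def proportion_below(score, scores):
    """Proportion of test takers scoring strictly below the given score."""
    return sum(1 for s in scores if s < score) / len(scores)


def equipercentile_equate(x, scores_x, scores_y):
    """Return the observed form-Y score whose proportion-below is closest
    to the proportion-below of x on form X (no interpolation)."""
    target = proportion_below(x, scores_x)
    candidates = sorted(set(scores_y))
    return min(candidates, key=lambda y: abs(proportion_below(y, scores_y) - target))


# Hypothetical scores from two equivalent groups, one per form.
scores_x = [10, 12, 14, 16, 18]
scores_y = [13, 16, 19, 22, 25]

print(linear_equate(16, scores_x, scores_y))
print(equipercentile_equate(16, scores_x, scores_y))
```

In this toy data the form-Y scores are an exact linear function of the form-X scores, so both methods map a 16 on form X to the same form-Y value; with real, differently shaped distributions the two methods would generally disagree.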

5. Adaptive Testing

The use of computers to administer tests has led to the development of computer-based adaptive testing (CAT) (Wainer 1990, Sands et al. 1997). In CAT, the selection of items presented to a test-taker is shaped by the test-taker’s own responses to the test. Normally, the first question is of medium difficulty. Those responding correctly to the item will be given a more difficult item; those giving a wrong response will get an easier item. The same principle continues until a fixed number of items has been administered, or until some other criterion is reached. The net result is that each person is asked questions appropriate to his or her own proficiency level. Time is not wasted asking persons of low proficiency items that are too difficult for them, and high scorers are not asked questions that are too easy for them. The level of accuracy of scores on a traditional test can be achieved in an adaptive version of the test with from 30 percent to 50 percent fewer items.
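The up-and-down principle in the preceding paragraph can be sketched as a simple staircase. This is a deliberately simplified illustration with assumed names and a nine-level difficulty scale; it only shows the branching rule and a fixed-length stopping criterion, not the statistical item selection that operational CATs rely on.

```python
def run_staircase_test(answer_fn, n_levels=9, n_items=5):
    """Administer n_items, moving one difficulty level up after a correct
    answer and one level down after a wrong answer.

    answer_fn(level) -> bool stands in for the test taker (hypothetical).
    Returns the list of (difficulty level, correct) pairs administered."""
    level = (n_levels + 1) // 2  # start at medium difficulty
    history = []
    for _ in range(n_items):
        correct = answer_fn(level)
        history.append((level, correct))
        step = 1 if correct else -1
        level = min(n_levels, max(1, level + step))  # stay within the scale
    return history


# A simulated test taker who can answer items up to difficulty level 6.
history = run_staircase_test(lambda level: level <= 6)
print(history)
```

After the first two items the sequence oscillates around the highest level the simulated test taker can handle, which is the sense in which each person ends up with questions suited to his or her own proficiency.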

The idea of adaptive testing can be traced to Binet, the originator of the intelligence test. Some current tests of intelligence for children follow a similar procedure, adapting the level of question to the demonstrated ability of the test taker to respond correctly. Such tests are administered individually, so the test administrator can readily select subsequent items according to a specified protocol. Individual administration is impractical with large-scale testing; group administration is more efficient, but all test takers must take the same test. The use of a computer for administration means that tests can again be tailored to the test taker, without the need for a trained administrator.

CATs get favorable evaluations from test takers. In a CAT, each candidate faces items that are challenging but not overwhelming. Persons who usually do poorly on ability tests are surprised at how easy the test appeared to be. The better candidates find a CAT harder than they expected, but engaging.

5.1 CAT Procedures

Different CAT systems use different procedures for administering a CAT. Algorithms differ in how the first item is chosen, how subsequent items are chosen, when to stop the test, and how to score the test. With different test takers receiving tests of different difficulty, the score cannot be the traditional ‘number-right’ of paper-and-pencil tests, since most test takers get about the same number of items correct on a well-designed CAT. Current CATs rely on item response theory (IRT) for scoring the test responses, as well as for choosing items to present. IRT is based on the premise that all the items are measuring the same underlying proficiency. While each item is to some extent unique, the items also share some common elements. The IRT model accounts for item interrelationships with a model that specifies the form of the relationship of each item to the dimension of proficiency being measured.

CAT requires a large set of items, called collectively an item pool, or item bank. Each item is characterized by one or more parameters indicating the item’s difficulty, strength of relationship to the proficiency being assessed, and propensity of low proficiency test takers to guess correctly the right answer. The item characteristics, called ‘item parameters’, are estimated statistically from data obtained from administering the items to a group of test takers that serves as the reference group. Since an item bank contains from 200 to 1000 or more items, all items cannot be administered to a single group of persons. Instead, an elaborate design can be used to administer the items to equivalent groups, and to calibrate all the items to a common scale. The statistical calculations necessary to score a CAT can be done in the blink of an eye on modern computers, so a test score is available as soon as a person has completed taking the test. Of course, if constructed response items are included, the computer program must include an algorithm for scoring such responses immediately. Such algorithms are rapidly becoming the state of the art.
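The three item characteristics named above correspond to the parameters of what is commonly called the three-parameter logistic model. The text does not give a functional form, so the logistic expression below, the symbols `a` (strength of relationship), `b` (difficulty), and `c` (guessing), and the grid-search scoring routine are assumptions made for illustration; operational programs would typically use more refined estimators.

```python
import math


def p_correct(theta, a, b, c):
    """Probability of a correct response at proficiency theta under an
    assumed three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern given item parameters."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p_correct(theta, a, b, c)
        total += math.log(p) if u else math.log(1.0 - p)
    return total


def estimate_theta(items, responses):
    """Score the test by picking the grid point (-4.0 to 4.0 in steps of
    0.1) with the highest likelihood. A crude stand-in for real scoring."""
    grid = [i / 10 for i in range(-40, 41)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# Hypothetical calibrated items: (a, b, c) for an easy and a hard item.
items = [(1.0, -1.0, 0.0), (1.0, 1.0, 0.0)]
print(estimate_theta(items, [1, 0]))  # easy item right, hard item wrong
```

Because the estimate depends on which items were answered correctly and not only on how many, two test takers with the same number right on different items can receive different scores.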

Item selection in CAT is usually done one item at a time, but may also be done by selecting groups of items, in stages. Selection in general depends on a preliminary estimate of the test taker’s proficiency, based on the item responses already made. The next item, or group of items, is selected to be maximally informative, given the current performance. Often, many other factors must also be considered. Traditional tests that are prepared professionally conform to a detailed set of test specifications, having mainly to do with item content. Such specifications are especially important for achievement tests, which measure the extent of mastery of some domain of knowledge. Each constructed test is expected to have the same balance of content and coverage of topics. By using codes for various content and format characteristics of an item, the item selection algorithm for a CAT can strive to select items so that the content specifications are met as well as possible, as well as finding items of the appropriate difficulty for the test taker.
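One way to read ‘maximally informative’ is in terms of statistical (Fisher) information at the current proficiency estimate. The sketch below makes that assumption, uses a two-parameter logistic model without a guessing parameter for simplicity, and represents the content specifications as a set of content codes still needed. The pool, the codes, and the function names are all hypothetical.

```python
import math


def information(theta, a, b):
    """Fisher information of a two-parameter logistic item at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_next_item(theta, pool, administered, content_needed):
    """Pick the unused item with the highest information at theta among
    items whose content code is still needed; None if none qualifies.

    pool maps item_id -> (a, b, content_code)."""
    eligible = [
        item_id
        for item_id, (a, b, content) in pool.items()
        if item_id not in administered and content in content_needed
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda i: information(theta, pool[i][0], pool[i][1]))


# Hypothetical four-item pool with content codes.
pool = {
    "alg1": (1.0, -1.0, "algebra"),
    "alg2": (1.0, 0.5, "algebra"),
    "geo1": (1.5, 0.4, "geometry"),
    "geo2": (1.0, 2.0, "geometry"),
}

print(select_next_item(0.5, pool, set(), {"algebra", "geometry"}))
print(select_next_item(0.5, pool, set(), {"algebra"}))
```

Restricting the eligible set by content code before maximizing information is only one simple way to balance the two goals; it shows how a content constraint can override the item that would otherwise be most informative.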

With CAT, the problem of security is daunting, because persons are scheduled to take the test at different times. Groups of test takers can team up to each remember a few items, and later pool their memorized items. The remembered items can then be offered, possibly for profit, to persons who have yet to take the test. To diminish the effect of foreknowledge of a few items, CATs may be furnished with very large item pools, so that the chance of encountering any particular item will be very small. More elaborate strategies involve using several different item pools, formed from very large collections of items (‘item vats’). Studies show that the effect of foreknowledge of a few of the items in a large pool will, on average, provide very small improvement in test scores.


  1. Angoff W H 1971 Scales, Norms, and Equivalent Scores. Educational Testing Service, Princeton, NJ
  2. American Educational Research Association, American Psychological Association & National Council on Measurement in Education 1999 Standards for Educational and Psychological Testing. American Educational Research Association, Washington, DC
  3. Feuer M J, Holland P W, Green B F, Bertenthal M W, Hemphill F C (eds.) 1999 Uncommon Measures: Equivalence and Linkage Among Educational Tests. National Academy Press, Washington, DC
  4. Sands W A, Waters B K, McBride J R 1997 Computerized Adaptive Testing: From Inquiry to Operation, 1st edn. American Psychological Association, Washington, DC
  5. Wainer H 1990 Computerized Adaptive Testing: A Primer. L Erlbaum Associates, Hillsdale, NJ
  6. Willingham W W, Ragosta M, Bennett R E, Braun H, Rock D A, Powers D E (eds.) 1988 Testing Handicapped People. Allyn and Bacon, Boston