Performance Evaluation In Work Settings Research Paper


Job performance is probably the most important dependent variable in industrial and organizational psychology. Measures of job performance are typically employed for many research and practice applications. Examples are evaluating the effects of a training program or a job redesign effort on job performance, and assessing the validity of a selection system for predicting performance. For these and other personnel-related interventions, accurate measures of job performance are needed to assess the effectiveness of the intervention.


This research paper first introduces the concept of criterion relevance and other ‘criteria for criteria.’ The second topic discussed is the issue of whether work performance is best characterized by multiple criteria or by a composite criterion, that is, a single summary index of performance, and third, methods of measuring criterion performance are presented. Finally, what appear to be the most important future directions in describing and measuring work performance are highlighted.

1. Criterion Relevance And Other Standards For Criteria

What differentiates good from poor criterion measurement? That is, what are the most important features of good criteria? The most important is relevance. Relevance can be defined as the correspondence between criterion measures and the actual performance requirements of the job. Criterion measures should reflect all important job performance requirements. Useful in this context are the terms deficiency and contamination. Deficiency for a set of criterion measures occurs when the set is relevant to only part of the criterion domain. A job knowledge test for an insurance agent will probably fail to cover the interpersonal skill part of the job. Contamination refers to a criterion measure tapping variance in performance beyond the control of the organization member. For example, sales volume for account executives may be based in part on their performance, but also on how easy it is to sell the product in their own territory. A perfectly relevant set of criterion measures is neither contaminated nor deficient.

The second most important standard for good criterion measurement is reliability. Criterion scores should be consistent over, at least, relatively short time intervals. For example, if a criterion measure for fast food servers is ratings of customer service quality by multiple customers, we would like to see consistency (i.e., reliability) in how different customers rated the servers, with good performers in this area rated consistently high, poorer performers, consistently lower, etc. There are other criteria for criteria that some authors have mentioned (e.g., acceptability of measures to the sponsor) but relevance and reliability are clearly the most important.

2. Multiple And Composite Criteria

An issue in measuring work performance is whether a single criterion should represent the performance requirements of a job or whether multiple criteria are required. Advocates for a single composite criterion see criteria as economic in nature, whereas supporters of multiple criteria view them more as psychological constructs. Probably the best way to resolve the issue is to recognize that the purpose of performance measurement largely dictates the appropriateness of the two views (Schmidt and Kaplan 1971). For example, when making compensation decisions for a unit’s members based on performance, it is necessary to obtain a single performance score at some point for each employee. To make these decisions, we need an index of overall performance, worth to the organization, or some similar summary score. On the other hand, if the goal is to understand predictor–criterion links in personnel selection research, for example, then using multiple criteria and examining relationships between each predictor (e.g., ability, personality, etc.) and each criterion (e.g., technical proficiency) is probably most appropriate.

A related issue is the empirical question of how highly correlated multiple criteria for a job are likely to be. If correlations between criterion measures are high, then a single composite measure will be sufficient to represent the multiple criteria. On balance, research suggests that most jobs have multiple performance requirements. Performance on these multiple dimensions may be positively correlated but not so highly that distinctions between dimensions are impossible.

An attempt has been made to identify the criterion constructs underlying performance across jobs. Campbell et al. (1993) argue that eight dimensions (i.e., Job-Specific Technical Proficiency, Non-Job-Specific Technical Proficiency, Written and Oral Communication, Demonstrating Effort, Maintaining Personal Discipline, Facilitating Peer and Team Performance, Supervision/Leadership, Management/Administration) reasonably summarize the performance requirements for all jobs in the US economy. Not every job has every dimension as relevant, but the eight dimensions as a set reflect all performance requirements across these jobs. Criterion research in Project A, a large-scale selection study in the US Army, empirically confirmed several of these dimensions for 19 jobs in the Army. More will be said about these kinds of criterion models in the section entitled The Future.

A final issue in this area focuses on the boundaries of ‘work performance.’ How should we define performance? Recent attention has been directed toward organizational citizenship behavior (OCB), prosocial organizational behavior, and related constructs (e.g., Organ 1997) as criterion concepts that go beyond task performance and the technical proficiency-related aspects of performance. For example, the OCB concept includes behavior related to helping others in the organization with their jobs and conscientiously supporting and defending the organization. Research has shown that supervisors making overall performance judgments about employees weight OCB and similar constructs about as highly as these employees’ task performance (e.g., Motowidlo and Van Scotter 1994). Also, a link between OCB on the part of organization members and organizational effectiveness has some empirical support (e.g., Podsakoff and MacKenzie 1997).

3. Methods Of Measuring Criterion Performance

Two major types of measures are used to assess criterion performance: ratings and so-called objective measures. Ratings—estimates of individuals’ job performance made by supervisors, peers, or others familiar with their performance—are by far the most often used criterion measure (Landy and Farr 1980). Objective measures such as turnover and production rates will also be discussed.

3.1 Performance Ratings

Performance ratings can be generated for several purposes, including salary administration, promotion and layoff decisions, employee development and feedback, and as criteria in validation research. Most of the research on ratings has focused on the last of these purposes, for-research-only ratings. Research to be discussed is in the areas of evaluating the quality of ratings, format effects on ratings, and different sources of ratings (e.g., supervisors, peers).

3.1.1 Evaluation Of Ratings. Ratings of job performance sometimes suffer from psychometric errors such as distributional errors or illusory halo. Distributional errors include leniency/severity, where raters evaluate ratees either too high or too low in comparison to their actual performance levels. Restriction-in-range is another distributional error. With this error, a rater may rate two or more ratees on a dimension such that the spread (i.e., variance) of these ratings is lower than the variance of the actual performance levels for these ratees (Murphy and Cleveland 1995). Illusory halo occurs when a rater makes ratings on two or more dimensions such that the correlations between the dimensions are higher than the between-dimension correlations of the actual behaviors relevant to those dimensions (Cooper 1981).

A second common approach for evaluating ratings is to assess interrater reliability, either within rating source (e.g., peers) or across sources (e.g., peers and supervisor). The notion here is that high interrater agreement implies that the ratings are accurate. Unfortunately, this does not necessarily follow. Raters may agree in their evaluations because they are rating according to ratee reputation or likeability, even though these factors might have nothing to do with actual performance. In addition, low agreement between raters at different organizational levels may result from these raters’ viewing different samplings of ratee behavior or having different roles related to the ratees. In this scenario, each level’s raters might be providing valid evaluations, but for different elements of job performance (Borman 1997). Accordingly, assessing the quality of ratings using interrater reliability is somewhat problematic. On balance, however, high interrater reliability is desirable, especially within rating source.

A third approach sometimes suggested for evaluating ratings is to assess their validity or accuracy. The argument made is that rating errors and interrater reliability are indirect ways of estimating what we really want to know. How accurate are the ratings at reflecting actual ratee performance? Unfortunately, evaluating rating accuracy requires comparing ratings to some kind of ‘true score,’ a characterization of each ratee’s actual performance. Because determining true performance scores in a work setting is typically impossible, research on the accuracy of performance ratings has proceeded in the laboratory. To evaluate accuracy, written or videotaped vignettes of hypothetical ratees have been developed, and target performance scores on multiple dimensions have been derived using expert judgment. Ratings of these written vignette or videotaped performers can then be compared to the target scores to derive accuracy scores (Borman 1977).
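One simple way to index accuracy in such laboratory work is the mean absolute deviation of a rater's scores from the expert-derived target scores. The vignette numbers below are hypothetical, and published research distinguishes several more refined accuracy components than this single distance measure:

```python
def accuracy(ratings, targets):
    """Mean absolute deviation from expert-derived target scores:
    lower values indicate more accurate ratings."""
    return sum(abs(r - t) for r, t in zip(ratings, targets)) / len(targets)

targets = [4.0, 6.5, 3.0, 5.5]   # expert-judged performance per vignette
ratings = [4.5, 6.0, 3.5, 5.0]   # one rater's scores for the same vignettes

print(accuracy(ratings, targets))  # 0.5
```

A score of zero would mean the rater reproduced the expert targets exactly; larger values mean greater departure from the target profile.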

3.1.2 Rating Formats. A variety of different rating formats have been developed to help raters evaluate the performance of individuals in organizations. Over the years, some quite innovative designs for rating formats have been introduced. In this section we discuss two of these designs: numerical rating scales and behaviorally anchored rating scales.

It may seem like an obvious approach now, but the notion of assigning numerical ratings in evaluating organization members was an important breakthrough. Previously, evaluations were written descriptions of the ratee’s performance. The breakthrough was that with numerical scores, ideally, well informed raters could quantify their perceptions of individuals’ job performance, and the resulting scores provide a way to compare employees against a standard or with each other.

One potential disadvantage of numerical scales is that there is no inherent meaning to the numbers on the scale. To address this difficulty, Smith and Kendall (1963) introduced the notion of behaviorally anchored rating scales (BARS). These authors reasoned that different levels of effectiveness on rating scales might be anchored by behavioral examples of job performance (e.g., always finds additional productive work to do when own normally scheduled duties are completed—high performance level on Conscientious Initiative rating dimension). The behavioral examples are each scaled according to their effectiveness levels by persons knowledgeable about the job and then placed on the scale at the points corresponding to their respective effectiveness values. This helps raters compare the observed performance of a ratee with the behavioral anchors on the scale, in turn leading to more objective, behavior-based evaluations.

3.2 Rating Sources

Supervisors are the most often used rating source in obtaining performance evaluations of organization members. An advantage of supervisory ratings is that supervisors typically are experienced in making evaluations and have a good frame of reference perspective from observing large numbers of subordinates. A disadvantage is that in at least some organizations supervisors do not directly observe ratee performance on a day-to-day basis. In addition, coworker or peer ratings are sometimes used to evaluate performance. A positive feature for peer ratings is that coworkers often observe employee performance more closely and more regularly than supervisors. A difficulty is that coworkers are less likely than supervisors to have experience evaluating ratee performance.

Other rating sources may also be involved in assessing job performance. In fact, a recent emphasis has been toward the concept of ‘360 ratings,’ augmenting supervisory ratings with evaluations from peers, subordinates, self-ratings, and even customers’ assessments. The general idea is to assess performance from multiple perspectives so that a balanced evaluation of job performance might be obtained (Bracken 1996).

3.3 Objective Criteria

A second major measurement method for criterion performance involves use of objective criteria. Objective criteria employed in personnel research include turnover, production rates, and work samples of employee performance. At first glance, one may presume that objective criteria are more desirable than ratings, which are inherently subjective. Unfortunately, judgment often enters into the assignment of objective criterion scores. Also, objective measures are often contaminated as criteria, with problems such as factors beyond the assessee’s control influencing these outcome measures. Nonetheless, when they are relevant to important performance areas and are reasonably reliable and uncontaminated, objective measures can be useful in indexing some criterion dimensions.

3.3.1 Turnover. Turnover or attrition is often an important criterion because the cost of training replacement personnel is usually high (e.g., Mobley 1982); also, having people, especially key people, leave the organization can be disruptive and can adversely affect organizational effectiveness. Turnover is sometimes treated as a single dichotomous variable—a person is either a ‘leaver’ or a ‘stayer.’ This treatment fails to distinguish between very different reasons for leaving the organization (e.g., being fired for a disciplinary infraction versus leaving voluntarily for health reasons). Clearly, turnover for such different reasons will have different patterns of relationships with individual difference or organizational factor predictors. Prediction of turnover with any substantive interpretation requires examining the categories of turnover. It may be, for example, that employees fired for disciplinary reasons have reliably different scores on certain personality scales (e.g., lower socialization) compared to stayers, whereas prior health status is the only predictor of leaving the organization for health reasons. This approach to dealing with turnover as a criterion variable appears to offer the most hope for learning more about why individuals leave organizations and what can be done to reduce unwanted turnover.
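The categorical treatment argued for above can be sketched as a simple tally. The record labels below are invented for illustration:

```python
from collections import Counter

# Hypothetical personnel records: "stay" or the reason for leaving.
records = ["stay", "fired_disciplinary", "stay", "voluntary_health",
           "voluntary_other", "stay", "fired_disciplinary"]

# Dichotomous treatment: counts leavers but discards the reason.
n_leavers = sum(r != "stay" for r in records)

# Categorical treatment: preserves the distinctions that carry
# different predictor relationships.
by_reason = Counter(r for r in records if r != "stay")
```

Only the categorical tally supports analyses that relate each type of leaving to its own predictors.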

3.3.2 Production Rates. For jobs that have observable, countable products that result from individual performance, a production rate criterion is a compelling bottom-line index of performance. However, considerable care must be taken in gathering and interpreting production data. For example, work-related dependencies on other employees’ performance or on equipment for determining production rates may create bias in these rates. Also, production standards and quota systems (e.g., in data entry jobs) create problems for criterion measurement.

3.4 Work Sample Tests

Work sample or performance tests (e.g., Hedge and Teachout 1992) are sometimes developed to provide criteria, especially for training programs. For example, to help evaluate the effectiveness of training, work samples may be used to assess performance on important tasks before and after training. Such tests can also be used for other personnel research applications, such as criteria in selection studies.

Some argue that work sample tests have the highest fidelity for measuring criterion performance. In a sense, the argument is compelling: What could be more direct and fair than to assess employees’ performance on a job by having them actually perform some of the most important tasks associated with it? In fact, some researchers and others may view work samples as ultimate criteria—that is, the best and most relevant criterion measures. Performance tests should not be thought of in this light. First, they are clearly maximum performance rather than typical performance measures. As such, they tap the ‘can-do’ more than the ‘will-do’ performance-over-time aspects of effectiveness. Yet ‘will-do,’ longer-term performance is certainly important for assessing effectiveness in jobs. Accordingly, these measures are deficient when used exclusively in measuring performance. Nonetheless, work samples can be a useful criterion for measuring aspects of maximum performance.

3.5 Conclusions

The following conclusions can be drawn about criteria and criterion measurement methods.

3.5.1 Performance Ratings. Ratings have the inherent potential advantage of being sensitive to ratee performance over time and across a variety of job situations. Ideally, raters can average performance levels observed, sampling performance-relevant behavior broadly over multiple occasions. Provided observation and subsequent ratings are made on all important dimensions, ratings can potentially avoid problems of contamination and deficiency. Of course, these are potential advantages of the rating method. Rater error, bias, and other inaccuracies must be reduced in order for ratings to realize this potential. Both basic and applied research are needed to learn more about what performance ratings are measuring and how to improve on that measurement.

3.5.2 Objective Measures. Like other methods of measuring criterion performance, objective measures can be useful. However, these measures are almost always deficient, contaminated, or both. Indices such as turnover and production rates produce data pertinent to only a portion of the criterion domain. In addition, some of these indices are often determined in part by factors beyond the employee’s control.

Regarding work sample tests, the point was made that these measures should not in any sense be considered as ultimate criteria. Nonetheless, well-conceived and competently constructed performance tests can be valuable measures of maximum, ‘can-do’ performance.

3.6 The Future

An encouraging development in criterion measurement is the consideration of models of work performance (e.g., Campbell et al. 1993). As mentioned, models of performance seek to identify criterion constructs (e.g., Communication, Personal Discipline) that reflect broadly relevant performance requirements for jobs. Such criterion models can help to organize accumulating research findings that link individual differences (e.g., ability and personality), organizational variables (e.g., task characteristics), and the individual criterion constructs identified in the models. In addition, efforts should continue toward learning more about what performance ratings and objective performance indexes are measuring, with the goal of improving the accuracy of these measures. Improving the measurement of work performance is critically important for enhancing the science and practice of industrial–organizational psychology toward the broad and compelling objective of increasing the effectiveness of organizations.


  1. Borman W C 1977 Consistency of rating accuracy and rater errors in the judgment of human performance. Organizational Behavior and Human Performance 20: 238–52
  2. Borman W C 1997 360 ratings: An analysis of assumptions and a research agenda for evaluating their validity. Human Resource Management Review 7: 299–315
  3. Bracken D W 1996 Multisource 360 feedback: Surveys for individual and organizational development. In: Kraut A I (ed.) Organizational Surveys. Jossey-Bass, San Francisco, pp. 117–47
  4. Campbell J P, McCloy R A, Oppler S H, Sager C E 1993 A theory of performance. In: Schmitt N, Borman W C (eds.) Personnel Selection in Organizations. Jossey-Bass, San Francisco, pp. 35–70
  5. Cooper W H 1981 Ubiquitous halo. Psychological Bulletin 90: 218–44
  6. Hedge J W, Teachout M S 1992 An interview approach to work sample criterion measurement. Journal of Applied Psychology 77: 453–61
  7. Landy F J, Farr J L 1980 Performance rating. Psychological Bulletin 87: 72–107
  8. Mobley W H 1982 Some unanswered questions in turnover and withdrawal research. Academy of Management Review 7: 111–16
  9. Motowidlo S J, Van Scotter J R 1994 Evidence that task performance should be distinguished from contextual performance. Journal of Applied Psychology 79: 475–80
  10. Murphy K R, Cleveland J N 1995 Understanding Performance Appraisal. Sage, Thousand Oaks, CA
  11. Organ D W 1997 Organizational citizenship behavior: it’s construct clean-up time. Human Performance 10: 85–97
  12. Podsakoff P M, MacKenzie S B 1997 Impact of organizational citizenship behavior on organizational performance: a review and suggestions for future research. Human Performance 10: 133–51
  13. Schmidt F L, Kaplan L B 1971 Composite versus multiple criteria: a review and resolution of the controversy. Personnel Psychology 24: 419–34
  14. Smith P C, Kendall L M 1963 Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology 47: 149–55

