Generalizability Theory Research Paper


1. Overview

Generalizability theory (G theory) provides a framework for conceptualizing, investigating, and designing reliable observations. It was originally introduced by Cronbach and colleagues (1963, 1972) in response to limitations of the popular true score model of classical reliability theory (Spearman 1904, 1910). Classical reliability theory centers on the notion that each observation or test score has a single true score, belongs to one family of parallel observations, and yields a single reliability coefficient (Nunnally and Bernstein 1994). While this model may be reasonable for carefully equated parallel forms of tests, it is overly restrictive and often unrealistic in situations where, for instance, raters differ in central tendency and variance, observations depend on the context in which they occur, and constructs are patently heterogeneous. Several writers argued in the 1950s and 1960s that the same observation could conceivably belong to more than one set of parallel tests and, thus, have more than one reliability coefficient (Cronbach et al. 1963, Guttman 1953, Tryon 1957). The fact that the internal consistency reliability estimate of a multidimensional measure tends to be low even when test–retest and alternate-forms reliability estimates are high illustrates the contradictions and limitations of the classical reliability model.



It was Cronbach’s vision to reinterpret classical reliability theory as a theory regarding the adequacy with which one can generalize from a sample of observations to the universe of observations from which the sample was randomly drawn. G theory acknowledges that the reliability of an observation depends on the universe about which the investigator wants to draw inferences. Because a particular measure may, conceivably, be generalized to many different universes, a measure may vary in how reliably it permits inferences about these universes and, therefore, be associated with different reliability coefficients. G theory explicitly requires investigators to specify a universe of conditions over which they wish to generalize. The extent to which different conditions are associated with different observations has implications for designing dependable observations.

G theory not only reinterprets classical reliability theory, but also shows how the traditional distinction between reliability and validity can be overcome to design dependable observations. Traditionally, construct validity is concerned with drawing inferences about a latent construct based on observable measures, whereas reliability is concerned with drawing inferences about a true score from observations across parallel measures. In G theory, a universe, its facets, and the conditions for admissible observations are defined through careful construct explication, the traditional domain of validity theory. G theory defines observations as dependable if they permit accurate inferences about the latent construct (i.e., the universe of admissible observations) they are meant to represent. In so doing, G theory blurs the traditional distinction between reliability and validity. The use of the terms ‘dependability’ and ‘generalizability’ instead of ‘reliability’ reflects the interest in unifying reliability and validity. How to go about designing and investigating dependable measures is the subject of G theory.




2. Key Concepts

2.1 The Decision Maker

Measurements are designed with applications in mind, and their quality has to be examined in the context of their application. This context is symbolized by the ‘decision maker’ who is interested in measuring a particular construct in a particular population of persons, under a particular universe of conditions, and with particular types of decisions in mind.

2.2 Types Of Decisions

The types of decisions for which a measurement instrument is to be used must be known before an instrument that yields dependable observations can be designed. Two major types of decisions have to be distinguished. The first concerns decisions that rely on the relative ranking of individuals. Accepting the three top-scoring applicants for a position is an example of a decision that relies on the relative interpretation of scores. Measurements that emphasize the interpretation of the relative differences between scores are the domain of classical test theory with its traditional focus on interindividual differences.

The second major type of decision is based on the interpretation of the absolute level of scores. Mastery tests in education provide an example of a decision based on the absolute interpretation of a score. Similarly, pre-established admissions standards for college (e.g., a minimum SAT score) or pass/fail standards on job selection tests rely on the interpretation of absolute score levels.

2.3 The Universe Of Admissible Observations

A measurement consists of a sample from the universe of admissible observations. What defines a universe for a particular measurement is based on what the decision maker is willing to treat as interchangeable for the purposes of making a decision. Thus, a universe may consist of interchangeable three-letter syllables, symptoms of depression, or interviewers if the decision maker is willing to treat as interchangeable three-letter syllables, depressive symptoms, and interviewers. Or, a universe may consist of interchangeable time points for urine collection, survey questions about voting behavior, or situations for observing altruistic behaviors if the decision maker considers those aspects of measurement to be interchangeable.

2.4 Estimating A Person’s Universe Score

Given a particular universe of admissible observations, a person’s universe score (µp) can be defined as the average score based on all admissible observations (Xpi) of the universe of interest. The purpose of a measurement is to accurately estimate this universe score (µp) based on a sample of observations.
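
Stated formally, the universe score is the expected value of a person’s observations taken over the universe of admissible observations. A minimal statement in the notation used here (the expectation operator E is our addition, not part of the original text):

```latex
% Universe score of person p: the expectation of X_pi over the
% universe of admissible observations i
\mu_p \equiv \mathrm{E}_i\left(X_{pi}\right)
```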

The degree to which a measurement is generalizable depends on how accurately the measurement permits us to estimate the universe score. The accuracy of an observation is captured in the variance associated with the different presumably interchangeable observations. The extent to which different items or different observers or different measurement occasions yield different observations determines how dependable a single specific observation is. In other words, if different items, observers, etc., yield similar observations, a single measurement may allow accurate inferences about the universe of observations. If different items, observers, occasions, etc., yield dissimilar observations, inferences about the underlying universe based on a single observation may be questionable.

2.5 The Universe And Its Facets

Universes can be simple or complex, homogeneous or heterogeneous, and small or large, depending on the construct of interest and the decision maker’s interest in investigating different aspects or facets of generalizability. The terms ‘facets’ and ‘conditions’ are analogous to ‘factors’ and ‘levels’ in the literature on experimental design. A universe of observations is said to consist of one facet if the generalizability of observations regarding one source of variation in the universe, say arithmetic questions of varying difficulty, is at stake. For instance, a particular arithmetic test includes a sample of items covering different addition, subtraction, multiplication, and division problems of one- and two-digit numbers. The decision maker is interested in general arithmetic achievement and is indifferent to the particular questions on the test. In the one-facet design, one is interested in estimating the universe score of each person based on the sample of items included in the test. A universe is said to have two (or more) facets if the generalizability of observations regarding two (or more) sources of variation in the universe, say items and occasions, is at stake. For instance, to the extent that arithmetic achievement depends on the difficulty of the items and the test occasion (e.g., beginning or end of the school year), the generalizability of a particular measure taken early in the school year on a sample of difficult items may be compromised.

2.6 Generalizability Studies

Generalizability studies (G studies) are conducted to investigate the relationship between an observed score and a universe score. They are designed to provide information regarding the sources of variability (i.e., facets of the universe) that influence the generalizability of observations. Because the dependability of a measure rests on its purpose (and thus the universe of admissible observations it is meant to represent), different G studies may be needed depending on the proposed application of the measure. Alternatively, a single G study has to anticipate the multiple uses of a measurement to provide as much information as possible about potentially important sources of variation.

In the above example, arithmetic items varying in difficulty are an obvious choice of a facet to be examined in a G study. In addition, decision makers may be interested in whether measures taken on different occasions, or based on different response formats or presentation modes, yield different observations. To examine these issues, the universe has to be defined more broadly to include items of different difficulty, different response formats and presentation modes, and different measurement occasions. Having defined such a multifaceted universe, a G study has to be designed to investigate the contributions of these facets and their interactions to the overall variance in observations.

2.7 Decision Studies

G studies are designed to assess the dependability of a particular measurement technique. In contrast, decision studies (D studies) are designed to gather data based on which decisions about individuals are made. D studies rely on the evidence generated by G studies to design dependable measures for a particular decision and a particular set of facets about which a decision maker would like to generalize. The goal is to design a measure that samples sufficient numbers of instances from different facets of a universe of observations to yield a sufficiently dependable estimate of the universe score which the measure is meant to represent.

3. Illustrative Generalizability Design

3.1 Generalizability Design

For illustration, the statistical models and analyses for a one-facet G study will be presented. This simplest G design requires making observations (Xpi) on individual subjects (1, 2, …, p, …, np) under different conditions of a facet (1, 2, …, i, …, ni), where subjects and conditions represent random samples from their respective universes. In a one-facet design, the universe of admissible observations consists of observations that differ with respect to one characteristic, such as item difficulty. The generalizability question is: How dependably do observations made under particular conditions permit inferences about the universe consisting of all conditions?

3.2 The Statistical Model

An ANOVA model can be used to describe how the observed score of person p under condition i can be partitioned into a grand mean (µ), an effect for each person (πp), an effect for each item (αi), and a residual (εpi).

Xpi = µ + πp + αi + εpi    (1)

In Eqn. (1), the grand mean (µ) is a constant value for all persons, positioning the score for all persons on the particular scale of measurement. The person effect (πp) describes the difference between a person’s universe score and the grand mean (µp − µ), indicating inter-individual differences in universe scores. The item effect (αi) indicates the difference between an item’s difficulty and the average item difficulty (µi − µ), reflecting the variability associated with the facet of error being investigated. The residual combines the influence of the person-by-item interaction, systematic sources of error other than items, and random events.

According to this model, specific observations Xpi vary due to differences between persons (i.e., the person effect), differences between items (i.e., the item effect), and other unaccounted-for sources (i.e., the residual). Accordingly (see Eqn. (2)), the variance of a set of observed scores (σ²Xpi) can be partitioned into components due to persons (σ²p), items (σ²i), and the residual (σ²ε). The residual combines two sources of variance that cannot be distinguished from each other in this model: the person-by-item interaction (i.e., items vary in difficulty between persons; persons vary in their ranking on different items) and chance fluctuations in scores.

σ²Xpi = σ²p + σ²i + σ²ε    (2)

3.3 G Study: Estimation Of Variance Components

To estimate these variance components, a G study has to be conducted in which data are collected on a representative sample of persons and items. Variance estimates can be derived from the mean squares (MS) reported as part of the standard ANOVA output of statistical analysis programs (see Eqns. (3)–(5)), in which persons are specified as the between-subjects factor and items as the within-subjects factor.

σ²p = (MSp − MSres) / ni    (3)

σ²i = (MSi − MSres) / np    (4)

σ²ε = MSres    (5)
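
To make Eqns. (3)–(5) concrete, the following sketch shows one way to estimate the variance components from a complete persons-by-items score matrix (Python; the function and variable names are our own, and a fully crossed design with no missing data is assumed):

```python
import numpy as np

def one_facet_variance_components(scores):
    """Estimate variance components for a one-facet (persons x items)
    G study from ANOVA mean squares, following Eqns. (3)-(5).

    `scores` is an n_persons x n_items array of observations X_pi.
    """
    n_p, n_i = scores.shape
    grand_mean = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    # Sums of squares for persons, items, and the residual
    ss_p = n_i * np.sum((person_means - grand_mean) ** 2)
    ss_i = n_p * np.sum((item_means - grand_mean) ** 2)
    ss_total = np.sum((scores - grand_mean) ** 2)
    ss_res = ss_total - ss_p - ss_i

    # Mean squares
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Variance components via the expected-mean-square equations
    var_p = (ms_p - ms_res) / n_i    # Eqn. (3): persons
    var_i = (ms_i - ms_res) / n_p    # Eqn. (4): items
    var_res = ms_res                 # Eqn. (5): residual
    return var_p, var_i, var_res
```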

3.4 A Numerical Example

Table 1 presents the ANOVA summary table from a G study in which 1,140 persons rated the frequency with which they experienced 20 depressive symptoms. The table presents the numerical estimates and the percentage of total variance accounted for by persons, items, and the residual.

Table 1. ANOVA summary: estimated variance components and percentage of total variance

Source              Variance component    Percent of total variance
Persons (p)         0.0596                15
Items (i)           0.0230                 6
Residual (pi, e)    0.3089                79

3.5 Interpretation Of Variance Components

Variance components reveal how different sources of variability affect the response to a single item. For instance, the variance component associated with items in the above example is 0.0230, accounting for 6 percent of the total variance. This variance component indicates how much the mean of any one randomly selected item is expected to vary from the mean of all items in the universe. Clearly, the magnitude of a variance component depends on how the items were scored (e.g., 1, 2, 3, 4 or 10, 20, 30, 40). In this example, items were scored 1, 2, 3, 4, where 1 indicates the absence of a depressive symptom and 4 indicates that the symptom was present all of the time. The average item mean (i.e., the estimate of the mean of all items in the universe, µ) was 1.38, indicating that most respondents reported the absence of depressive symptoms. Taking the square root of the variance component (√0.0230 = 0.1517) indicates that we would expect approximately 95 percent of item means to fall within ±0.2972 (i.e., 1.96 × 0.1517) of the overall mean (CI95% = [1.08, 1.68]). Given that the possible range of item scores is 1 to 4, this expected range indicates that the item means in the universe of admissible items will mostly vary between 1 and 2. The variance component for persons is 0.0596, or approximately 15 percent of the total variance. This variance component estimates the universe-score variance, that is, the between-person variability in the average item scores across all the admissible items of the universe. The corresponding standard deviation is 0.2441, indicating that 95 percent of universe scores are expected to fall within ±0.4784 (i.e., 1.96 × 0.2441) of the grand mean, an interval of approximately 0.90 to 1.86 (effectively 1.0 to 1.86, given the scale minimum of 1).
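
These interval calculations are simple arithmetic on the variance components. A short check (Python; the numerical values are those reported above) reproduces the item-facet interval:

```python
import math

var_i = 0.0230      # item variance component (Table 1)
grand_mean = 1.38   # average item mean reported above

sd_i = math.sqrt(var_i)          # ~0.1517
half_width = 1.96 * sd_i         # ~0.2972
lower = grand_mean - half_width  # ~1.08
upper = grand_mean + half_width  # ~1.68
print(f"95% of item means expected in [{lower:.2f}, {upper:.2f}]")
```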

Similar to the reliability coefficient in classical measurement theory, G theory yields a coefficient of generalizability (ρ²) based on a particular G design. It is defined as the ratio of universe-score variance to observed-score variance. In this example, the G coefficient for a single item is ρ² = σ²p / (σ²p + σ²i + σ²ε) = 0.0596/0.3915 = 0.1522. Similarly, intraclass correlation coefficients can be determined. For one item it is α(1) = (MSp − MSres)/(MSp + (ni − 1)MSres) = 0.1617, and for the set of 20 items it is α(20) = (MSp − MSres)/MSp = 0.7941. The latter is also known as Cronbach’s α coefficient (or, for dichotomous items, the KR-20 formula).
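
These coefficients are straightforward functions of the variance components. The following sketch (Python; variable names are ours, and the mean squares are back-computed from the expected-mean-square relations in Eqns. (3) and (5)) reproduces the three values:

```python
var_p, var_i, var_res = 0.0596, 0.0230, 0.3089   # from Table 1
n_i = 20                                         # items per test

# G coefficient for a single item: universe-score variance
# divided by observed-score variance
rho2 = var_p / (var_p + var_i + var_res)         # ~0.1522

# Mean squares implied by Eqns. (3) and (5)
ms_res = var_res                                 # 0.3089
ms_p = n_i * var_p + ms_res                      # ~1.5009

alpha_1 = (ms_p - ms_res) / (ms_p + (n_i - 1) * ms_res)   # ~0.1617
alpha_20 = (ms_p - ms_res) / ms_p                          # ~0.7941
```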

By far the largest variance component is the residual (0.3089), accounting for approximately 79 percent of the total variance. This suggests that there are important sources of variance not accounted for by differences between persons or differences in item difficulties (i.e., item means). Because the residual contains the item-by-person interaction, this variance component reflects the fact that subjects indicate somewhat different levels of depressive symptoms depending on which item they are responding to. The corresponding standard deviation is 0.5558 (√0.3089), indicating that the response to a specific item is expected to fall within ±1.09 (i.e., 1.96 × 0.5558) of a person’s universe score in 95 percent of observations.

In summary, the results show that responses to a single item are mostly influenced by measurement error. Information about inter-individual differences makes up only 15 percent of the total variance, and information about differences between test items makes up only 6 percent of the total variance in item responses. Clearly, a single test item is of little use in estimating a person’s universe score. Consequently, decision studies must sample many items to achieve more dependable estimates. Exactly how many items are needed is the subject of D studies.

3.6 D Study: Designing Generalizable Observations For Decisions

To conduct a D study, the relevant universe of observations and its facets must have been investigated in the G study. The type of decisions to be made must be known and the accuracy necessary to make the decisions specified. With this information at hand, the number of items from each facet of the universe can be determined such that dependable inferences can be drawn about the entire universe of observations at stake.

For instance, the decision maker in the previous example may want to design a measure that yields a 95 percent margin of error of ±0.25 for the estimate of a person’s universe score. That is, the decision maker wants to be 95 percent confident that a person’s universe score is within ±0.25 of the observed score. Based on this 95 percent margin of error (i.e., 0.25 = 1.96 × σerror), the error variance of the universe-score estimate can be determined: σ²error = (0.25/1.96)² = 0.0163.

For decisions based on the relative ranking of scores, the required number of items is:

ni = σ²ε / σ²error

For decisions based on the absolute level of scores, the required number of items is:

ni = (σ²i + σ²ε) / σ²error

Thus, for relative decisions, we need to sample ni = 0.3089/0.0163 = 18.95, or 19 items. And, for absolute decisions, we need to sample ni = (0.0230 + 0.3089)/0.0163 = 20.36, or 21 items. In this example, very little variance is contributed by differences between items in their difficulties (i.e., σ²i), indicating that, regardless of which items are sampled, similar absolute scores will be obtained. Consequently, only two additional items have to be sampled to yield equally dependable estimates of universe scores for absolute as for relative decisions.
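
The same D-study computation can be scripted. A minimal sketch in Python, implementing the two sample-size formulas above with the variance components from Table 1:

```python
import math

var_i, var_res = 0.0230, 0.3089    # variance components (Table 1)
margin = 0.25                      # desired 95% margin of error
var_error = (margin / 1.96) ** 2   # ~0.0163

# Items needed for relative and absolute decisions, rounded up
n_relative = math.ceil(var_res / var_error)             # 19 items
n_absolute = math.ceil((var_i + var_res) / var_error)   # 21 items
print(n_relative, n_absolute)
```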

4. Extensions

The principles presented here for a simple generalizability scenario can be extended to more complex designs consisting of two or more factors, which may be random or fixed, and which may be crossed or nested. The interested reader is referred to the works by Cronbach et al. (1972) and Shavelson and Webb (1991). G theory can also be applied to the analysis of score profiles, composites, and difference scores. Furthermore, extensions have been developed to estimate variance components via maximum likelihood, Bayesian, and covariance structure methods, and for studying the dependability of any facet of observation. The interested reader is referred to the work by Marcoulides (1996, 1999), Shavelson et al. (1989), Wittmann (1988), and Webb et al. (1983).

Bibliography:

  1. Cronbach L J, Gleser G C, Nanda H, Rajaratnam N 1972 The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. Wiley, New York
  2. Cronbach L J, Rajaratnam N, Gleser G C 1963 Theory of generalizability: A liberation of reliability theory. British Journal of Statistical Psychology 16: 137–63
  3. Guttman L 1953 A special review of Harold Gulliksen’s theory of mental tests. Psychometrika 18: 123–30
  4. Marcoulides G A 1996 Estimating variance components in generalizability theory: The covariance structure analysis approach. Structural Equation Modeling 3: 290–9
  5. Marcoulides G A 1999 Generalizability theory: Picking up where the Rasch IRT model leaves off. In: Embretson S E, Hershberger S L (eds.) The New Rules of Measurement: What Every Psychologist and Educator Should Know. Lawrence Erlbaum, Mahwah, NJ, pp. 129–52
  6. Nunnally J C, Bernstein I H 1994 Psychometric Theory, 3rd edn. McGraw-Hill, New York
  7. Shavelson R J, Webb N M 1981 Generalizability theory: 1973–1980. British Journal of Mathematical and Statistical Psychology 34: 133–66
  8. Shavelson R J, Webb N M 1991 Generalizability Theory: A Primer. Sage, Newbury Park, CA
  9. Shavelson R J, Webb N M, Rowley G L 1989 Generalizability theory. American Psychologist 44: 922–32
  10. Spearman C 1904 The proof and measurement of association between two things. American Journal of Psychology 15: 72–101
  11. Spearman C 1910 Correlation calculated from faulty data. British Journal of Psychology 3: 271–95
  12. Tryon R C 1957 Reliability and behavior domain validity: Reformulation and historical critique. Psychological Bulletin 54: 229–49
  13. Webb N M, Shavelson R J, Maddahian E 1983 Multivariate generalizability theory. New Directions in Testing and Measurement 18: 67–82
  14. Wittmann W W 1988 Multivariate reliability theory: Principles of symmetry and successful validation strategies. In: Nesselroade J R, Cattell R B (eds.) Handbook of Multivariate Experimental Psychology, 2nd edn. Plenum Press, New York, pp. 505–60