Outcome-Based Sampling Research Paper

## 1. Introduction

An outcome-based sample in an observational study is one obtained by stratification on the basis of response behavior whose explanation is the target of study. Observations on response and explanatory variables (covariates) are collected within each stratum. These are then used for statistical inference on the conditional distribution of the response, given the covariates. For example, a study of occupational choice may draw a sample stratified by occupation, so the first stratum is a sample of engineers, the second stratum is a sample of educators, and so forth. Data are collected on covariates such as gender and utilization of training subsidies. The observations are then used to infer the impact of training subsidies on occupational choice.

Outcome-based sampling is also termed endogenous, intercept, on-site, experience-based, response-based, or choice-based sampling, and in biostatistics, retrospective or case-control sampling. To illustrate, case-control designs sample from one stratum that has a specified outcome (the cases) and a second stratum that does not (the controls). There may be further stratification of the controls so that they resemble the cases on some covariates such as age. Outcome-based samples can be contrasted with exogenous or prospective samples that sample randomly from the target population, or stratify solely on covariates. Outcome-based sampling methods have had wide application in social science, biostatistics, and engineering; examples are Akin et al. (1998), Domowitz and Sartain (1999), Laitila (1999), and van Etten et al. (1999).

## 2. Design And Robustness Issues In Endogenous Samples

Outcome-based samples are useful for studying rare responses that are difficult to observe in random samples of practical size, such as risk factors for a rare disease, and for studying outcomes that randomly sampled subjects are unable or unwilling to report. An example of the second type is the study of mental illness as a risk factor for homelessness, where the only practical sample methodology may be one that surveys cases who are homeless and controls who are not (Early 1999, Lacy 1997). Outcome-based sampling may also offer economies in recruiting and interviewing subjects. For example, consumers who purchase and register a product form an outcome-based stratum that the manufacturer can survey inexpensively.

Outcome-based samples may arise by deliberate design, as in case-control studies, but can also be unintentional. General purpose surveys often use cluster sampling to facilitate enumerating and contacting subjects. The cluster stratification is exogenous for most purposes, but has the properties of an outcome-based sample for behavior that is related to the stratification. For example, cluster sampling by census tract or area code becomes outcome based in studies of residential location or migration behavior. Further, screening, self-selection, and attrition create statistical problems like those in outcome-based samples. For example, a sampling protocol that selects subjects by random-digit dialing will over-sample individuals with more than one telephone line, resulting in a sample that is endogenous for study of the demand for telephone service. Differential attrition in which, for example, individuals who live alone are more difficult to contact or more frequently refuse to be interviewed, will produce a sample that is endogenous for the study of living arrangements.

Outcome-based samples are less robust than random samples. Because stratum boundaries coincide with the behavioral categories under study, variations in sampling protocol across strata are confounded with behavioral response. For example, in case-control studies of risk factors for disease, variations in question format between cases and controls, or even the context for questions induced by the disease status, can alter recall on exposure to hazards. Unlike prospective studies where randomization of treatments and double-blind designs can minimize the risks of contamination, retrospective studies are vulnerable to investigator bias and latent causal factors that are correlated with measured risk factors. As a result, differences in reported exposure to risk factors between cases and controls may arise from spurious sampling effects as well as from a true causal effect. The difficulty of avoiding contaminating bias in case-control settings is discussed further in Austin et al. (1994), Feinstein (1979), Lilienfeld and Lilienfeld (1979), and Taubes (1995).

## 3. Statistical Inference In Endogenous Samples

Statistical methods developed for random samples will often be inconsistent or inefficient when applied to endogenous samples. The essential problem is that the analysis is attempting to infer properties of the conditional distribution of outcomes given covariates, using observations that are drawn from the conditional distribution of covariates given outcomes. The solution to the inference problem is to incorporate the mapping between the conditional distributions in the analysis, either by reweighting observations so that they behave as if they were drawn from a random sample, or by reweighting the probability model for a random sample so that it is consistent with the empirical sampling process. The statistical issues in analyzing outcome-based samples were treated in a seminal paper by Manski and Lerman (1977), with further results by Manski and McFadden (1981) and Cosslett (1981). A largely independent treatment of logistic regression in the case-control model appeared in biostatistics, notably in the papers of Anderson (1972) and Prentice and Pyke (1979). Later papers by Hsieh et al. (1985) and Breslow (1996) sharpen the statistical analysis of case-control designs and show their relationship to the general problem of endogenous sampling.

Table 1 is a contingency table that depicts, schematically, the population probability law for an outcome variable y and a vector of covariates z. (This exposition treats y and z as discrete, but the discussion applies with minor modifications to the case where y and/or some components of z are continuous.) The joint probability of a (y, z) cell can be written as the product of the conditional probability of y given z and the marginal probability of z, p(y, z) = P(y|z)•p(z). The row sums give the marginal probability p(z) of z, and the column sums give the marginal probability q(y) = ∑_{z} P(y|z)•p(z) of y. Dividing a cell probability by its column sum gives the conditional probability of z given y, Q(z|y) = P(y|z)•p(z)/q(y). This mapping between conditional probabilities is Bayes' Law. The target of statistical analysis is the conditional probability P(y|z), sometimes termed the response probability. In applications, P(y|z) is usually assumed to be invariant under treatments that alter the marginal probability of z; then knowledge of P(y|z) permits the analysis to forecast y in new populations or under policy treatments where the z distribution is changed. (A conditional probability with this invariance property is sometimes said to define a causal model. It is true that a causal structure will imply this invariance property, but it is also possible for the invariance property to hold, making forecasting possible, without the presence of a deeper causal structure. Further, there are straightforward statistical tests for the invariance property, while detection of true causal structures is beyond the reach of statistics. For these reasons, it is best to avoid the language of causality and concentrate instead on invariance properties.)
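The mapping between the two conditional distributions can be checked numerically. The following is a minimal sketch in Python with NumPy; the three covariate cells, the two outcomes, and all probability values are hypothetical numbers chosen for illustration, not figures from Table 1.

```python
import numpy as np

# Hypothetical population: rows are covariate cells z, columns are outcomes y.
p_z = np.array([0.5, 0.3, 0.2])              # marginal probability p(z)
P_y_given_z = np.array([[0.9, 0.1],          # response probabilities P(y|z);
                        [0.7, 0.3],          # each row sums to one
                        [0.4, 0.6]])

joint = P_y_given_z * p_z[:, None]           # cell probabilities p(y, z) = P(y|z) p(z)
q_y = joint.sum(axis=0)                      # column sums give q(y)
Q_z_given_y = joint / q_y                    # Bayes' Law: Q(z|y) = P(y|z) p(z) / q(y)

# Inverting the mapping recovers the response probabilities.
recovered = Q_z_given_y * q_y / p_z[:, None]
```

The inversion in the last line needs both q(y) and p(z), which is why an outcome-based sample, which reveals only Q(z|y), must be supplemented with information on these marginal probabilities.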

Random sampling draws from the table in proportion to the cell probabilities. Exogenous stratification draws rows, with probabilities that may differ from the population marginal probabilities p(z), and then within a row draws columns in proportion to their population conditional probabilities P(y|z). A simple outcome-based sampling design draws columns, with probabilities that may differ from the population marginal probabilities q(y), then within a column draws rows in proportion to their conditional probabilities Q(z|y) = P(y|z)•p(z)/q(y).

More complex endogenous sampling designs are also possible. A general framework that permits a unified analysis of many sampling schemes characterizes the sampling protocol for a stratum s in terms of a probability R(z, y, s) that a member of the population in cell (y, z) will qualify for the stratum. The joint probability that a member of the population is in cell (y, z) and will qualify for stratum s is then R(z, y, s)•P(y|z)•p(z). Then the proportion of the population qualifying into the stratum, or qualification factor, is r(s) = ∑_{z}∑_{y} R(z, y, s)•P(y|z)•p(z), and the conditional distribution of (z, y) given qualification is R(z, y, s)•P(y|z)•p(z)/r(s). When a fraction f(s) of the sample is drawn from stratum s, the probability law for an observation from the pooled sample is g(y, z) = ∑_{s} R(z, y, s)•P(y|z)•p(z)•f(s)/r(s). The conditional distribution of y given z in this pooled sample is

$$g(y \mid z) = \frac{P(y \mid z) \sum_{s} R(z, y, s)\, f(s)/r(s)}{\sum_{y'} P(y' \mid z) \sum_{s} R(z, y', s)\, f(s)/r(s)}.$$

Note that this conditional probability depends on the marginal distribution of z only through the qualiﬁcation factors.
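As an illustration, the pooled-sample conditional probability can be computed directly by dividing the pooled law g(y, z) by its sum over y. The sketch below assumes discrete y and z; the function name `pooled_conditional`, the array layout, and the numerical values are hypothetical choices made for this example.

```python
import numpy as np

def pooled_conditional(P, p_z, R, f):
    """Return g(y|z) and r(s) for a pooled stratified sample.

    P   : (Z, Y) array, response probabilities P(y|z)
    p_z : (Z,)   array, population marginal p(z)
    R   : (S, Z, Y) array, qualification probabilities R(z, y, s)
    f   : (S,)   array, share of the pooled sample drawn from each stratum
    """
    joint = P * p_z[:, None]                    # p(y, z)
    r = (R * joint).sum(axis=(1, 2))            # qualification factors r(s)
    w = np.tensordot(f / r, R, axes=1)          # sum_s R(z, y, s) f(s)/r(s), shape (Z, Y)
    g_joint = joint * w                         # pooled-sample law g(y, z)
    return g_joint / g_joint.sum(axis=1, keepdims=True), r

# Hypothetical population with three covariate cells and two outcomes.
p_z = np.array([0.5, 0.3, 0.2])
P = np.array([[0.9, 0.1], [0.7, 0.3], [0.4, 0.6]])
f = np.array([0.5, 0.5])

# Pure outcome-based design: stratum s admits exactly the cells with y == s.
R_outcome = np.zeros((2, 3, 2))
R_outcome[0, :, 0] = 1.0
R_outcome[1, :, 1] = 1.0
g_outcome, r_outcome = pooled_conditional(P, p_z, R_outcome, f)

# Exogenous design: strata are defined by covariate cells only, regardless of y.
R_exog = np.zeros((2, 3, 2))
R_exog[0, 0, :] = 1.0           # stratum 0 admits z = 0
R_exog[1, 1:, :] = 1.0          # stratum 1 admits z in {1, 2}
g_exog, r_exog = pooled_conditional(P, p_z, R_exog, f)
```

In this sketch the exogenous design returns the population response probabilities unchanged, while the outcome-based design multiplies the odds of the second outcome by the ratio of the two sampling factors f(s)/r(s).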

When the sampling protocol is exogenous (i.e., R(z, y, s) does not depend on y), the conditional probability g(y|z) for the pooled sample equals the population conditional probability P(y|z). Consequently, any statistical inference procedure designed to reveal features of the conditional probability P(y|z) in random samples will apply to an exogenously stratified sample. In particular, if P(y|z) is in a parametric family, then maximization of the random sample likelihood function in an exogenously stratified sample will have the same properties as in a random sample. However, in an endogenous sample in which the qualification probability R(z, y, s) does depend on y, the conditional probability g(y|z) for the pooled sample is not equal to P(y|z). Consequently, statistical inference assuming that the data generation process is described by P(y|z) is generally statistically inconsistent. Also, the distribution of covariates in an endogenous sample will differ from their population distribution, with

$$g(z) = p(z) \sum_{y} P(y \mid z) \sum_{s} R(z, y, s)\, f(s)/r(s),$$

and a corresponding correction factor must be applied to the sample empirical distribution of z to estimate population quantities consistently.

Manski and McFadden (1981) propose that statistical inference when P(y|z) is parametric be based on the conditional likelihood g(y|z), and term this the conditional maximum likelihood (CML) method. When the qualification factors r(s) and sample frequencies f(s) are known or can be estimated consistently from external samples, and the forms of P(y|z) and R(z, y, s) allow identification of any unknown parameters in R(z, y, s), this approach is consistent. In general, the probability g(y|z) is not in the same parametric family as P(y|z). To illustrate, suppose a population has a binomial probit dose-response curve, P(2|z) = Φ(α + zβ), and P(1|z) = 1 − Φ(α + zβ). Suppose the sample consists of a randomly sampled stratum 1 with R(z, y, 1) = 1, plus a stratum 2 drawn from the population with response y = 2, with R(z, y, 2) equal to one if y = 2, and zero otherwise. This is called an enriched sample. The qualification factors are r(1) = 1 and r(2) = q(2). If q(2) is known, a consistent estimate of the slope parameter β in the dose-response curve can be obtained by the CML method with

$$g(1 \mid z) = \frac{f(1)\,[1 - \Phi(\alpha + z\beta)]}{f(1) + \Phi(\alpha + z\beta)\, f(2)/q(2)}, \qquad g(2 \mid z) = \frac{\Phi(\alpha + z\beta)\,[f(1) + f(2)/q(2)]}{f(1) + \Phi(\alpha + z\beta)\, f(2)/q(2)}.$$

By contrast, likelihood maximization using P(y|z) is not consistent for β.
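A minimal sketch of the enriched-sample probit case follows. It evaluates the pooled-sample probability of y = 2 given z, checks it against the mixture of the two strata on a small hypothetical covariate grid, and defines the CML log-likelihood that would be maximized over (α, β). The function names, the scalar covariate, and every numerical value are assumptions made for illustration; the optimizer itself is left out.

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal CDF built from math.erf, applied elementwise."""
    x = np.asarray(x, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def g2_enriched(z, alpha, beta, f1, f2, q2):
    """Pooled-sample probability of y = 2 given z in the enriched design."""
    P2 = norm_cdf(alpha + beta * z)
    return P2 * (f1 + f2 / q2) / (f1 + P2 * f2 / q2)

def cml_loglik(params, y, z, f1, f2, q2):
    """CML log-likelihood for responses y coded 1/2 and scalar covariate z."""
    alpha, beta = params
    g2 = g2_enriched(z, alpha, beta, f1, f2, q2)
    return np.sum(np.where(y == 2, np.log(g2), np.log1p(-g2)))

# Hypothetical population on a three-point covariate grid.
alpha, beta = -1.0, 0.8
z_vals = np.array([-1.0, 0.0, 1.0])
p_z = np.array([0.3, 0.4, 0.3])
f1, f2 = 0.6, 0.4                         # sample shares of strata 1 and 2

P2 = norm_cdf(alpha + beta * z_vals)      # population P(2|z)
q2 = np.sum(P2 * p_z)                     # population share of y = 2

# Direct mixture of the two strata, cell by cell.
mix_y2 = f1 * P2 * p_z + f2 * P2 * p_z / q2
mix_y1 = f1 * (1.0 - P2) * p_z
g2_direct = mix_y2 / (mix_y1 + mix_y2)
g2_formula = g2_enriched(z_vals, alpha, beta, f1, f2, q2)
```

Because the enriched stratum adds only y = 2 responses, the pooled probability of y = 2 exceeds Φ(α + zβ) at every z, which is why treating the pooled data as a random sample misstates the dose-response curve.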

An important simplification of the CML method occurs for what in biostatistics is termed logistic regression in case-control designs. Suppose that the vector of covariates is partitioned into components z = (v, x) with v discrete. (In biostatistics, v will often include variables such as age and gender whose distributions are matched between cases and controls.) Suppose that P(y|v, x) has a multinomial logit form,

$$P(y \mid v, x) = \frac{\exp(\alpha_{y} + \gamma_{yv} + x\beta_{y})}{\sum_{y'} \exp(\alpha_{y'} + \gamma_{y'v} + x\beta_{y'})}.$$

In this model, the β_{y} are slope coefficients for the covariates x, and α_{y} and γ_{yv} are response-specific effects and interactions of response-specific and v-specific effects. Suppose that the qualification probability R(v, x, y, s) does not depend on x. For identification, a normalization such as α_{1} = γ_{1v} = β_{1} = 0 is imposed. The conditional probability g(y|z) is again of multinomial logit form, with the same β_{y} parameters but with the remaining parameters shifted from their population values by sampling factors,

$$g(y \mid v, x) = \frac{\exp(\alpha_{y}^{*} + \gamma_{yv}^{*} + x\beta_{y})}{\sum_{y'} \exp(\alpha_{y'}^{*} + \gamma_{y'v}^{*} + x\beta_{y'})},$$

with

$$\alpha_{y}^{*} + \gamma_{yv}^{*} = \alpha_{y} + \gamma_{yv} + \log \sum_{s} R(v, y, s)\, f(s)/r(s).$$

Note that consistent estimation of this model requires the inclusion of all the alternative-specific effects and interactions that are modified by sampling factors. However, if these variables are included, then the slope parameters β_{y} are estimated consistently without further adjustments for endogenous sampling. (If the raising factors are estimated rather than known, there is an additional contribution to the asymptotic covariance matrix (Hsieh et al. 1985). If the model had included interactions of v and x, then the coefficients on these interactions would contain sampling factors that must be removed to obtain consistent estimates of the corresponding population interactions.) The simplification above was first noted by Anderson (1972).
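The intercept-shift property can be verified numerically in the simplest setting: a binary logit with a single covariate x, no v component, and pure case-control sampling. The sketch below uses hypothetical parameter values and a hypothetical three-point distribution for x; columns of the arrays are ordered as controls, then cases.

```python
import numpy as np

alpha, beta = -3.0, 1.2                      # hypothetical population logit parameters
x = np.array([0.0, 1.0, 2.0])                # covariate values
p_x = np.array([0.6, 0.3, 0.1])              # population distribution of x

P_case = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))   # population probability of being a case
P = np.column_stack([1.0 - P_case, P_case])          # columns: controls, cases
q = (P * p_x[:, None]).sum(axis=0)           # population shares of controls and cases
f = np.array([0.5, 0.5])                     # equal numbers of controls and cases sampled

g_joint = P * p_x[:, None] * (f / q)         # pure outcome-based law, weight f(y)/q(y)
g = g_joint / g_joint.sum(axis=1, keepdims=True)     # sample conditional g(y|x)

sample_logodds = np.log(g[:, 1] / g[:, 0])
shift = np.log((f[1] / q[1]) / (f[0] / q[0]))        # log ratio of sampling factors
```

Here the sample log-odds equal α + shift + βx, so a logit fitted to the case-control data with an unrestricted intercept has the population slope, and only the intercept needs correcting by the log ratio of sampling factors.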

Another solution to inference in choice-based samples is the weighted exogenous sample maximum likelihood (WESML) method proposed by Manski and Lerman (1977) in their seminal paper on this subject. This solution weights the observations, with a weight for (y, z) that is inversely proportional to ∑_{s} R(z, y, s)•f(s)/r(s). When these weights are finite and can be calculated or estimated from external data, consistent estimates can be obtained when the weighted observations are treated as if they were drawn from an exogenous sample. This is a pseudo-maximum likelihood method, and the associated covariance matrix will be of the sandwich form, with an outer product of the pseudo-score in the middle and the inverse of the pseudo-information matrix on the outside. For the case of pure outcome-based sampling, where the strata correspond to the possible values of y and R(z, y, s) is one when y = s and zero otherwise, the qualification factor r(y) equals q(y) and the weights are r(y)/f(y), the ratio of the population to the sample frequency of observing y.
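For pure outcome-based sampling, the weighting logic can be sketched as follows. The example assumes a binary logit with responses coded 0/1 (rather than 1/2) so that the response can index the weight vector; the function names and all numbers are hypothetical. It checks that weighting the sample law by q(y)/f(y) restores the population cell probabilities, and that the expected weighted score is zero at the population parameters while the unweighted score is not.

```python
import numpy as np

def wesml_weights(q, f):
    """Weights q(y)/f(y): population share over sample share of each outcome."""
    return np.asarray(q, dtype=float) / np.asarray(f, dtype=float)

def wesml_loglik(theta, y, x, weights):
    """Weighted binary-logit log-likelihood; y is an integer array coded 0/1."""
    alpha, beta = theta
    eta = alpha + beta * x
    return np.sum(weights[y] * (y * eta - np.logaddexp(0.0, eta)))

# Hypothetical population, used only to check the weighting logic.
alpha, beta = -3.0, 1.2
x = np.array([0.0, 1.0, 2.0])
p_x = np.array([0.6, 0.3, 0.1])
P1 = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))       # P(y=1|x)
joint = np.column_stack([1.0 - P1, P1]) * p_x[:, None]
q = joint.sum(axis=0)                                 # population shares q(y)
f = np.array([0.5, 0.5])                              # sample shares f(y)
w = wesml_weights(q, f)

g_joint = joint * (f / q)                    # law of the outcome-based sample
resid = np.array([0.0, 1.0]) - P1[:, None]   # logit score residual y - P(1|x), per cell

# Expected score of the logit log-likelihood at the true (alpha, beta).
weighted_score = np.array([(g_joint * w * resid).sum(),
                           (g_joint * w * resid * x[:, None]).sum()])
unweighted_score = np.array([(g_joint * resid).sum(),
                             (g_joint * resid * x[:, None]).sum()])
```

In practice `wesml_loglik` would be passed to a numerical optimizer, and standard errors would come from the sandwich covariance matrix rather than from the inverse of the weighted Hessian alone.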

A third approach to inference in outcome-based samples, developed by Cosslett (1981), achieves a semiparametric efficiency bound when P(y|z) is parametric. This estimator is derived by maximum likelihood estimation that first formally concentrates out the marginal probability p(z). The result is an objective function similar to g(y|z), but with parametric sampling factors that satisfy side conditions. Hsieh et al. (1985) adapt Cosslett's analysis to improve the efficiency of the CML method in the presence of auxiliary information.

## 4. Extensions And Related Issues

Closely related to outcome-based sampling in the form described here is the problem of analyzing samples that are subject to self-selection. Several strands in the statistical and econometric literature have investigated estimators appropriate to such data: a seminal paper by Heckman (1974) on sample selection, further work on endogenous stratification by Hausman and Wise (1977), and related work on switching regression by Goldfeld and Quandt (1975).

Extensions of the basic framework for inference in outcome-based samples have been made for a variety of problems. Breslow (1996) provides methods for complex case-control sample designs. Imbens (1992) provides methods for combining outcome-based survey data with aggregate statistics. Imbens and Lancaster (1996) and McFadden (in press) have studied the problem of analysis of endogenously recruited panels.

## 5. Conclusions

Samples in observational studies often fail to qualify as random or exogenously stratified, and instead have the properties of endogenous or outcome-based samples. This may occur inadvertently due to attrition, self-selection, or stratification that is correlated with the response being studied, or may result from deliberate design decisions to facilitate observation, as in case-control studies. This research paper has described the robustness issues and inference methods that are relevant for these samples.

**Bibliography:**

- Akin J et al. 1998 Price elasticities of demand for curative health care with control for sample selectivity on endogenous illness: An analysis for Sri Lanka. Health Economics 7: 509–31
- Anderson J 1972 Separate sample logistic discrimination. Biometrika 59: 19–35
- Austin H, Hill H, Flanders W, Greenberg R 1994 Limitations in the application of case-control methodology. Epidemiologic Reviews 16: 65–76
- Breslow N 1996 Statistics in epidemiology: The case-control study. Journal of the American Statistical Association 91: 14–28
- Cosslett S 1981 Maximum likelihood estimation for choice-based samples. Econometrica 49: 1289–316
- Domowitz I, Sartain R 1999 Determinants of the consumer bankruptcy decision. Journal of Finance 54: 403–20
- Early D 1999 A microeconomic analysis of homelessness: An empirical investigation using choice-based sampling. Journal of Housing Economics 8: 312–27
- Feinstein A 1979 Methodological problems and standards in case-control research. Journal of Chronic Diseases 32: 35–41
- Goldfeld S, Quandt R 1975 Estimation in a disequilibrium model and the value of information. Journal of Econometrics 3(4): 325–48
- Hausman J, Wise D 1977 Social experimentation, truncated distributions, and efficient estimation. Econometrica 45(4): 919–38
- Heckman J 1974 Shadow prices, market wages, and the labor supply. Econometrica 42: 679–94
- Hsieh D, Manski C, McFadden D 1985 Estimation of response probabilities from augmented retrospective observations. Journal of the American Statistical Association 80: 651–62
- Imbens G 1992 An efficient method of moments estimator for discrete choice models with choice-based sampling. Econometrica 60: 1187–214
- Imbens G, Lancaster T 1996 Efficient estimation and stratified sampling. Journal of Econometrics 74: 289–318
- Lacy M 1997 Efficiently studying rare events: Case-control methods for sociologists. Sociological Perspectives 40: 129–54
- Laitila T 1999 Estimation of combined site-choice and trip frequency models of recreational demand using choice-based and on-site samples. Economics Letters 64: 17–23
- Lilienfeld A, Lilienfeld D 1979 A century of case-control studies. Progress? Journal of Chronic Diseases 32: 5–13
- Manski C, Lerman S 1977 The estimation of choice probabilities from choice-based samples. Econometrica 45: 1977–88
- Manski C, McFadden D 1981 Alternative estimators and sample designs for discrete data analysis. In: Manski C, McFadden D (eds.) Structural Analysis of Discrete Data with Econometric Applications. pp. 51–111
- McFadden D in press On endogenously recruited panels. Journal of Applied Econometrics.
- Prentice R, Pyke R 1979 Logistic disease incidence models and case-control studies. Biometrika 66: 403–11
- Taubes G 1995 Epidemiology faces its limits. Science 269: 164–9
- van Etten M, Neumark Y, Anthony J 1999 Male-female differences in the earliest stages of drug involvement. Addiction 94: 1413–19