Sample Population Size Estimation Research Paper.

How many people were missed by the US census? How many injecting drug users are there in Glasgow? Homeless in Scotland? Industrial accidents in Maryland? Bugs in a computer program? How many words did Shakespeare know?

How do we estimate the sizes of such populations? If the population includes a subgroup of identifiable individuals and if their number is known, a simple approach is available. If a random sample can be taken from the whole population and the proportion of the subgroup that occurs in the sample is calculated, that proportion can be used to scale up the sample size to provide an estimate of the unknown size of the population. Three important ‘ifs’ occur in the last two sentences that cause problems for this dual-system estimation (DSE) procedure, also known as capture- (or mark-) recapture (CR) from its widespread use in the study of wildlife populations by means of animals caught, given marks, released, and subsequently recaptured.

Both in this simplest form, and in the many extensions proposed to overcome the problems posed by these ‘ifs’ in the real world, this CR method has many relationships with other statistical methods. The problem is one of sampling from a ﬁnite population. If an indicator variable y is deﬁned to be 1 for an individual in that population, 0 otherwise, the unknown size of that unique population is the population total Y of y. The CR estimate is the classical ratio estimate based on the auxiliary indicator variable x which is 1 for an individual in the subgroup, 0 otherwise. Laplace’s (1786) estimation of the population of France, with subgroup those registered as born in the previous three years and a cluster sample of over two million, can be regarded as the ﬁrst use of CR as well as the ﬁrst use of a superpopulation model (Pearson 1928, Cochran 1978). Two books discuss CR as sampling theory—by Thompson (1992) and Skalski and Robson (1992).
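Written out, the ratio estimate described above takes the following form. This is a sketch in notation chosen here for illustration: $s$ denotes the sample and $X$ the known size of the subgroup, that is, the population total of the indicator $x$.

```latex
\hat{Y} \;=\; X \cdot \frac{\sum_{i \in s} y_i}{\sum_{i \in s} x_i}
```

Since every sampled individual has $y_i = 1$, the numerator is simply the sample size, and the denominator is the number of sampled individuals who belong to the subgroup.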

## 1. Statistical Theory

Early discussion, summarized in, e.g., Seber (1982) or Cormack (1968), focused on distributions for the observations (hypergeometric, multinomial, etc.) under different rules for defining the subgroup and stopping the sampling. Practical differences are minor. The distribution of the estimate, especially for small subgroups or samples, is more important. If the population size is $N$, the size of the known subgroup is $X$, and a sample of size $n$ contains $x$ from the subgroup, then the intuitive estimate of $N$, which is also the maximum likelihood estimate, is $nX/x$, known as the Petersen estimate after the Danish fisheries biologist who first proposed its use in biology. This is the observed number of different individuals, $X + n - x$, plus the estimate $(X - x)(n - x)/x$ of the number $U$ of unobserved individuals. As a ratio estimate this is biased: the recommended correction (Chapman 1951) adds 1 to the denominator of the estimate of $U$, giving $(X - x)(n - x)/(x + 1)$. The distribution of the estimate of $U$ is skew, and if $U$ is small, or based on small observed numbers, then a confidence interval based on assumed normality can cause embarrassment by including negative values for $U$. Likelihood can be used directly (Cormack 1992) to avoid this. Other approaches have been proposed, with different loss functions and/or Bayesian methods. The latter are discussed below.
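The estimates in this paragraph can be sketched in a few lines of Python. The function names are illustrative, and the likelihood interval is only one possible reading of "using likelihood directly": it assumes a hypergeometric model for the sample and keeps every value of N whose log-likelihood lies within 1.92 of the maximum (half the 95% point of a chi-squared distribution with one degree of freedom, a conventional choice rather than anything prescribed in the text).

```python
import math


def petersen(X, n, x):
    """Petersen estimate nX/x of the population size N."""
    if x == 0:
        raise ValueError("no subgroup members in the sample; the estimate is infinite")
    return n * X / x


def chapman(X, n, x):
    """Observed total plus the bias-corrected estimate (X - x)(n - x)/(x + 1) of U."""
    return (X + n - x) + (X - x) * (n - x) / (x + 1)


def log_lik(N, X, n, x):
    """Hypergeometric log-likelihood of N, omitting terms that do not involve N."""
    def log_choose(a, b):
        return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)
    return log_choose(N - X, n - x) - log_choose(N, n)


def likelihood_interval(X, n, x, drop=1.92):
    """Range of integer N whose log-likelihood is within `drop` of its maximum."""
    if x == 0:
        raise ValueError("the likelihood has no finite maximum when x == 0")
    n_min = X + n - x          # N cannot be below the number actually observed
    mode = (n * X) // x        # integer maximizer of the hypergeometric likelihood
    cutoff = log_lik(mode, X, n, x) - drop
    lower = mode
    while lower > n_min and log_lik(lower - 1, X, n, x) >= cutoff:
        lower -= 1
    upper = mode
    while log_lik(upper + 1, X, n, x) >= cutoff:
        upper += 1
    return lower, upper
```

For example, with a subgroup of X = 200, a sample of n = 150 and x = 30 subgroup members in the sample, `petersen(200, 150, 30)` gives 1000.0 and `chapman(200, 150, 30)` gives about 978.06. The interval from `likelihood_interval` can never fall below the 320 individuals actually observed, so it cannot imply a negative value for U.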

The known subgroup often comprises individuals identiﬁable because they occur in another list or sample. In wildlife applications, in which the next developments took place, the two samples were often obtained by the same protocol at diﬀerent occasions, with a long enough period between for thorough mixing of the mobile animals to be (questionably) assumed. Generalizations of this two-sample case were made, to more samples (Darroch 1958) and, as time periods extended, to allow for the population to change through birth, death, or migration between samples. In open populations, survival estimation from recapture data on marked animals is more reliable than estimation of population size. Open population models were applied to historical lists by James and Price (1976): the willingness of residents of a medieval town to contribute to the town’s defense, and their ability to avoid paying a tax to the king, are clearly displayed, together with evidence of unsuspected epidemics.

The natural statistical way of representing counts in a number of categories, formed by the presence in or absence from a number $s$ of lists, is as the cells in a $2^{s}$ contingency table. Models and analytical techniques are summarized in books such as Bishop et al. (1975). In CR one cell is not observed, a structural zero, and the aim is, by formulating a series of models for the observed categories in the table, to estimate the missing number by extension of these models. When additional information on individuals is available in the form of quantitative covariates, such as age or severity of disease, the use of logistic models at the individual level was suggested independently by Alho (1990) and by Huggins (1989). The Horvitz–Thompson estimator is used, with the inclusion probabilities (the probabilities of being in each list) estimated from the model, a procedure developed differently by Pollock and Otto (1983). With no covariate information, or with categorical information only (e.g., gender, ethnicity, geographical region), log linear models form the current standard approach. Maximum likelihood estimates (MLE), perhaps bias corrected, are then derived. In this framework it is natural to examine the data for interactions between categories, evidence that the samples were not random. Birth and death, unequal availability of different individuals, and behavioral change consequent on physical capture of animals, have all been expressed as interactions within the log linear framework.
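As a minimal sketch of the Horvitz–Thompson step, suppose the per-list probabilities for each observed individual have already been estimated from a logistic model. The code below takes those fitted probabilities as given and assumes that, for a given individual, the lists are independent; both the function names and the coefficients in the usage note are hypothetical.

```python
import math


def logistic(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def inclusion_probability(list_probs):
    """Probability of appearing in at least one list.

    Assumes the lists are independent for this individual.
    """
    p_missed = 1.0
    for p in list_probs:
        p_missed *= 1.0 - p
    return 1.0 - p_missed


def horvitz_thompson_size(observed_probs):
    """Sum of 1/pi_i over the observed individuals, where pi_i is the
    probability that individual i is observed at all."""
    return sum(1.0 / inclusion_probability(ps) for ps in observed_probs)
```

For instance, three observed individuals who each have probability 0.5 of being in either of two lists are each observed with probability 0.75, so `horvitz_thompson_size([[0.5, 0.5]] * 3)` is 4, suggesting roughly one unobserved individual. With a covariate such as age, the per-list probabilities might be produced by something like `logistic(-1.0 + 0.02 * age)`, where the coefficients are made up for illustration.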

Even thus far, numerous quirks arise whereby the corresponding statistical theory does not quite apply to the CR problem. Estimation from a log linear model does not maximize the complete likelihood: nuisance parameters of sampling probabilities are estimated from the observed cells, and N estimated conditionally on them. Normally such multinomial models for complete tables are ﬁtted assuming a Poisson model, giving the estimates various desirable properties of the exponential family. With N unknown, the distributions of the estimate of N are diﬀerent under multinomial and Poisson models though the point estimate is the same and estimates of nuisance parameters have the same properties. Regularity conditions for desirable properties of MLE do not hold in the multinomial model, but do in the Poisson. The Horvitz–Thompson correspondence shows that the estimate of N is calculated via estimates of 1/(inclusion probabilities). Why then seek unbiased estimates of these probabilities? It is their inverses that are of interest.

## 2. Health And Social Studies

One widespread application in the social sciences is to ascertain how many individuals have some health or social problem, numbers needed for policy and budgetary decisions. Data are pre-existing lists from different agencies, each attempting to cover the whole population. Such lists have little or no random element in their formation. They will not, in general, be statistically independent. Individuals are referred from one agency to another, which makes those lists positively dependent. Conversely, two agencies may overlap, or be perceived to overlap, in their function so that an individual attending one may choose not to attend the other. Agencies primarily serving different sections of the population, geographically, by age, severity of problem, or whatever, will also be negatively dependent. Such correlation bias results in the standard estimate being an underestimate if lists are positively dependent, an overestimate if negatively dependent.
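The direction of the correlation bias can be checked with a small calculation on expected counts. The sketch below is illustrative only: `p1` and `p2` are assumed marginal probabilities of appearing in each list, `p_both` is the assumed probability of appearing in both, and the two-list estimate is applied to the expected list sizes and overlap rather than to simulated data.

```python
def expected_two_list_estimate(N, p1, p2, p_both):
    """Two-list (Petersen) estimate computed from expected counts.

    N      : true population size
    p1, p2 : marginal probabilities of being in list 1 and list 2
    p_both : probability of being in both lists
             (equal to p1 * p2 only under independence)
    """
    n1 = N * p1          # expected size of list 1
    n2 = N * p2          # expected size of list 2
    m = N * p_both       # expected overlap
    return n1 * n2 / m


N = 1000
independent = expected_two_list_estimate(N, 0.4, 0.5, 0.20)   # recovers N
positive = expected_two_list_estimate(N, 0.4, 0.5, 0.30)      # overlap too large
negative = expected_two_list_estimate(N, 0.4, 0.5, 0.10)      # overlap too small
```

With these made-up probabilities the estimate recovers 1000 under independence, falls to about 667 when the overlap is larger than independence implies, and rises to 2000 when it is smaller.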

With two lists dependence is not identiﬁable: independence must be taken on trust. With more than two, log linear models can be used to elucidate the pattern of dependence. With three lists, eight models with diﬀerent combinations of pairwise dependences are available. Extrapolation to the missing cell to estimate U and thus N requires the three-factor interaction to be zero or else expressible, from a model at the individual level, as a function of lower-order interactions. Most studies use some form of model selection to determine which dependencies must be allowed for, usually based on log likelihood ratio statistics from the analysis of deviance—the residual deviance and the diﬀerence between deviances from nested models. The complexity of a model is balanced against its badness of ﬁt. Many authors have advocated parsimony: others argue that a parsimonious model with independence between lists is diﬃcult to believe a priori. With the latter attitude, it might seem best to always ﬁt the saturated model, including all identiﬁable interactions. There are two objections to this. The ﬁrst is that if all lower-order interactions are nonzero, considerable doubt is cast on the necessary act of faith that the highest-order interaction is exactly zero. The other is that the estimate is very sensitive to any small cell count, and gives either a zero or inﬁnite estimate for the missing cell if any cell count is zero, an event which becomes more likely the more lists are included in the study.
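For three lists, the saturated model with the three-factor interaction set to zero gives a closed-form estimate of the missing cell, which makes the sensitivity to zero counts easy to see. The sketch below assumes cell counts are stored in a dictionary keyed by presence/absence triples; that layout and the function names are choices made here for illustration.

```python
import math


def missing_cell_no_three_factor(counts):
    """Estimate the unobserved (0, 0, 0) cell of a three-list table.

    counts maps (a, b, c), each 0 or 1 for absence or presence in lists
    1 to 3, to the number of individuals with that pattern. Setting the
    three-factor interaction to zero equates the odds ratio between two
    lists at both levels of the third, which yields the ratio below.
    """
    num = (counts[(1, 1, 1)] * counts[(1, 0, 0)]
           * counts[(0, 1, 0)] * counts[(0, 0, 1)])
    den = counts[(1, 1, 0)] * counts[(1, 0, 1)] * counts[(0, 1, 1)]
    if den == 0:
        # A zero pairwise-only cell gives an infinite (or undefined) estimate.
        return math.inf if num > 0 else math.nan
    return num / den


def population_estimate(counts):
    """Observed individuals plus the estimated missing cell."""
    return sum(counts.values()) + missing_cell_no_three_factor(counts)
```

A single zero among the cells in the numerator forces the estimate of the missing cell to zero, and a single zero among the cells in the denominator makes it infinite, which is the instability described above.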

Certain patterns of cells with small numbers cause problems for all analyses, in that they reveal that the data contain very little information about some dependencies, either having few overlaps between lists or, perhaps surprisingly, when one list has a very high coverage. Asymptotic distributions of criteria used for model selection no longer hold, and should be replaced by bootstrap resampling of data predicted by diﬀerent models.

More than three lists can be analyzed similarly except that, with more than ﬁve or six, the small number problem usually dominates and routine application of a program to analyze log linear models becomes impossible. One approach to model selection is then to start with the model including all pairwise dependencies, perhaps use backward elimination to simplify, and add higher-order terms by forward selection. However, making inferences conditional on the truth of a model selected from the data has been recognized as the Achilles heel of statistical inference. The alternative of averaging over all models has been applied to CR by several authors (see Sect. 5).

In the study design a list should be sought which is likely to be independent of the others. In studies on drug abuse, lists from police arrests have sometimes appeared to fulfil this condition relative to lists from social and health agencies. Questions of commonality of definition do arise. Moreover, patterns of interactions have to allow for individual heterogeneity as well as direct dependence between lists. The resulting confounding is discussed below.

## 3. Heterogeneity

Animals, particularly humans, are not the exchangeable colored balls required by ideal mathematical theory. There are reasons why individual $i$ is not observed in list $j$. To estimate the numbers missing from all lists requires exploration of these reasons. Exploration and thought are crucial to give even a qualitative impression of the effect of such individual heterogeneity on the estimate. To quantify it needs modeling of $p_{ij}$, the probability that individual $i$ is observed in list $j$. An early attempt by Cormack (1966) considered $p_{ij} = \theta_i \beta_j$, but this is superseded by the logit model $\log\{p_{ij}/(1 - p_{ij})\} = \theta_i + \beta_j$, introduced by Rasch (1960) for educational test scores. In the context of repeated samples with the same protocol and same effort, so that $\beta_j$ may be assumed constant, Burnham and Overton (1978) introduced the idea of the individual effects $\theta_i$ being a random sample from some distribution. Some workers have developed the consequences of assuming a beta distribution, but, if no specific form is assumed, the set of capture frequencies (numbers of individuals seen $k$ times) are jointly sufficient statistics for $N$ after integration over the distribution.
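The two ingredients of this paragraph, the logit model for an individual and the capture frequencies, can be sketched as follows. The representation of capture histories as tuples of 0s and 1s, one per observed individual, is an assumption made for the example.

```python
import math


def rasch_probability(theta_i, beta_j):
    """Probability that individual i appears in list j under the logit model
    log{p / (1 - p)} = theta_i + beta_j."""
    return 1.0 / (1.0 + math.exp(-(theta_i + beta_j)))


def capture_frequencies(histories):
    """Count how many observed individuals were seen exactly k times.

    histories is a sequence of 0/1 tuples, one per observed individual,
    with one entry per list or sampling occasion.
    """
    freq = {}
    for history in histories:
        k = sum(history)
        freq[k] = freq.get(k, 0) + 1
    return freq
```

For example, `capture_frequencies([(1, 0, 0), (1, 1, 0), (0, 0, 1)])` reports two individuals seen once and one seen twice. Individuals never seen do not appear in the input at all, which is exactly the count that has to be estimated.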

Without $\theta_i$ these models represent independent lists: differences between individuals necessarily cause dependence between lists. The Rasch model for individual behavior corresponds to a log linear model for cell counts with quasisymmetry, all interaction terms of the same order having the same value (Fienberg et al. 1999). Assuming a random model for an individual’s contribution to the probability of being in a list, and integrating over that distribution, creates a superpopulation model for a sample survey. With nonrandom lists, any inference must be model-based, and estimates from superpopulation models are standard in survey sampling. However, while it is easy to accept that an estimate of, say, mean income in a superpopulation gives a reasonable inference for the one finite population of interest, doubts surface when the parameter of interest is the size of that one finite population. Unfortunately, allowing direct dependence between lists in a Rasch model for an individual is not the same as a log linear model with only that one extra dependence term. CR has also been expressed as a latent class problem by Agresti (1994).

Heterogeneity can be reduced by analyzing subpopulations as separate strata. Diﬀerences in models selected and in estimated inclusion probabilities can be studied. Samples from some strata may however be too small to permit eﬀective model selection.

## 4. Coverage

Closely related to CR is the classical species problem. How many species are there in the world? As a specific example, how many words did Shakespeare know? Concordances list all occurrences of any word in each play. Some words are used in only one play. There must be many he knew, but never used, just as there are many undiscovered species of beetle. Some words occur much more frequently than others, stark evidence of heterogeneity between individuals (words). Apart from elegant models for specific applications, debate about sampling distributions and behavior of estimates has paralleled, and often predated, that about CR. A good review has been provided by Bunge and Fitzpatrick (1993). A key concept is that of coverage, the proportion of individuals of species represented in existing lists, perhaps easier to estimate than $N$, the unknown number of species. Chao and colleagues (e.g., Chao and Lee 1992) have developed various nonparametric estimates of $N$ via coverage. These can be linked to martingale approaches to CR by Yip (1991).
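A rough sketch of estimation via coverage is given below. It uses the Good–Turing style estimate of coverage, one minus the proportion of observations that are singletons, and then divides the number of observed species by that coverage. This simple ratio is appropriate only if all species are equally abundant; it is not the full estimator of Chao and Lee (1992), which adds a correction for heterogeneity. The input format, a list of observed counts per species, is an assumption of the example.

```python
def coverage_estimate(abundances):
    """Estimated sample coverage: 1 - f1 / n.

    abundances holds the number of times each observed species (or word)
    was seen; f1 is the number seen exactly once and n the total count.
    """
    n = sum(abundances)
    if n == 0:
        raise ValueError("no observations")
    f1 = sum(1 for a in abundances if a == 1)
    return 1.0 - f1 / n


def species_estimate_via_coverage(abundances):
    """Observed number of species divided by estimated coverage.

    Assumes equal abundance across species, so it tends to be too low
    when abundances are heterogeneous.
    """
    coverage = coverage_estimate(abundances)
    if coverage == 0.0:
        raise ValueError("every observation is a singleton; coverage is estimated as zero")
    return len(abundances) / coverage
```

With observed counts `[5, 3, 1, 1, 2]` there are 12 observations and 2 singletons, so the estimated coverage is 10/12 and the estimated number of species is 6, one more than the 5 observed.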

## 5. Bayesian Methods

Bayesian methods have a long history in mark recapture, even if Laplace’s inverse probability argument is not considered pure Bayes. Early formulations make the point that independence of population size N and capture probabilities may not be appropriate a priori. Not only prior beliefs about N but also model uncertainty can be built smoothly into the analysis. Madigan and York (1997) develop models, with some analytical tractability, for decomposable graphical models with hyper-Dirichlet priors for cell probabilities, respecting the model’s pattern of conditional independence, and allowing informative priors for N. Bayesian analysis of hierarchical log linear models has also been developed, an approach linked to the Rasch model by Fienberg et al. (1999).
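To illustrate how prior beliefs about N enter the analysis, the sketch below computes a posterior for N on a grid in the simplest two-sample setting. It is not the graphical-model approach of Madigan and York: it assumes a hypergeometric likelihood, a prior supplied by the user as a log-density (flat by default), and an upper bound `n_max` on N, all of which are choices made here for illustration.

```python
import math


def log_hypergeom(N, X, n, x):
    """log P(x | N, X, n): x subgroup members in a sample of n drawn
    from a population of N that contains a known subgroup of size X."""
    def log_choose(a, b):
        return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)
    return log_choose(X, x) + log_choose(N - X, n - x) - log_choose(N, n)


def posterior_for_N(X, n, x, n_max, log_prior=lambda N: 0.0):
    """Posterior probabilities for N on the grid X + n - x, ..., n_max."""
    n_min = X + n - x                      # smallest N consistent with the data
    if n_max < n_min:
        raise ValueError("n_max is below the number of individuals observed")
    grid = range(n_min, n_max + 1)
    log_post = [log_hypergeom(N, X, n, x) + log_prior(N) for N in grid]
    top = max(log_post)                    # subtract the maximum for stability
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return {N: w / total for N, w in zip(grid, weights)}


def posterior_mean(posterior):
    """Mean of a posterior returned by posterior_for_N."""
    return sum(N * p for N, p in posterior.items())
```

Passing `log_prior=lambda N: -math.log(N)` replaces the flat prior with one proportional to 1/N, which pulls the posterior toward smaller populations. Because the likelihood has a long right tail, the posterior mean under a flat prior depends on the choice of `n_max`, so the bound itself acts as prior information.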

## 6. Census Undercount

Recent national censuses show nontrivial undercounts. The possibility of adjustment has been widely explored and debated, in the US courts as well as the scientiﬁc and social literature. Because of the scale and expense, there are at least four major diﬀerences from applications discussed above. One is that matching and other nonsampling errors, present in other applications, gain prominence. Another is that the second list, from follow-up ﬁeld survey, is necessarily performed only on a sample of the population. A third is that only two lists are the norm. The fourth is that estimates are required not only for the population as a whole, but for particular minority groups in relatively small geographical areas. Statistical summaries of the US debate have appeared in special sections or issues of the Journal of the American Statistical Association (1993) and of Statistical Science (1994), and in Anderson and Fienberg (1999).

Matching errors tend to identify two records of the same individual as diﬀerent rather than vice versa. With two lists this leads to overestimation of N. Attempts to correct this, by algorithms which declare individuals identical, even when diﬀering in some identiﬁers, can give false positive matches, and hence underestimation. Probabilistic record linkage is discussed by Jaro (1995) and models for its eﬀect on CR estimates by Ding and Fienberg (1996).
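The direction of these two effects can be checked with simple arithmetic. In the sketch below, `miss_rate` is an assumed fraction of true matches that the linkage fails to find and `false_matches` an assumed number of pairs wrongly declared identical; both parameters are hypothetical and introduced only for this example.

```python
def two_list_estimate_with_matching_errors(n1, n2, true_matches,
                                           miss_rate=0.0, false_matches=0.0):
    """Two-list estimate n1 * n2 / m using the recorded, error-prone overlap.

    n1, n2        : sizes of the two lists
    true_matches  : number of individuals genuinely on both lists
    miss_rate     : fraction of true matches not recognized as matches
    false_matches : number of distinct individuals wrongly linked
    """
    recorded = true_matches * (1.0 - miss_rate) + false_matches
    if recorded <= 0:
        raise ValueError("no recorded matches; the estimate is infinite")
    return n1 * n2 / recorded


baseline = two_list_estimate_with_matching_errors(400, 500, 200)
with_misses = two_list_estimate_with_matching_errors(400, 500, 200, miss_rate=0.1)
with_false = two_list_estimate_with_matching_errors(400, 500, 200, false_matches=20)
```

With these made-up figures the error-free estimate is 1000, missing 10% of the true matches raises it to about 1111, and 20 false matches lower it to about 909.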

In recent usage, the second list comes from a postenumeration survey (PES), an intensive ﬁeld survey of small geographical areas, deﬁned slightly diﬀerently in diﬀerent countries, in a complex randomized sample plan, typically with strata and clusters at two or more levels. The relationship between CR and ratio estimators allows CR models to be combined formally with the sample design, though sampling theory suggests other ways of using auxiliary variables.

With two lists, the required independence must be built into the design. Sampling units must be defined to be small enough to minimize heterogeneity. Small numbers cause instability, and attempts are made to correct this by spatial smoothing over neighboring areas. The multiplicity of ways in which sample-based inference can be integrated with model-based inference from multilevel data with CR models is a topic of much current debate.

## 7. Summary

The ﬁrst diﬃculty with the use of CR is the deﬁnition of the population. If in a study of diabetes all observed individuals can be stratiﬁed by the medical care prescribed, then the missing numbers are not of undiagnosed diabetics, but of diabetics who have been diagnosed, but whose records have been lost. If diﬀerent agencies have diﬀerent deﬁnitions of homelessness, or cover diﬀerent geographical areas, then which population is it hoped to estimate? No statistical analysis of the observed data resolves these problems. Second, there is the problem of matching individuals, a problem which grows with the size of the study, and the separation of the analyst from the data collection. The more sensitive the subject of the study, the more prevalent will be deliberate misinformation from an individual. Matching problems will vary between countries that have diﬀering levels of conﬁdentiality and anonymity. Third, there are the analytic problems caused by the unreality, in many cases, of the assumption of random sampling. Lists are not independent, individuals are not behavioral clones, and the two aspects of dependence and heterogeneity may be at least partially confounded unless care is taken to obtain data which allows disentanglement. Finally, there is the problem of the variety of analytical approaches—which to believe? With several lists, some models can be rejected because of lack of ﬁt to the observed counts. However, failure to reject a model is not positive proof of its truth. Estimation of population size still requires extrapolation to the unobserved, the validity of which is untestable without randomness in the sampling.

Although absolute estimates of population size are often unjustiﬁable, it may be possible, by repeating the same protocol for list-formation at diﬀerent times or places, to obtain reliable assessments of population change or geographical variation in incidence of a disease or social phenomenon.

**Bibliography:**

- Agresti A 1994 Simple capture-recapture models permitting unequal catchability and variable sampling effort. Biometrics 50: 494–500
- Alho J M 1990 Logistic regression in capture-recapture models. Biometrics 46: 623–5
- Anderson M J, Fienberg S E 1999 Who Counts? The Politics of Census-taking in Contemporary America. Russell Sage Foundation, New York
- Bishop Y M M, Fienberg S E, Holland P W 1975 Discrete Multivariate Analysis. MIT Press, Cambridge, MA
- Bunge J, Fitzpatrick M 1993 Estimating the number of species: a review. Journal of the American Statistical Association 88: 364–73
- Burnham K P, Overton W S 1978 Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika 65: 625–33
- Chao A, Lee S-M 1992 Estimating the number of classes via sample coverage. Journal of the American Statistical Association 87: 210–17
- Chapman D G 1951 Some properties of the hypergeometric distribution with applications to zoological censuses. University of California Publications in Statistics 1: 131–60
- Cochran W G 1978 Laplace’s ratio estimator. In: David H A (ed.) Contributions to Survey Sampling and Applied Statistics. Academic Press, New York, pp. 3–10
- Cormack R M 1966 A test of equal catchability. Biometrics 22: 330–42
- Cormack R M 1968 The statistics of capture-recapture methods. Annual Review of Oceanography and Marine Biology 6: 455–506
- Cormack R M 1992 Interval estimation for mark-recapture studies of closed populations. Biometrics 48: 567–76
- Darroch J N 1958 The multiple-recapture census I: estimation of a closed population. Biometrika 45: 343–59
- Ding Y, Fienberg S E 1996 Multiple sample estimation of population and census undercount in the presence of matching errors. Survey Methodology 22: 55–64
- Fienberg S E, Johnson M S, Junker B W 1999 Classical multi-level and Bayesian approaches to population size estimation using multiple lists. Journal of the Royal Statistical Society A162: 383–405
- Huggins R M 1989 On the statistical analysis of capture experiments. Biometrika 76: 133–40
- James T B, Price N A 1976 Measurement of the change in populations through time: capture-recapture analysis of population for St. Lawrence parish, Southampton, 1454 to 1610. Journal of European Economic History 5: 719–36
- Jaro M A 1995 Probabilistic linkage of large public health data files. Statistics in Medicine 14: 491–8
- Laplace P-S 1786 Sur les naissances, les mariages et les morts à Paris depuis 1771 jusqu’en 1784; et dans toute l’étendue de la France, pendant les années 1781 et 1782. Mémoires de l’Académie des Sciences (1783) 693–702
- Madigan D, York J C 1997 Bayesian methods for estimation of the size of a closed population. Biometrika 84: 19–31
- Pearson K 1928 On a method of ascertaining limits to the actual number of marked members in a population of given size from a sample. Biometrika 20A: 149–74
- Pollock K H, Otto M C 1983 Robust estimation of population size in a closed animal population from capture-recapture experiments. Biometrics 39: 1035–49
- Rasch G 1960 Probabilistic Models for Some Intelligence and Attainment Tests. University of Chicago Press, Chicago
- Seber G A F 1982 The Estimation of Animal Abundance and Related Parameters. Griffin, London
- Skalski J R, Robson D S 1992 Techniques for Wildlife Investigations: Design and Analysis of Capture Data. Academic Press, San Diego, CA
- Thompson S K 1992 Sampling. Wiley-Interscience, New York
- Yip P 1991 A martingale estimating equation for a capture recapture experiment in discrete time. Biometrics 47: 1081–8