Confidentiality And Statistical Disclosure Research Paper


1. Data Quality Rests On Usefulness And Confidentiality Protection

We capture enormous numbers of rich personal and proprietary records, store them in data warehouses measured in terabytes, analyze them using sophisticated computers and algorithms, and disseminate them worldwide via telecommunications channels that are cheap and fast. This explosive progress in information technology—look at the World Wide Web—heightens the tension between confidentiality and data access. Accepted principles of information ethics (Duncan et al. 1993) require both that promises of confidentiality be preserved and that citizens have substantial access to information. Policies and procedures are needed to reconcile the data provider’s interest in confidentiality and the data user’s demand for data (Dalenius 1988). Government statistical agencies, such as the US Census Bureau, play a particularly significant and delicate brokering role in this reconciliation of competing demands. They often have legal authority to compel individuals and firms to provide data, and the data are both sensitive—like personal income—and so comprehensive and accurate as to be in substantial demand by a broad range of data users. Academic researchers, corporate planners, journalists, and government analysts all seek such data. Other organizations, such as health care providers, necessarily collect highly personal data in treating patients. These data are of substantial research value for issues spanning medical quality assurance, cost containment, and access to medical treatment by the poor (Duncan and Kaufman 1996).



A compromise of the confidentiality pledge could harm the agency, the respondent, the public, or some other group. A statistical disclosure occurs when a data dissemination allows a data snooper to gain excessive new information about respondents, so that the snooper can isolate individual respondents and their corresponding sensitive attribute values. Miller (1971, p. 176) illustrates the problem:

Some deficiencies inevitably crop up even in the Census Bureau. In 1963, for example, it reportedly provided the American Medical Association with a statistical list of one hundred and eighty-eight doctors living in Illinois. The list was broken down into more than two dozen categories, and each category was further subdivided by medical specialty and area residence; as a result, identification of individual doctors was possible …




As shown in Fig. 1, the process of assuring confidentiality through statistical disclosure limitation has the following components:

(a) a data quality audit that, beginning with the original collected data, evaluates data utility and assesses disclosure risk;

(b) a determination of adequacy of confidentiality protection;

(c) if confidentiality protection is inadequate, the implementation of a disclosure limitation procedure; and

(d) a return to the data quality audit.

Figure 1. The process of assuring confidentiality through statistical disclosure limitation.

2. Data Quality Audit: Data Utility And Disclosure Risk

A statistical agency audits the collected data, both to evaluate the utility of the data and to assess disclosure risk. Typically, with good survey design and implementation, the data utility is high. But the risk of disclosure through release of the original, collected data is also typically too high, even when the data have been deidentified, i.e., apparent identifiers (name, e-mail address, phone number, etc.) have been removed. Reidentification techniques have become too sophisticated for deidentification alone to assure confidentiality protection (Winkler 1998). A confidentiality audit will include identification of (a) sensitive objects, and (b) characteristics of the data that make them susceptible to attack.

Based on its confidentiality pledges, a statistical agency seeks to protect the instantiation of certain sensitive objects in the data from a data snooper. Sensitive objects can be the instantiations of a variety of variables associated with a subject entity (person, household, enterprise, etc.). Examples include the values of numerical variables, such as household income, a realization of an x-ray of a patient’s lung, and a specific patent application. A second class of sensitive objects is instances of relationships, whether social or mathematical. An example of the former is ‘Who is friends with whom?’ in a social network; an example of the latter is an Internal Revenue Service formula to trigger a tax audit.

Collected data having particular characteristics pose substantial risk of disclosure. Characteristics that suggest vulnerability include:

(a) geographical detail (Greenberg and Zayatz 1992);

(b) longitudinal or panel structure;

(c) outliers;

(d) many attribute variables;

(e) population data, as in a census, rather than a survey with small sampling fraction; and

(f) existence of databases that are publicly available, identified, and share individual respondents and attribute variables with the subject data.

Data with geographical detail, such as census tract data, may be easily linked to known characteristics of respondents. This concern leads statistical agencies to set minimum population thresholds for the geographical identifiers they will release. Longitudinal data, which track entities over time, also pose substantial disclosure risk: many electrical engineers lived in the Chicago area in 1998 and many lived in the Phoenix area in 1999, but few did both. Outliers, especially on variables like net worth, can easily lead to identifiable respondents. Data with many attribute variables allow easier linkage with known attributes of identified entities, and entities that are unique in the sample are more likely to be unique in the population. Further, population data, i.e., a census or near census, pose more disclosure risk than data arising from a survey with a small sampling fraction. Finally, special concern is warranted when other databases available to the data snooper are identified and share with the subject data both individual respondents and certain attribute variables. Record linkage may then be possible between the subject data and the external database, with the shared attribute variables providing the key.

3. Types Of Statistical Disclosure

The legitimate objects of inquiry for statistical research are aggregates over individual records, for example, the median annual income of Hispanic social scientists in the United States. The statistical agency seeks to provide users with data that allow accurate inference about such population characteristics. At the same time, because of its confidentiality pledge, the agency seeks to thwart a data snooper who might use the disseminated data to draw accurate inferences about, say, the income of a particular Hispanic sociologist who now works in New York City. Such a capability by a data snooper constitutes a statistical disclosure.

There are two major types of disclosure: identity disclosure and attribute disclosure. Identity disclosure occurs with the association of a respondent’s identity with a disseminated data record (Spruill 1983). Attribute disclosure occurs when either an attribute value in the disseminated data, or an estimate based on it, can be associated with the respondent (Duncan and Lambert 1989, Lambert 1993). In the case of identity disclosure, the association is assumed exact; in the case of attribute disclosure, it can be approximate. Most statistical agencies place emphasis on limiting the risk of identity disclosure, perhaps because it is substantially equivalent to the inadvertent release of an identified record, a clear administrative slip-up. An attribute disclosure, on the other hand, even though it invades the privacy of a respondent, may not be so easily traceable to actions of the agency.

4. Measures Of Disclosure Risk

In the context of identity disclosure, disclosure risk can arise because a data snooper may be able to use the disseminated data product to re-identify some de-identified records. Spruill (1983) proposed a measure of disclosure risk for microdata: For each ‘test’ record in the masked file, compute the Euclidean distance between the test record and each record in the source file. Determine the percentage of test records that are closer to their parent source record than to any other source record. She defines the risk of disclosure to be the percentage of test records that match the correct parent record multiplied by the sampling fraction (fraction of source records released).
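A minimal sketch of Spruill’s matching measure, assuming numeric microdata held as NumPy arrays and a masked file whose rows correspond by index to their parent source records; the function name and toy data are illustrative, not from the original paper.

```python
import numpy as np

def spruill_risk(source, masked, sampling_fraction):
    """Spruill-style re-identification risk for masked microdata.

    source: (n, p) array of original records; masked[i] is assumed to be
    the masked version of source[i]. Returns the fraction of masked
    records whose nearest source record (Euclidean distance) is their
    own parent, multiplied by the sampling fraction of records released.
    """
    # Pairwise Euclidean distances between masked and source records.
    dists = np.linalg.norm(masked[:, None, :] - source[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    match_rate = (nearest == np.arange(len(masked))).mean()
    return match_rate * sampling_fraction

# Toy example: mask by adding mild noise, then measure the risk.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_masked = X + rng.normal(scale=0.1, size=X.shape)
print(spruill_risk(X, X_masked, sampling_fraction=1.0))
```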

More generally, and consistent with Duncan and Lambert (1986, 1989), the agency will have succeeded in protecting the confidentiality of a released data product if the data snooper remains sufficiently uncertain about a protected target value after data release. From this perspective, a measure of disclosure risk is built on measures of uncertainty. Further, the agency may model the decision making of the data snooper as a basis for using disclosure limitation to deter inferences about a target. Data snoopers are deterred from publicly making inferences about a target when their uncertainty is sufficiently high. Mathematically, uncertainty functions provide a workable framework for this analysis. Examples include Shannon entropy, which has found use in categorizing continuous microdata and coarsening of categorical data (Willenborg and de Waal 1996, p. 138).
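As one concrete instance of an uncertainty function, the sketch below computes the Shannon entropy of a snooper’s probability vector over candidate matches; the numbers are illustrative, and the point is simply that a release which concentrates that distribution lowers the snooper’s residual uncertainty.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (in bits) of a probability vector over candidate
    matches; higher entropy means more residual snooper uncertainty."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return -(p * np.log2(p)).sum()

# Before release: indifference among 8 candidate records (3 bits).
print(shannon_entropy([1/8] * 8))
# After release: the posterior concentrates on one record (~0.42 bits).
print(shannon_entropy([0.94, 0.02, 0.02, 0.02]))
```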

Generally, a data snooper has a priori knowledge about a target, often in the form of a database with identified records (Adam and Wortmann 1989). Certain variables may be common to this database and the subject database; these are called key, or identifying, variables. When a single record matches on the key variables, the data snooper has a candidate record for identification. This candidacy is promoted to an actual identification if the data snooper is convinced that the individual is in the target database, either because the snooper has auxiliary information to that effect or because the snooper is convinced that the individual is unique in the population. The data snooper may find that, according to certain key variables, a sample record is unique; the question then is whether the individual is also unique on these key variables in the population. Bethlehem et al. (1990) have examined detection of records agreeing on simple combinations of keys based on discrete variables in the files. Record linkage methodologies have been examined by Fuller (1993) and Winkler (1998).
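A hedged sketch of the sample-uniqueness check described above, using pandas; the key variable names and the toy records are hypothetical, chosen only to show the mechanics.

```python
import pandas as pd

# Hypothetical key variables that an external, identified database
# might share with the released microdata.
keys = ["age_group", "sex", "zip3", "occupation"]

def sample_uniques(df, keys):
    """Return records that are unique in the sample on the key variables.
    Such records are linkage candidates; whether they are also unique in
    the population determines the actual identification risk."""
    counts = df.groupby(keys)[keys[0]].transform("size")
    return df[counts == 1]

df = pd.DataFrame({
    "age_group":  ["30-39", "30-39", "70-79"],
    "sex":        ["F", "F", "M"],
    "zip3":       ["152", "152", "152"],
    "occupation": ["engineer", "engineer", "physician"],
})
print(sample_uniques(df, keys))   # the physician row is sample-unique
```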

5. Restricted Data Through Statistical Disclosure Limitation

Direct transformations of data for confidentiality purposes are called disclosure-limiting masks (Jabine 1993). With masked data sets, there is a specific functional relationship, possibly a function of multiple records and possibly stochastic, between masked values and the original data. Because of this relationship, the possibilities of both identity and attribute disclosure continue to exist, even though the risk of disclosure may be substantially reduced. The idea is to provide a response that, while useful for statistical analysis purposes, has sufficiently low disclosure risk. As a general classification, disclosure-limiting masks can be categorized as suppressions, recodings, or samplings.

5.1 Suppression

A suppression is a refusal to provide a data instance. For microdata, this can involve the deletion of all values of some particularly sensitive variable. In principle, certain record values could also be suppressed, but this is usually handled through recoding. For tabular data, the values of table cells that pose confidentiality problems are suppressed; these are the primary suppressions. Often, a cell is considered unsafe for publication according to the (n, p) dominance rule, i.e., if a few (n), say 3, contributing entities represent p percent, say 70 percent, or more of the cell total, as in the sketch below. Additionally, enough other cells are suppressed so that the values of the primary suppressions cannot be inferred from released table margins; these additional cells are called secondary suppressions. Tables of realistic dimensionality, even with only a few primary suppressions, present a multitude of possible configurations for the secondary cell suppressions. This raises computational difficulties that can be formulated as combinatorial optimization problems. Typical techniques employed include mathematical programming (especially integer programming) and graph theory (Chowdhury et al. 1999).
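A minimal implementation of the (n, p) dominance rule as stated above; the default thresholds n = 3 and p = 70 simply mirror the running example and are not prescribed values.

```python
def dominant(contributions, n=3, p=70.0):
    """(n, p) dominance rule: a cell is unsafe if its n largest
    contributions account for p percent or more of the cell total."""
    total = sum(contributions)
    if total == 0:
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total >= p

# One firm contributes 80 of 100: unsafe under the (3, 70) rule.
print(dominant([80, 5, 5, 5, 5]))       # True
# Five equal contributors: top three give 60 of 100, below 70 percent.
print(dominant([20, 20, 20, 20, 20]))   # False
```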

5.2 Recoding

A disclosure-limiting mask for recoding creates a set of data for which some or all of the attribute values have been altered. Recoding can be applied to microdata or to tabular data.

Examples of recoding as applied to microdata include data swapping, adding noise, and global recoding with local suppression. In data swapping (Dalenius and Reiss 1982, Spruill 1983), some fields of a record are swapped with the corresponding fields of another record. Concerns have been raised that, while data swapping lowers disclosure risk, it may excessively distort the statistical structure of the original data (Adam and Wortmann 1989); a combination of data swapping with additive noise has been suggested by Fuller (1993). Masking through the introduction of additive or multiplicative noise has also been investigated (Fuller 1993). A disclosure limitation method for microdata used in the µ-Argus software is the combination of global recoding and local suppression. Global recoding combines several categories of a variable to form less-specific categories; topcoding is a specific example. Local suppression suppresses certain values of individual variables (Willenborg and de Waal 1996). The aim is to reduce the set of records in which only a few agree on particular combinations of key values. Both methods make the data less specific and so result in some information loss to the legitimate researcher.
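The sketch below illustrates three of these microdata masks (topcoding as a global recode, additive noise, and a single data swap) on a toy income vector; the cutoff and noise scale are illustrative policy choices, not values prescribed by the literature.

```python
import numpy as np

rng = np.random.default_rng(1)
income = np.array([28_000, 41_000, 56_000, 73_000, 1_450_000], dtype=float)

# Topcoding (a global recode): censor values above a cutoff, so the
# outlier at 1,450,000 no longer singles out its contributor.
TOPCODE = 150_000
income_topcoded = np.minimum(income, TOPCODE)

# Additive noise: perturb each value; the scale trades off disclosure
# risk against distortion of the statistical structure.
income_noisy = income_topcoded + rng.normal(scale=5_000, size=income.shape)

# Data swapping: exchange the field between two randomly chosen records.
i, j = rng.choice(len(income), size=2, replace=False)
income_swapped = income_topcoded.copy()
income_swapped[[i, j]] = income_swapped[[j, i]]
```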

Some common methods of recoding for tabular data are global recoding and rounding. Under global recoding, categories are combined; this represents a coarsening of the data through combining rows or combining columns of the table. Under rounding, every cell entry is rounded to some base b. The controlled rounding problem is to find some perturbation of the original entries that satisfies (typically marginal) constraints and that is ‘close’ to the original entries (Cox 1987). Multidimensional tables present special difficulties; methods for dealing with them are given by Kelley et al. (1990).
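For intuition, the sketch below implements simple unbiased random rounding of each cell to base b; Cox’s (1987) controlled rounding additionally forces the rounded table to satisfy its marginal constraints, which this toy version does not attempt.

```python
import numpy as np

def random_round(table, b=5, rng=None):
    """Unbiased random rounding of each cell to a multiple of base b.
    A cell x is rounded up with probability (x mod b) / b, so the
    expected value of the rounded cell equals x. Unlike controlled
    rounding, margins of the result need not add up."""
    rng = rng or np.random.default_rng()
    table = np.asarray(table, dtype=float)
    lower = b * np.floor(table / b)          # round-down candidate
    frac = (table - lower) / b               # probability of rounding up
    return lower + b * (rng.random(table.shape) < frac)

print(random_round([[12, 3], [7, 48]], b=5, rng=np.random.default_rng(0)))
```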

Markov perturbation (Duncan and Fienberg 1999) makes use of stochastic perturbation through entity moves according to a Markov chain. Because of the cross-classified constraints imposed by the fixing of marginal totals, moves must be coupled. This coupling is consistent with a Gröbner basis structure (Fienberg et al. 1998). In a graphical representation, it is consistent with data flows corresponding to an alternating cycle, as discussed by Cox (1987).

5.3 Sampling

Sampling as a disclosure-limiting mask creates an appropriate statistical sample of the original data. Alternatively, if the original data is itself a sample, the data may be considered self-masked. Just the fact that the data are a sample may not result in disclosure risk sufficiently low to permit data dissemination. In that case, subsampling may be required to obtain a data product with adequately low disclosure risk.

Whether for microdata or tabular data, many of these transformations can be represented as matrix masks (Duncan and Pearson 1991), M = AXB + C, where X is an n × p data matrix. In general, the defining matrices A, B, and C can both depend on the values of X and be stochastic. The matrix A, since it operates on the rows of X, is a record-transforming mask; the matrix B, since it operates on the columns of X, is a variable-transforming mask; and the matrix C is a displacing mask (noise addition).
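A small NumPy illustration of the matrix mask M = AXB + C, with one arbitrary choice for each component: here A averages disjoint pairs of records, B leaves the variables unchanged, and C adds noise. These choices are illustrative only; any of the masks discussed above can be cast in this form.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 3
X = rng.normal(size=(n, p))          # original microdata, n records x p variables

# A: record-transforming mask -- average disjoint pairs of records,
# producing n/2 aggregated rows.
A = np.kron(np.eye(n // 2), np.full((1, 2), 0.5))   # shape (n/2, n)

# B: variable-transforming mask -- identity, variables left unchanged.
B = np.eye(p)

# C: displacing mask -- additive noise matched to the masked shape.
C = rng.normal(scale=0.1, size=(n // 2, p))

M = A @ X @ B + C                     # the released, masked data
print(M)
```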

6. Synthetic Or Model-Based Data

The methods described so far have involved perturbations or masking of the original data; these are called data-conditioned methods by Duncan and Fienberg (1999). Another approach, while less studied, should be conceptually familiar to statisticians. Consider the original data to be a realization according to some statistical model, and replace the original data with samples (the synthetic data) drawn according to that model. Synthetic data sets consist of records of individual synthetic units rather than the records the agency holds for actual units. Rubin (1993) suggested constructing synthetic data through a multiple imputation method. It remains an open research question what impact imputing an entire microdata set has on data utility.
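A hedged sketch of the model-based idea, using a Gaussian mixture as the generating model; Rubin’s (1993) proposal rests on multiple imputation, which this minimal single-model example does not reproduce, and the data and number of components are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))       # stand-in for confidential microdata

# Fit a mixture model to the original data; the model form and the
# number of components are agency choices that carry model uncertainty.
model = GaussianMixture(n_components=3, random_state=0).fit(X)

# Release draws from the fitted model instead of the real records:
# no released record corresponds to any actual unit.
X_synthetic, _ = model.sample(n_samples=500)
```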

Rubin (1993) asserts that the risk of identity disclosure can be eliminated through the dissemination of synthetic data and proposes the release of synthetic microdata sets for public use. His reasoning is that the synthetic data carries no direct functional link between the original data and the disseminated data. So while there can be substantial identity disclosure risk with (inadequately) masked data, identity disclosure is, in a strict sense, impossible with the release of synthetic data. However, the release of synthetic data may still involve risk of attribute disclosure (Fienberg et al. 1998).

Rubin (1993) cogently argues that the release of synthetic data has advantages over other data dissemination strategies, because

(a) masked data can require special software for proper analysis for each combination of analysis, masking method, and database type (Fuller 1993);

(b) release of aggregates, e.g., summary statistics or tables, is inadequate because of the difficulty of anticipating, at the data-release stage, what analysts might like to do with the data; and

(c) mechanisms for the release of microdata under restricted access conditions, e.g., user-specific administrative controls, can never fully satisfy the demands for publicly available microdata.

The methodology for the release of synthetic data is simple in concept but complex in implementation. Conceptually, the agency would use the original data to determine a model with which to generate the synthetic data. But the purpose of this model is not the usual prediction, control, or scientific understanding that argues for parsimony through Occam’s Razor. Instead, its purpose is to generate synthetic data useful to a wide range of users. The agency must recognize uncertainty in both the model form and the values of model parameters, which argues for the relevance of hierarchical and mixture models in generating the synthetic data.

7. Summary

A society based on democratic and free market principles cannot function without broad access to data, nor can it sustain those principles without affirming the individual autonomy assured by privacy and confidentiality. Statistical disclosure limitation methods work to ensure that the tension between data access and confidentiality can be resolved in ways favorable to both. This methodology draws on disparate fields of mathematics and is challenged to develop in response to the growing capability of information technology to capture, store, and disseminate data.

Bibliography:

  1. Adam N R, Wortmann J C 1989 Security-control methods for statistical databases: A comparative study. ACM Computing Surveys 21: 515–56
  2. Bethlehem J G, Keller W J, Pannekoek J 1990 Disclosure control of microdata. Journal of the American Statistical Association 85: 38–45
  3. Chowdhury S D, Duncan G T, Krishnan R, Roehrig S F, Mukherjee S 1999 Disclosure detection in multivariate categorical databases: Auditing confidentiality protection through two new matrix operators. Management Science 45: 1710–23
  4. Cox L H 1980 Suppression methodology and statistical disclosure control. Journal of the American Statistical Association 75: 377–85
  5. Cox L H 1987 A constructive procedure for unbiased controlled rounding. Journal of the American Statistical Association 82: 38–45
  6. Dalenius T 1988 Controlling Invasion of Privacy in Surveys. Department of Development and Research, Statistics Sweden
  7. Dalenius T, Reiss S P 1982 Data-swapping: A technique for disclosure control. Journal of Statistical Planning and Inference 6: 73–85
  8. Duncan G T, Jabine T B, de Wolf V A 1993 Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. Panel on Confidentiality and Data Access, Committee on National Statistics, National Academy Press, Washington, DC
  9. Duncan G T, Fienberg S E 1999 Obtaining information while preserving privacy: a Markov perturbation method for tabular data. Eurostat. Statistical Data Protection ’98, Lisbon 1999. Office for Official Publications of the European Communities, Luxembourg, pp. 351–62
  10. Duncan G T, Kaufman S 1996 Who should manage information and privacy conflicts?: Institutional design for third-party mechanisms. The International Journal of Conflict Management 7: 21–44
  11. Duncan G T, Lambert D 1986 Disclosure-limited data dissemination (with discussion). Journal of the American Statistical Association 81: 10–28
  12. Duncan G T, Lambert D 1989 The risk of disclosure of microdata. Journal of Business and Economic Statistics 7: 207–17
  13. Duncan G T, Pearson R 1991 Enhancing access to microdata while protecting confidentiality: Prospects for the future (with discussion). Statistical Science 6: 219–39
  14. Federal Committee on Statistical Methodology 1994 Statistical Policy Working Paper 22: Report on Statistical Disclosure Limitation Methodology. US Office of Management and Budget, Washington, DC
  15. Fienberg S E, Makov U E, Steele R J 1998 Disclosure limitation using perturbation and related methods for categorical data. Journal of Official Statistics 14: 347–60
  16. Fuller W A 1993 Masking procedures for microdata disclosure limitation. Journal of Official Statistics 9: 383–406
  17. Greenberg B, Zayatz L 1992 Strategies for measuring risk in public use microdata files. Statistica Neerlandica 46: 33–48
  18. Jabine T 1993 Statistical disclosure limitation practices of United States statistical agencies. Journal of Official Statistics 9: 427–54
  19. Kelley J, Golden B, Assad A 1990 Controlled rounding of tabular data. Operations Research 38: 760–72
  20. Lambert D 1993 Measures of disclosure risk and harm. Journal of Official Statistics 9: 313–31
  21. Miller A R 1971 The Assault on Privacy: Computers, Data Banks and Dossiers. University of Michigan Press, Ann Arbor, MI
  22. Rubin D B 1993 Satisfying confidentiality constraints through the use of synthetic multiply-imputed microdata. Journal of Official Statistics 9: 461–8
  23. Spruill N L 1983 The confidentiality and analytic usefulness of masked business microdata. Proceedings of the Section on Survey Research Methods, American Statistical Association 602–7
  24. Willenborg L, de Waal T 1996 Statistical Disclosure Control in Practice. Lecture Notes in Statistics 111. Springer, New York
  25. Winkler W E 1998 Re-identification methods for evaluating the confidentiality of analytically valid microdata. Research in Official Statistics 1: 87–104
