11  Data Destruction or Termination + Future Challenges

This week, we will learn about the importance of destroying or terminating data once a project ends.

11.1 Quick recap on week 5

11.1.1 Data privacy laws in the United States of America

The laws governing data collection and dissemination are limited to specific federal agencies (e.g., U.S. Census Bureau) or specific data types (e.g., education).

The California Consumer Privacy Act (CCPA) gives consumers more control over the personal information that businesses collect about them, with accompanying regulations that provide guidance on how to implement the law. In November 2020, California voters approved Proposition 24, the California Privacy Rights Act, which amended the CCPA and added privacy protections that took effect on January 1, 2023.

11.1.2 Identifying the technical and policy solutions

Decision matrix

Figure 11.1: Model decision matrix of disclosure-protection strategies given potential disclosure harms and data usefulness from Reiter et al. (2024).

Aspirational equitable statistical data privacy workflow

Figure 11.2: A visual representation of the aspirational equitable statistical data privacy workflow from Bowen and Snoke (2023).

A Framework for Managing Disclosure Risks in Blended Data

  1. Determine auspice and purpose of the blended data project.

    1. What are the anticipated final products of data blending?
    2. What are potential downstream uses of blended data?
    3. What are potential considerations for disclosure risks and harms, and data usefulness?
  2. Determine ingredient data files.

    1. What data sources are available to accomplish blending, and what are the interests of data holders?
    2. What steps can be taken to reduce disclosure risks and enhance usefulness when compiling ingredient files?
  3. Obtain access to ingredient data files.

    1. What are the disclosure risks associated with procuring ingredient data?
    2. What are the disclosure risk/usefulness trade-offs in the plan for accessing ingredient files?
  4. Blend ingredient data files.

    1. When blending requires linking records from ingredient files, what linkage strategies can be used?
    2. Are resultant blended data sufficiently useful to meet the blending objective?
  5. Select approaches that meet the end objective of data blending.

    1. What are the best-available scientific methods for disclosure limitation to accomplish the blended data objective, and are sufficient resources available to implement those methods?
    2. How can stakeholders be engaged in the decision-making process?
    3. What is the mitigation plan for confidentiality breaches?
  6. Develop and execute a maintenance plan.

    1. How will agencies track data provenance and update files when beneficial?
    2. What is the decision-making process for continuing access to or sunsetting the blended data product, and how do participating agencies contribute to those decisions?
    3. How will agencies communicate decisions about disclosure management policies with stakeholders?

11.1.3 Types of dissemination

Figure 11.3: The “Complexity Pyramid” from Schwabish (2020).

11.1.4 Data visualizations

Similar to outlining a presentation, don’t race to the computer. Start with pen and paper (or colored pencils/markers!), chalkboard and chalk, etc., to sketch out your data visualization.

  • Show the data
  • Reduce the clutter
  • Integrate the graphics and text
  • Avoid spaghetti charts
  • Start with gray (see the sketch below)
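
To make the last two principles concrete, here is a minimal matplotlib sketch with made-up data (the group names and values are purely illustrative): every series is first drawn in light gray, and only the line of interest is highlighted, avoiding the spaghetti effect.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=1)
years = np.arange(2010, 2021)
lines = {f"Group {i}": rng.normal(0, 1, len(years)).cumsum() for i in range(1, 9)}

fig, ax = plt.subplots()
# Draw every series in light gray first so no single line dominates.
for values in lines.values():
    ax.plot(years, values, color="lightgray", linewidth=1)
# Then highlight only the series the reader should focus on.
ax.plot(years, lines["Group 3"], color="black", linewidth=2, label="Group 3")
ax.set_xlabel("Year")
ax.set_ylabel("Cumulative change")
ax.legend(frameon=False)
plt.show()
```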

11.2 Data destruction or termination

Upon project completion, your team should consider the physical destruction of all copies of confidential data as required by contract (e.g., a data use agreement). Alternatively, after three years, remove all identifiers from the data and maintain a name-code index separately in a secure location.
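
As a rough sketch of that second option (in Python with pandas; the column names, codes, and file paths are hypothetical), direct identifiers are split off into a name-code index that is stored separately from the analysis file:

```python
import pandas as pd

# Hypothetical raw file containing a direct identifier.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "income": [52000, 61000, 48000],
})

# Build the name-code index: one arbitrary code per participant.
crosswalk = pd.DataFrame({
    "name": df["name"],
    "code": [f"P{i:04d}" for i in range(len(df))],
})

# De-identified analysis file: swap the name for the code.
deidentified = df.merge(crosswalk, on="name").drop(columns="name")
deidentified.to_csv("analysis_file.csv", index=False)

# The index itself goes to a separate, secure location (e.g., an
# encrypted, access-controlled drive), never alongside the analysis file.
crosswalk.to_csv("name_code_index.csv", index=False)
```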

Contract for data destruction

Projects involving confidential data, such as Personally Identifiable Information (PII), Protected Health Information (PHI), or credit card information, typically include a clause requiring data destruction within a specified period after the project’s end. The destruction must be carried out using acceptable methods, often specified for “readily accessible storage.” Some contracts require proof of data destruction, and many companies have a process for issuing a Certificate of Destruction (or an equivalent document). Your contract may also stipulate that the destruction be witnessed and/or notarized, in which case a third party must confirm that the data were destroyed.

Most private companies and other entities have their own policies for data disposal. Generally, acceptable methods fall into two categories: logical and physical.

11.2.1 Logical

Logical destruction erases data by overwriting it with multiple read/write passes, such as those performed by Pretty Good Privacy1 File Shredder. By contrast, simply emptying the file system’s recycle bin only removes references to the files; it does not overwrite the underlying data.

11.2.2 Physical

Physical disposal involves destroying the data media itself, such as shredding paper documents and CDs/DVDs, degaussing magnetic media, or drilling holes through hard drives.

Figure 11.4: Ron Swanson from the TV show “Parks and Recreation,” smashing a phone with a hammer on his work desk.

11.2.3 Data destruction types

The National Institute of Standards and Technology (NIST) Special Publication 800-88, Guidelines for Media Sanitization, defines three types of data destruction:

  1. Clear applies logical techniques to sanitize data in all user-addressable storage locations, protecting against simple, non-invasive data recovery techniques. It is typically applied through the standard read and write commands to the storage device, such as by rewriting with a new value or, where rewriting is not supported, using a menu option to reset the device to the factory state.
  2. Purge applies physical or logical techniques that render Target Data recovery infeasible using state-of-the-art laboratory techniques.
  3. Destroy renders Target Data recovery infeasible using state-of-the-art laboratory techniques and also renders the media unusable for subsequent data storage.

The Department of Defense 5220.22-M standard requires a minimum of three passes to erase electronic data: pass 1 writes all zeroes, pass 2 writes all ones, and pass 3 writes random data. This renders the data unrecoverable even by an adversary with physical access to the storage medium.
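
As an illustration of what those passes look like at the file level, here is a minimal Python sketch (the function name is ours, not part of the standard). Note that overwriting may not reach every physical block on SSDs, wear-leveled flash, or journaling file systems, so real sanitization should use your organization’s approved tools:

```python
import os
import secrets

def three_pass_shred(path: str) -> None:
    """Overwrite a file with zeroes, ones, then random bytes, and delete it."""
    size = os.path.getsize(path)
    passes = [
        lambda n: b"\x00" * n,             # pass 1: all zeroes
        lambda n: b"\xff" * n,             # pass 2: all ones
        lambda n: secrets.token_bytes(n),  # pass 3: random data
    ]
    for make_pattern in passes:
        with open(path, "r+b") as f:       # reopen at the start of the file
            remaining = size
            while remaining > 0:
                chunk = min(remaining, 1 << 20)  # write in 1 MiB chunks
                f.write(make_pattern(chunk))
                remaining -= chunk
            f.flush()
            os.fsync(f.fileno())           # force this pass to disk

    os.remove(path)  # finally, unlink the overwritten file

# Example (hypothetical path): three_pass_shred("confidential.csv")
```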

11.2.4 Ethics and equity considerations

In addition to security and privacy considerations, we should think about the ethical and equity reasons for destroying data. Some points to consider are:

  • How should researchers who collect data inform participants about data destruction policies without impacting the data collection process?
    • Consider the transparency aspect.
  • Are there situations where destroying data could be unethical?
    • Consider scenarios where data could be valuable for future research versus scenarios where its existence might cause harm.
  • How might data destruction disproportionately impact marginalized or vulnerable populations?
    • Consider how to engage stakeholders.

11.3 Future Challenges

Data curators, stakeholders, privacy experts, and data users all face challenges in expanding the use of government data, especially in building linkages between administrative and survey data. Unless and until data privacy can be reliably protected, such data are likely to be available only narrowly, or not at all.

11.3.1 Educating the data user community

Little is known about the expectations and needs of data users in general, let alone their understanding and perceptions of more modern data privacy methods. Williams et al. (2023) conducted a convenience sample survey of economists from the American Economic Association on their baseline knowledge of differential/formal privacy (the concept used for the 2020 Census), their attitudes toward differentially/formally private frameworks, the types of statistical methods that are most useful to economists, and how the injection of noise under formal privacy would affect the value of queries to the user. At a high level, the survey found that most economists are unfamiliar with formal privacy and differential privacy (and those who do know about it are skeptical). Economists rely on simple methods for analyzing cross-sectional administrative data but have a growing need to conduct more sophisticated research designs, and they have a low tolerance for errors, which is incompatible with existing formal privacy definitions and methods.
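
To make the “injection of noise” concrete, here is a minimal sketch of the Laplace mechanism, a standard way to release a differentially private count; the data and the epsilon values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=20)

def dp_count(values, predicate, epsilon):
    """Return a noisy count of records satisfying `predicate`.

    A count query has sensitivity 1 (one person can change it by at most 1),
    so Laplace noise with scale 1/epsilon satisfies epsilon-differential
    privacy.
    """
    true_count = sum(predicate(v) for v in values)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Made-up incomes; query: how many exceed $75,000?
incomes = rng.normal(60000, 15000, size=10000)
for epsilon in [0.1, 1.0, 10.0]:
    noisy = dp_count(incomes, lambda x: x > 75000, epsilon)
    print(f"epsilon={epsilon}: noisy count = {noisy:,.1f}")
```

Smaller values of epsilon buy stronger privacy at the cost of noisier answers, which is exactly the trade-off behind the low error tolerance that the surveyed economists reported.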

The results from the Williams et al. (2023) survey are not surprising. In general, traditional statistical data privacy methods are more intuitive and easier to explain, such as why data curators should remove unique records. In contrast, formally private methods are more complex and lack an intuitive definition. Although there has been an explosion of new communication materials to explain formal privacy and other data privacy concepts,2 such efforts are only beginning to fill a chasm. To put it into perspective, if we asked random economists to recommend their favorite education or communication materials about, say, machine learning or artificial intelligence, many would have a favorite book or blog series in mind. They might even suggest materials focused on concepts, on a particular field’s perspective, or on coding. If we asked the same economists the same question, but for data privacy in the context of safe access to administrative and survey data, they would likely have few recommendations, or none at all.

One way to address the lack of education and communication materials is to teach the next generation and increase the number of people in the field. Yet despite the need for data privacy education, most higher education institutions do not offer dedicated courses on the topic. When data privacy is taught, it is typically at the graduate level within computer science departments. Some undergraduate professors who research data privacy and confidentiality may introduce these topics in seminar courses, but they are not usually stand-alone courses. As a result, individuals with technical backgrounds outside of computer science, such as economists, are greatly underrepresented in this important area of study. Therefore, departments outside of computer science should consider hosting their own statistical data privacy courses or incorporating these concepts into existing courses. When integrating these concepts, professors can encourage students to consider the legal, social, and ethical implications of data privacy, as well as questions of equity. They can also delve into the principles of data guardianship, custodianship, and data permissions (Williams and Bowen 2023).

11.3.2 Addressing data equity in data privacy

The methods used to protect individuals’ information do not always have an equal impact on all groups represented in the data. A published dataset might ensure the privacy of people in the majority but fail to ensure the privacy of those in smaller groups. Similarly, alterations to the data may be more useful for learning about some groups than others. Ultimately, how entities collect and share data can have varying effects on underrepresented groups of people.

Although there are many discussions on data equity and data privacy, few conversations focus on equity in privacy. In light of this (and as we learned in the previous class), Bowen and Snoke (2023) developed a guide as part of the “Do No Harm Guides” series. This fourth installment of the series focuses on exploring the current state of equity-focused work in statistical data privacy. The authors conducted interviews with nine experts in privacy-preserving methods and data-sharing, including researchers and practitioners from academia, government, and industry sectors with diverse technical backgrounds. The authors asked about the experience of these experts in implementing statistical data privacy methods and how they define equity in the context of privacy, among other topics. The authors then created an illustrative example to highlight potential disparities that can result from applying various statistical data privacy concepts (including suppression, synthetic data, and differential privacy) without an equitable workflow. Here are some of their key takeaways: do not treat equity as a separate field of study; work with groups represented in your data; and there is no methodological silver bullet.
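
To make the suppression takeaway concrete, here is a minimal sketch (with made-up counts and a hypothetical threshold) of how a routine cell-suppression rule can erase the smallest groups from a published table:

```python
import pandas as pd

k = 11  # hypothetical suppression threshold

table = pd.DataFrame({
    "group": ["Group A", "Group B", "Group C", "Group D"],
    "count": [5200, 880, 42, 7],
})

# Suppress any cell with fewer than k records.
table["published_count"] = table["count"].where(table["count"] >= k, other=pd.NA)
print(table)
```

The largest groups are published untouched while the smallest group vanishes from the release entirely, which is the kind of disparity an equitable workflow is meant to surface.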

11.3.3 Engaging with data privacy issues

Besides becoming a privacy researcher, there are a few prominent options for learning about data privacy methods and becoming involved in these topics. For instance, the Joint Program in Survey Methodology at the University of Maryland has been offering a course on synthetic data.3 The Urban Institute offered an all-day course at the 2023 Joint Statistical Meetings, where the instructors introduced the basics of data privacy.4 The Urban Institute has also offered similar trainings for the Bureau of Economic Analysis, Allegheny County, and the Statistics of Income Division.

There is currently no dedicated conference focused on the intersection of data privacy and public policy, but interest in the field is growing. In 2023, the National Bureau of Economic Research5 and the National Institute of Statistical Sciences6 hosted separate data privacy workshops that brought together privacy experts and data users. Attendees from these workshops are organizing the first-ever Privacy and Public Policy Conference in 2024 with the goal “to foster and enhance collaboration among privacy experts, researchers, data stewards, data practitioners, and public policymakers.”7 With the recent surge of venues, the time is ripe to help shape the future of data privacy, make meaningful contributions to its policy debates, and ensure the responsible representation of people in data.


  1. “Pretty Good Privacy is an encryption program that provides cryptographic privacy and authentication for data communication.” from https://en.wikipedia.org/wiki/Pretty_Good_Privacy↩︎

  2. One of my favorites is a video created by minutephysics for the US Census Bureau, available at “Protecting Privacy with MATH (Collab with the Census),” (accessed on June 27, 2024).↩︎

  3. “Synthetic Data: Balancing Confidentiality and Quality in Public Use Files,” a course by Joerg Drechsler and Jerome P. Reiter. Course schedule no longer available.↩︎

  4. See “Introduction to Data Privacy and Data Synthesis Techniques,” a course by Aaron R. Williams and Claire McKay Bowen.↩︎

  5. “Data Privacy Protection and the Conduct of Applied Research: Methods, Approaches, and their Consequences, Spring 2023,” hosted by the National Bureau of Economic Research. Event site no longer available.↩︎

  6. “IOF Workshop: Advancing Demographic Equity with Privacy Preserving Methodologies,” hosted by the National Institute of Statistical Sciences.↩︎

  7. Privacy and Public Policy Conference↩︎