What safeguards can prevent disclosure of confidential and sensitive data?

Datasets often do not need to contain complete data records to be valuable.  Where confidential or sensitive data exists, there are safeguards you can use; a few methods are described below.  The method or combination of methods you use will depend on the disclosure risks associated with your data.  Many laws, regulations and federal program requirements are also intended to protect privacy.  Some of these may specifically dictate what data can be made available and may require certain safeguard methods to be used.  Which methods you use is a decision left to your agency.

Suppressing Fields/Records

Suppressing fields involves excluding, in their entirety, any fields (columns) in a dataset that contain personally identifiable information or confidential data from the version made publicly accessible.  Suppressing records (rows) in a dataset involves excluding those rows in their entirety.  However, you are encouraged to review other safeguard measures before suppressing records.

If any codes or identifiers are left in the dataset to allow records to be re-identified, they must not be known or accessible to unauthorized parties, or linkable to other external records.
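The two kinds of suppression described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the records, the field names (name, ssn, sealed) and the helper functions are hypothetical and do not come from any actual agency dataset.

```python
# Hypothetical records; every field name and value here is illustrative only.
records = [
    {"name": "A. Smith", "ssn": "000-00-0001", "county": "Polk", "amount": 250.0, "sealed": False},
    {"name": "B. Jones", "ssn": "000-00-0002", "county": "Linn", "amount": 410.0, "sealed": True},
    {"name": "C. Brown", "ssn": "000-00-0003", "county": "Polk", "amount": 125.0, "sealed": False},
]

# Assumed list of fields (columns) that must not appear in the public version.
CONFIDENTIAL_FIELDS = {"name", "ssn", "sealed"}


def suppress_records(rows, is_confidential):
    """Return only the rows for which is_confidential(row) is false."""
    return [row for row in rows if not is_confidential(row)]


def suppress_fields(rows, fields):
    """Return copies of the rows with the given fields removed entirely."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


# Drop whole rows first (while the flag is still available), then drop columns.
public_rows = suppress_fields(
    suppress_records(records, lambda row: row["sealed"]),
    CONFIDENTIAL_FIELDS,
)
```

Record suppression runs first in this sketch only because the hypothetical flag used to select rows is itself one of the fields removed afterwards.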

Suppressing Data Values

Suppressing data values involves replacing data values with a generic term when only a portion of the records in a specific field contain confidential data.  For example, in our checkbook data, values for Vendor and Vendor Number have been redacted and replaced with “Restricted” where the record was flagged as confidential.
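As a rough sketch of value suppression, the Python below masks two fields only on rows flagged as confidential.  The sample payments, the field names and the confidential flag are assumptions made for illustration; they are not the structure of the actual checkbook dataset.

```python
RESTRICTED = "Restricted"  # generic term used in place of the real value

# Hypothetical payment rows; the "confidential" flag is an assumed field.
payments = [
    {"vendor": "Acme Supply", "vendor_number": "V001", "amount": 1200.00, "confidential": False},
    {"vendor": "Jane Doe", "vendor_number": "V002", "amount": 310.50, "confidential": True},
]


def suppress_values(rows, fields, flag="confidential", placeholder=RESTRICTED):
    """Replace the listed fields with a placeholder on flagged rows only."""
    result = []
    for row in rows:
        masked = dict(row)  # copy, so the source data is left untouched
        if masked.get(flag):
            for field in fields:
                masked[field] = placeholder
        result.append(masked)
    return result


public_payments = suppress_values(payments, ["vendor", "vendor_number"])
```

Unflagged rows pass through unchanged, and the amount is kept on every row, so the dataset remains complete for totals while the identifying values are hidden.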

Generalizing, Collapsing or Top/Bottom Coding Data

Generalizing, collapsing or top/bottom coding data values involves using methods to make the data less precise as a means to ensure confidential data is protected.  These methods can be used where unique characteristics result in small frequency counts.

Generalizing often applies to categorical or contextual data.  This could include modifying a full address to present only the city, state and zip code, or providing a county name rather than a city or coordinate.  There may also be instances where categorical data is too specific, can lead to the identification of an individual, and needs to be replaced with a broader category (e.g. race categories for Chinese, Filipino, Japanese, Korean, and Vietnamese could be changed to a broader category of Asian).  It may even be necessary to make your broader category “Other” in order to preserve as much detail as possible in the dataset.
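A minimal Python sketch of generalizing is shown here, using the race categories and address example from the paragraph above.  The mapping table, the function names and the address field names (street, city, state, zip) are hypothetical choices for illustration.

```python
# Mapping from specific categories to a broader one (taken from the example above).
BROADER_CATEGORY = {
    "Chinese": "Asian",
    "Filipino": "Asian",
    "Japanese": "Asian",
    "Korean": "Asian",
    "Vietnamese": "Asian",
}


def generalize(value, mapping, default=None):
    """Map a value to its broader category.

    Unmapped values pass through unchanged, unless a default such as
    "Other" is supplied, in which case they are replaced by that default.
    """
    if value in mapping:
        return mapping[value]
    return value if default is None else default


def generalize_address(address):
    """Keep only city, state and zip code from a full address record."""
    return {key: address[key] for key in ("city", "state", "zip")}
```

For instance, generalize("Korean", BROADER_CATEGORY) gives "Asian", and passing default="Other" sends any value without a mapping to the catch-all category.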

Collapsing or top/bottom coding involves transforming numeric data into ordinal[1] or interval[2] data.  For instance, rather than providing the specific age of the individual associated with a record, you provide a range (e.g. 0 – 18 years old).  Top coding and bottom coding involve collapsing only the top-end and/or low-end values into one category, rather than transforming all values into ordinal or interval data.  For instance, a high income could lead to the identification of an individual.  Top coding would involve generalizing values above a certain level (e.g. >$500,000)[3].  The same principles apply to bottom coding; it is simply applied at the opposite end of the range.
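The following Python sketch illustrates collapsing ages into ranges and top/bottom coding a dollar value.  Apart from the 0 to 18 range and the $500,000 level mentioned above, the age bands, the bottom-coding floor and the function names are assumptions chosen only for the example.

```python
# Assumed age bands; only the first band comes from the text above.
AGE_BANDS = [(0, 18), (19, 34), (35, 49), (50, 64)]


def collapse_age(age):
    """Replace a specific age with the range it falls in; 65 and over is top-coded."""
    if age < 0:
        raise ValueError("age cannot be negative")
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low} - {high}"
    return "65+"


def top_code(value, cap=500_000):
    """Report values above the cap as one category; leave other values unchanged."""
    return f">${cap:,}" if value > cap else value


def bottom_code(value, floor=10_000):
    """Report values below the floor as one category; leave other values unchanged."""
    return f"<${floor:,}" if value < floor else value
```

Note that top_code and bottom_code return text for coded values and the original number otherwise, so a published column that uses them will mix both; whether that is acceptable depends on how the dataset will be consumed.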

Grouping Data

Grouping data involves combining individual records (e.g. food assistance grants) into a single row of data for each set of records that share common characteristics.  Groups are defined using categorical data or combinations of categorical data (e.g. county and month).  Numeric values contained in the individual records (e.g. grants awarded) are totaled and/or averaged, and frequency counts of records (e.g. number of grants) are provided for each group.  By doing this, you are essentially publishing your dataset in aggregated form.  Grouping data is most commonly used where the previous safeguards do not adequately protect the data from disclosure risk, or where you want to report on a confidential variable that may be disclosed in summary form.
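Grouping as described above can be sketched in Python as follows, using county and month as the grouping fields.  The grant records, the field names and the output column names (grant_count, total_amount, average_amount) are hypothetical.

```python
from collections import defaultdict

# Hypothetical individual grant records.
grants = [
    {"county": "Polk", "month": "2017-01", "amount": 200.0},
    {"county": "Polk", "month": "2017-01", "amount": 300.0},
    {"county": "Polk", "month": "2017-02", "amount": 150.0},
    {"county": "Linn", "month": "2017-01", "amount": 400.0},
]


def group_grants(rows, keys=("county", "month"), value="amount"):
    """Aggregate rows into one summary row per combination of the key fields."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])

    summary = []
    for key in sorted(groups):  # sorted so the output order is stable
        amounts = groups[key]
        out = dict(zip(keys, key))
        out["grant_count"] = len(amounts)                      # frequency count
        out["total_amount"] = sum(amounts)                     # total
        out["average_amount"] = sum(amounts) / len(amounts)    # average
        summary.append(out)
    return summary


summary = group_grants(grants)
```

With the sample data this yields three summary rows, one for each county and month combination that appears in the records.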

However, even aggregated data may still enable someone to derive, or closely estimate, information about specific individuals.  In these instances, you may need to further suppress the data.[4]  Where numerical data needs to be further suppressed, consider whether the rows containing the suppressed data could be further aggregated and reported at a higher level, so that summaries remain possible that could not be produced if the data were omitted entirely.
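One way to screen aggregated cells before publishing is sketched below in Python.  It combines a minimum record count with the (n,k) dominance rule named in footnote 4, under which a cell is treated as sensitive when its largest n contributions make up more than a share k of the cell total.  The thresholds used here (a minimum of 5 records, n = 1, k = 80%) are assumptions for illustration only, not values prescribed by this guidance.

```python
def fails_dominance_rule(contributions, n=1, k=0.8):
    """Return True if the largest n contributions exceed share k of the total."""
    total = sum(contributions)
    if total <= 0:
        return False  # nothing to dominate in an empty or zero-valued cell
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return top_n / total > k


def publishable_total(contributions, min_count=5, n=1, k=0.8):
    """Return the cell total, or None where the cell should be suppressed.

    A cell is suppressed if it has fewer than min_count records or if it
    fails the (n,k) dominance rule. All thresholds are illustrative.
    """
    if len(contributions) < min_count:
        return None
    if fails_dominance_rule(contributions, n, k):
        return None
    return sum(contributions)
```

A cell with three records is suppressed on count alone, while a cell with six records can still be suppressed if a single record accounts for more than 80% of its total.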

Explain Safeguards Used

If you employ safeguards to protect personally identifiable information and other confidential and sensitive data, you should describe these in the dataset’s metadata to ensure they are applied consistently over time.


[1] Ordinal data is similar to categorical data except that there is a clear ordering (e.g. low, medium, high)

[2] Interval data is similar to ordinal data except that the intervals are equally spaced (e.g. 1 – 10, 11 – 20, … , 51 – 60, etc.)

[3] There are methods for basing top codes/bottom codes on a percentage of the total records, but this could cause the values that fall into these categories to change over time.

[4] Two commonly used methods to determine the sensitivity associated with aggregated data are the (n,k) dominance rule and the (p,q) prior posterior rule.

Program Area
Transparency
Topic(s)
Open Data, Privacy, Confidentiality, Safeguards

Printed from the Iowa Department of Management website on January 22, 2018 at 12:30pm.