What should I consider when checking the quality of a dataset?
Your agency should implement a quality assurance process (i.e. document data checking and review steps) for datasets you intend to publish on data.iowa.gov. How rigorous your quality assurance process needs to be will depend largely on the nature of the dataset and how it is used. This guidance is not intended to prescribe what quality assurance should look like for your data, but rather to highlight a few items to consider before publishing a dataset.
Many datasets will be based on complex queries involving joins, which have the potential to affect the number of rows returned if done incorrectly. For datasets that are not grouped, you should compare the number of records (or rows) in your dataset to the primary table (i.e. the most granular table with the key numeric data you want to include) before publishing your dataset. For datasets that are grouped and contain aggregated numeric values, you will want to compare values in the dataset against reports in the source system.
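As a minimal sketch of the ungrouped case, the record-count comparison can be scripted. The file names here are hypothetical placeholders for your published dataset and an export of the primary table:

```python
import csv

def row_count(path):
    """Count data rows in a CSV file, excluding the header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)          # skip the header
        return sum(1 for _ in reader)

def counts_match(dataset_path, primary_path):
    """Flag a possible join problem: the published dataset's row count
    should equal that of the primary (most granular) table."""
    return row_count(dataset_path) == row_count(primary_path)

# Hypothetical usage:
# counts_match("published_dataset.csv", "primary_table_export.csv")
```

A mismatch does not pinpoint the faulty join, but it tells you rows were dropped or duplicated somewhere and the query needs review.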
If your dataset has categorical data that was not the result of a “many to 1” or “1 to 1” join, or the database, spreadsheet, or file where the data resides does not enforce input standards, your dataset may contain inconsistent values. For example:
- “Mt. Pleasant” and “Mount Pleasant” refer to the same town in Iowa
- “Dept. of Public Health”, “DPH” and “Iowa Department of Public Health” all represent the same state agency
- “M” and “Male,” or “F” and “Female” refer to the same gender
- A categorical value with leading and/or trailing spaces, and the same value without them
- Different addresses for the same entity
If inconsistent data values are present in your dataset, aggregated numeric data grouped by inconsistent categorical data will be inaccurate. Where you think this may be the case, you should consider running a pivot table on your data or using some other method to list the unique values in the categorical data and identify values that represent the same category. If you have multiple values representing the same category, you will want to decide which value you intend to keep and which values you intend to replace. The effort to correct 100% of the inconsistencies found may not be worth it; any known inconsistencies that are not corrected should be highlighted in the dataset’s disclaimers, under limitations.
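The unique-value review and replacement steps above can be sketched in Python. The column name and replacement table are hypothetical examples; in practice you would build the table from the unique values you find:

```python
from collections import Counter

def unique_values(records, column):
    """Tally each distinct value in a categorical column so near-duplicates
    like 'Mt. Pleasant' / 'Mount Pleasant' stand out, pivot-table style."""
    return Counter(row[column] for row in records)

def standardize(records, column, replacements):
    """Replace known inconsistent values with the value you chose to keep,
    stripping leading/trailing spaces along the way."""
    for row in records:
        value = row[column].strip()
        row[column] = replacements.get(value, value)
    return records

# Hypothetical usage:
# unique_values(records, "city")
# standardize(records, "city", {"Mt. Pleasant": "Mount Pleasant"})
```

Keeping the replacement table in a script (rather than hand-editing cells) also documents the corrections for future data updates.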
There may also be instances where the definition of categorical values has changed over time (e.g. school consolidations, business mergers, etc.). This is likely hard to detect in the data itself, and it will make comparisons over time using those categories challenging. Where this is known, you should attempt to document or highlight it, as it may constrain how the data can be used.
Missing values or attributes (i.e. contextual and categorical data) can produce misleading results. To identify areas with missing values, scan the dataset and note gaps in the records. You can also group and aggregate the data at various levels of granularity and compare the results against other reports; discrepancies might indicate missing values. If you have missing values, you will want to determine whether they are localized in some way or random. From there, you will want to determine what you intend to do:
- Can the missing values be ignored? Do you intend to omit records with missing data?
- Do you intend to replace the missing values with substitute values?
Any known missing values that are ignored or omitted should be highlighted in the dataset’s disclaimers, under completeness. Where substitute values are used in place of missing values, it is important to make note of it under limitations.
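A simple per-column tally is one way to start the scan described above and to see whether gaps are concentrated in particular fields. This sketch treats empty strings and absent keys as missing, which is an assumption to adjust for your data:

```python
def missing_by_column(records, columns):
    """Count empty or absent values per column. A concentration of gaps
    in one column suggests the problem is localized rather than random."""
    missing = {c: 0 for c in columns}
    for row in records:
        for c in columns:
            value = row.get(c)
            if value is None or str(value).strip() == "":
                missing[c] += 1
    return missing
```

Pairing these counts with a grouping column (e.g. year or county, if your data has one) helps show whether the gaps cluster in a subset of the records.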
Data ambiguity arises when what the data represents is not precisely defined due to incomplete or conflicting definitions, subjective differences in evaluating data, or where a record could be classified in more than one way. This can lead to data values being misinterpreted. Most issues pertaining to ambiguity will need to be addressed at the point of data collection or data entry. However, some steps can be taken to address some ambiguity. For instance:
- Abbreviations and acronyms can cause ambiguity as they can mean different things (e.g. DHS can represent two different government agencies: Department of Homeland Security at the federal level or Department of Human Services at the state level). Can you take steps to expand abbreviations and spell out acronyms before you publish your data?
- Dates with the year represented as two digits are another form of ambiguity that you should address before publishing your data. Can you convert two digit years to four digit years before publishing your data?
- Many words can have multiple meanings. Can you provide a data dictionary and/or highlight the meanings of extensively used terms in the dataset’s metadata to ensure you provide appropriate context?
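The first two steps above (expanding acronyms and widening two-digit years) can be scripted. The acronym table is a hypothetical example, and the century pivot is a common but lossy convention, so confirm the rule fits your data before relying on it:

```python
import re

# Hypothetical lookup table; extend with your agency's acronyms.
ACRONYMS = {"DPH": "Iowa Department of Public Health"}

def expand_acronym(value, table=ACRONYMS):
    """Spell out a known acronym; pass unknown values through unchanged."""
    return table.get(value, value)

def four_digit_year(date_str, pivot=50):
    """Convert an M/D/YY date to M/D/YYYY. Two-digit years below `pivot`
    are assumed to be 20xx and the rest 19xx -- an assumption, not a rule."""
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{2})", date_str)
    if not m:
        return date_str              # already four digits, or not a date
    month, day, yy = m.groups()
    century = 2000 if int(yy) < pivot else 1900
    return f"{month}/{day}/{century + int(yy)}"
```

Word-sense ambiguity, by contrast, cannot be scripted away; it belongs in the data dictionary and metadata.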
It is worthwhile to check that the values provided in your data are consistent with the data type assigned. For instance, text in numeric fields will not be imported correctly. Also, are values consistent with what you would expect to see (e.g. zip codes with something other than five or nine digits, phone numbers with fewer than ten digits, etc.)?
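Checks like these can be automated with simple patterns. This sketch flags the row indexes that fail, using hypothetical column names; the ZIP pattern assumes the five-digit or ZIP+4 formats:

```python
import re

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")     # five digits, or ZIP+4

def bad_zips(records, column):
    """Return indexes of rows whose ZIP code is not 5 or 9 digits."""
    return [i for i, row in enumerate(records)
            if not ZIP_RE.match(str(row.get(column, "")).strip())]

def non_numeric(records, column):
    """Return indexes of rows where a supposedly numeric field
    cannot be parsed as a number."""
    bad = []
    for i, row in enumerate(records):
        try:
            float(str(row.get(column, "")).strip())
        except ValueError:
            bad.append(i)
    return bad
```

Reviewing the flagged rows before publishing is usually faster than tracing import failures afterward.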
To ensure the data is useful to citizens and the public, you will want to verify that your dataset contains the variables (i.e. columns) you expected it to.
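A one-line header check can guard against a variable silently dropping out of an export. The expected column names here are hypothetical:

```python
def missing_columns(fieldnames, expected):
    """Report expected variables (columns) absent from the dataset header."""
    return [c for c in expected if c not in fieldnames]

# Hypothetical usage with a csv.DictReader:
# missing = missing_columns(reader.fieldnames, ["County", "Year", "Amount"])
```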
A variety of data types need to be formatted in specific ways to be imported successfully into data.iowa.gov.
Any checks and/or corrections that need to be made during data updates should be documented.
Quality assurance encompasses how your data is collected, stored, managed, and assessed.
With 1 to 1 joins, one record in the primary table matches one record in the joined table. With many to 1 joins, many records in the primary table match one record in the joined table.
Program Area: Open Data, Quality