What makes data difficult to publish?
Difficulty publishing data is often tied to how easily the data can be extracted and the quality of the data itself. The following questions are intended to help guide your assessment:
|Is your data structured?||If your data is held in fixed fields within a record or file (e.g. in a database or spreadsheet), it will be much easier to publish than data in a document or paper format. Structured data typically has a data model that defines the fields in which data will be stored and specifies how it will be stored (e.g. data type: numeric, date, text, etc.). Structured data is much easier to extract from its data source(s), which makes it easier to publish.|
|Is your data complete?||Missing or incomplete data prevents users from being able to effectively aggregate and compare values. If your dataset is missing relevant data values, it may be necessary to complete records from other sources (e.g. paper records or electronic documents). The more extensive or widespread the gaps in your data are, the more difficult it may be to publish your dataset.|
|Is your data unambiguous?||Data ambiguity arises when what the data represents is not precisely defined. This can lead to data values being misinterpreted. For example, DHS can represent two different government agencies (Department of Homeland Security at the federal level or Department of Human Services at the state level), and a reference to a person (e.g. Smith, J.) could be associated with different individuals. Having to correct ambiguity in your data will make it more challenging to publish.|
|Is your data consistent?||Data inconsistency occurs where data values refer to the same thing but are recorded differently. For example, Mt. Pleasant and Mount Pleasant refer to the same town in Iowa, and Dept. of Public Health, DPH and Iowa Department of Public Health all refer to the same state agency. However, since they are not recorded in a consistent way, values cannot be properly aggregated and compared. Correcting data inconsistencies can make publishing your data more challenging.|
|Is your data redundant?||Redundant data usually occurs where data values for the same thing are recorded in multiple places. This could potentially lead to contradictions in the data. For example, if vendor addresses are entered on individual financial records, rather than in a unique vendor record, there is the potential for different addresses being recorded. When this happens, data users would not know which address is the correct one. Having to determine which data is correct will make publishing your data far more challenging.|
|Is your measured data based on a standard?||Measured data that lacks a standard reference method or measurement protocol carries uncertainty, because it cannot be easily replicated. If such data was also collected by multiple individuals, its accuracy becomes questionable. This is perhaps the most difficult data quality issue to deal with.|
|Is your data confidential or sensitive?||If your dataset contains confidential or sensitive data (e.g. data protected by state law, such as Iowa Code Section 22.7 or other applicable Iowa Code section, or federal law, such as the Health Insurance Portability and Accountability Act, Social Security Number Protection Act, and Family Educational Rights and Privacy Act), it will be more difficult to prepare for publication. De-identification and other disclosure requirements can greatly increase the burden of publishing the data for public use if protocols and procedures have not been developed. Confidential or sensitive data in some cases could prevent a dataset from being published as an open dataset altogether.|
|Are there existing processes in place for publishing your data?||Your agency may be able to leverage existing processes to publish the data, such as exports for periodic department reviews or routine exchanges of data with other agencies (e.g. data sharing agreements). This also includes any quality assurance processes that verify the quality and integrity of your datasets. Having such procedures in place may make the data easier to publish.|
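Several of the questions above (completeness, consistency, redundancy) can be checked roughly with a few lines of code before a dataset is assessed. The following is a minimal sketch only, not part of any official tool: the sample records, the field names, the `TOWN_ALIASES` mapping, and the helper functions are all hypothetical and would need to be adapted to your own data.

```python
from collections import defaultdict

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"vendor": "Acme Supply", "town": "Mt. Pleasant",
     "address": "100 Main St", "amount": 250.0},
    {"vendor": "Acme Supply", "town": "Mount Pleasant",
     "address": "100 Main Street", "amount": None},
    {"vendor": "Baker Co", "town": "Des Moines",
     "address": "5 Oak Ave", "amount": 75.5},
]


def missing_rate(rows, field):
    """Completeness: share of rows where the field is absent, None, or blank."""
    missing = sum(1 for r in rows if r.get(field) in (None, ""))
    return missing / len(rows)


# Assumed mapping from known variants (lowercased) to one canonical spelling.
TOWN_ALIASES = {"mt. pleasant": "Mount Pleasant"}


def normalize_town(name):
    """Consistency: replace a known variant with its canonical spelling."""
    cleaned = name.strip()
    return TOWN_ALIASES.get(cleaned.lower(), cleaned)


def conflicting_values(rows, key_field, value_field):
    """Redundancy: keys that were recorded with more than one distinct value."""
    seen = defaultdict(set)
    for r in rows:
        seen[r[key_field]].add(r[value_field])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


if __name__ == "__main__":
    print(missing_rate(records, "amount"))
    print(sorted({normalize_town(r["town"]) for r in records}))
    print(conflicting_values(records, "vendor", "address"))
```

A high missing rate, many unmapped spelling variants, or many keys with conflicting values each suggest more cleanup work, and therefore a higher difficulty rating for the dataset.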
The Data Inventory-Priority Matrix Worksheet is a tool for conducting a difficulty assessment and prioritizing datasets for publishing.