How do I prepare a dataset to publish on the Open Data Portal?
There are a number of things you should consider when preparing datasets to publish to ensure they are relevant to potential users.
Your dataset is ready when it:
- Provides sufficient level of detail - granularity
- Is based on queries with proper joins
- Contains headings
- Provides a key field where available
- Includes numeric, contextual, and categorical data
- Sums across columns and not down rows
- Includes addresses or other location data if available
- Is free of decorative formatting
- Logically orders columns
- Uses vertical data orientation
There are no hard and fast rules about what level of detail is sufficiently granular to add value to a dataset. Users of data.iowa.gov may come from a variety of fields and specialties, including developers, and academic and other government users who can envision a use for the raw data not originally anticipated. Raw or very granular data also provides more options for presenting or summarizing the data (i.e. looking at it in different ways). You should strive to provide the level of detail in your datasets to where each row in the dataset represents something (e.g. crime, inspection, permit, license, etc.), rather than showing aggregate totals or subtotals. By providing granular datasets, you facilitate making data more meaningful to a wider cross section of the public, and can maximize the use of visualization tools available with data.iowa.gov. However, safeguards to prevent the disclosure of personally identifiable information and other confidential data may limit how granular a dataset may be.
Many datasets will be created with the use of complex queries involving joins of two or more tables. Joins have the potential to affect the number of rows returned and ultimately the accuracy of the dataset. To help avoid problems, it is a good practice to begin joins with the most granular table or query. Often times, this table or query will contain the key Numeric Data for your dataset that you want citizens to be able to summarize (e.g. count of offenders in prison, sum of payments made, etc.). By beginning joins on this table or query, you can better trust that any aggregates on your primary table or query will be accurate. Oftentimes, the tables or queries you join to the primary table or query will contain Contextual Data or Categorical Data (e.g. institution, payee, etc.), which facilitates summarizing the data.
Problems can occur if one row in your primary table or query can match up with multiple rows in a joined table or query. This can lead to duplicate numeric values, which would cause inaccuracies with any aggregate function.
Your agency must include column identifiers (i.e. headings) in the first row of your file for each column containing data. When creating headings, you should:
- Keep them short and meaningful
- Remove merged cells
- Keep alpha characters lower case
- Avoid the use of symbols in headings (e.g. &, %, *)
- Use underscores in place of spaces in headings
- Ensure each column heading is unique
This will allow the system to use the headings you provide as the API field name for the column, and will facilitate Automating Updates in the future. In the system, you will have an option for Changing Column Properties and provide more descriptive or human readable labels and definitions – both of which can further expand users’ understanding of your data.
When possible, you should include a column that uniquely identifies each row in the dataset. This value should be unique and not be contained in more than one row. Values are typically numbers or some alphanumeric code. Providing row identifiers can help facilitate programmatically and automatically updating datasets. A row identifier will also allow developers to use this column to power applications. That way if columns are deleted or added, developers are ensured that the applications built off of the row identifier will not break. It can also help you later on where specific records in the dataset need to be corrected or replaced due to errors.
It is highly recommended that you use numeric and date data types whenever possible. This allows the system to perform calculations on numeric data to create charts, and use dates to build calendars that is not possible with plain text. If you have rows (or records) with missing values, due to either data collection issues, or for confidentiality reasons, it is recommended that you create another column to flag or tag the record and briefly define the reason for missing data.
Identification numbers and numeric codes such as state county codes, FIPS codes, etc., where leading zeroes are significant, must be provided as text values to prevent the leading zeroes from being truncated. This can be a problem particularly when Excel is used to prepare the data, as Excel defaults to the “General” format for columns and assumes that any value that looks like a number must be a quantity, resulting in truncation of leading zeroes.
It is important that you include information in your dataset that give numeric data context. For instance, a dataset containing payment transactions is far more meaningful if the payee, transaction date and expense description is also provided.
Categorical data is often broader groups or categories for the contextual data previously noted. Grouping data is often necessary to effectively calculate (e.g. sum or average) numeric data and provide meaningful and understandable summaries for citizens. While some users will want access to the “raw” data, most citizens will find visual summaries, such as a chart or map, far more helpful. Categorical data facilitates grouping data. Reference tables in source systems often provide the categorical data you will want to include. You can also review your data to see if column headings can be transformed into data values (e.g. store dates in rows, not columns).
It is best to only store the raw data and leave calculations up to the system. So, don’t include those rows providing subtotals or totals for a group of records (i.e. rows). This makes your data more flexible by allowing it to be summarized in many different ways. However, having a column that sums or totals values from other columns is fine.
Being able to associate the records (i.e. rows) in your dataset to a geographic location can allow you and users of data.iowa.gov to build maps using your data. Records can be tied to a geographic location using a full or partial addresses, or more complex geometric structures (i.e. lines and polygons, see file formats).
Clean out all the extra decorative rows and columns that make an excel document look great, leaving just one set of column headings and no indents, asterisks, or merged cells.
The following provide general rules of thumb related to how columns should be ordered within your file to provide some uniformity to how data is presented.
- Column containing Row Identifiers should be the left most column in the file
- Columns containing Categorical Data and Contextual Data should be placed to the left of columns containing Numeric Data
- Columns containing alphanumeric codes should be placed to the immediate left of columns containing the labels for the codes (e.g. column containing department numbers should be placed to the left of the column containing the department names)
- Columns containing Categorical Data that are hierarchical should be adjacent to one another. Additionally, the parent column should be placed to the left of the child column (e.g. column containing department names should be placed to the left of column containing division names)
- Aggregate columns totaling Numeric Data from other columns should be placed to the right of columns included in aggregate total
Columns may be reordered after import. However, creating files with columns ordered in the manner above will simplify updating your data in the future.
Datasets with a horizontal data orientation should be restructured to vertical whenever feasible. Presenting, consuming, using and refreshing the data becomes more difficult as data files get wider with numerous columns. Datasets with a vertical data orientation are more understandable, filterable and sortable, and are much, much easier to update. They can also be more easily rolled up and aggregated. Vertical data orientation better supports creating visualizations on the open data portal, and using any number of third party business intelligence and/or data analysis visualization tools to perform analytics, identify patterns and trends over time.
Horizontal Data Orientation
|Iowa State University||2015||30034||5096||584||287||36001|
Vertical Data Orientation
|Iowa State University||2015||Undergraduate||30034|
|Iowa State University||2015||Graduate||5096|
|Iowa State University||2015||Professional||584|
|Iowa State University||2015||Post-Doctoral||287|
 Making the most granular table/query the primary table will often facilitate creating 1 to 1 joins, where one record in the primary table/query matches to one record in the joined table/query; or Many to 1 joins, where many records in the primary table/query matches to one record in the joined table/query.
|Program Area|| |
Open Data, Datasets, Relevency, Importing Data