What does it mean to publish data?

When data move across organizational boundaries, a unique but consistent set of issues arises around access controls, provenance and terms of use.

Who publishes data?

If you’ve ever shared a spreadsheet, or uploaded a photo and added a caption to it, then you have published data. Anyone who shares information online is publishing data. The more valuable the data are to you, and the more complicated or nuanced your message, the more complicated this process becomes.

The World Bank publishes data about international development. Public libraries and museums publish data about their collections and catalogs. Manufacturers publish data about their products. Police departments publish data about crimes. Within a large company, one group such as HR will publish data for other groups to consume.

In all these cases, data are moving across organizational boundaries, and in all of them the publishers grapple with the same basic issues: access controls (deciding who is allowed to read or edit the information, and which portions of it they are allowed to access or modify), provenance (knowing where information came from and whether it’s trustworthy), and terms of use (dictating how the information can or should be used).

How do we publish data now?

Currently, we primarily publish data using spreadsheets, custom databases or (rarely) Semantic Web technologies.  These are all very useful technologies and each of them addresses some aspects of data curation, but none of them gracefully supports the needs of data curators, especially when it comes to access controls, provenance and terms of use.

Problems with the existing solutions

The problems with using spreadsheets, custom databases or Semantic Web technologies to publish data fall roughly into three categories – Structural, Behavioral, and Security.

Structural

We use data to represent people, places, things and concepts. Those representations manifest as a mix of structured information (e.g. names, dates, locations, numbers) and file content (images, videos, text files, audio). The natural way to represent this information is as a network of things: a network of entities that have any number of characteristics and any number of relationships to other entities.

Spreadsheets and relational databases prevent this kind of representation because they force you to map information into either grids of rows, columns and formulae (spreadsheets) or tables, rows and foreign keys (relational databases). By contrast, Semantic Web technologies embrace the notion of data as a network, but they tend to become bogged down in the idiosyncrasies of formally representing networks of things, which prevents fluid, natural expression and exchange of information.
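To make the "network of things" idea concrete, here is a minimal sketch in Python. It is an illustration only, not DataBindery's data model: the Entity class, its field names, and the sample museum data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A person, place, thing or concept (hypothetical model for illustration)."""
    name: str
    # Structured characteristics such as names, dates, locations, numbers.
    properties: dict = field(default_factory=dict)
    # File content (images, videos, text files, audio), referenced by path.
    files: list = field(default_factory=list)
    # Relationships to other entities, stored as (label, target) pairs.
    relations: list = field(default_factory=list)

    def relate(self, label, other):
        """Add a labeled relationship from this entity to another."""
        self.relations.append((label, other))

    def related(self, label):
        """Return every entity reachable through the given relationship label."""
        return [target for (rel, target) in self.relations if rel == label]


# Made-up sample data: a museum, one item it holds, and the item's creator.
museum = Entity("City Museum", {"founded": 1901})
painting = Entity("Harbor at Dawn", {"year": 1874}, files=["harbor.jpg"])
artist = Entity("A. Painter", {"born": 1840})

museum.relate("holds", painting)
painting.relate("created_by", artist)

creator_names = [e.name for e in painting.related("created_by")]
print(creator_names)
```

Each entity carries as many characteristics and relationships as it needs, which is the flexibility that a fixed grid of rows and columns makes awkward.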

Behavioral

Most collections of data change over time, with multiple contributors making those changes. Also, in many cases there are divergent perspectives about what is valid, accurate or relevant with respect to a given collection of data. This means that in order to curate data, you need three things: version control (tracking what changes have happened and when they happened), provenance tracking (knowing where information came from), and, ideally, forking (allowing multiple divergent versions of the same data to exist at the same time without confusing provenance).

None of the existing ways of publishing data meets these behavioral needs natively. You must work around the gap in how you use the technologies: naming spreadsheet versions, creating documents or databases to track edits, and so on.
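As a rough illustration of what native support for these behaviors could look like, the sketch below models each version of a record as an immutable snapshot that remembers its author, its source, and its parent version. The Version class, its fields, and the census figures are hypothetical, and timestamps are left out to keep the example short; this is not a description of how DataBindery implements versioning.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    """One immutable snapshot of a record (hypothetical model for illustration)."""
    data: dict
    author: str   # who made the change
    source: str   # provenance: where the information came from
    parent: Optional["Version"] = None

    def commit(self, changes, author, source):
        """Return a new version built on this one; this one stays untouched."""
        return Version({**self.data, **changes}, author, source, parent=self)

    def history(self):
        """List (author, source) pairs from this version back to the first."""
        node, trail = self, []
        while node is not None:
            trail.append((node.author, node.source))
            node = node.parent
        return trail


# Made-up sample data.
v1 = Version({"population": 1000}, "alice", "2010 census")
v2 = v1.commit({"population": 1200}, "bob", "2020 census")
# A fork: a divergent version that shares v1 as its parent.
fork = v1.commit({"population": 1100}, "carol", "local survey")

print(v2.history())
print(fork.history())
```

Because every version points back to its parent, the two divergent branches can coexist while each keeps its own unambiguous provenance trail.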

Security

The final, and often the most prevalent, challenge with using these existing technologies to publish data is security. Who should be allowed to modify the data, and when can they make modifications? Which audiences can see the data? When can they have access? Which versions of the data should they have access to? Is there information that should be hidden from some audiences but shown to others?
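Two of these questions, who may edit and which fields each audience may see, can be sketched as a small policy object. This is a simplified illustration under assumed names (Policy, editors, readable, and the sample HR record are all hypothetical); it ignores time-based and version-based rules and does not describe DataBindery's actual access controls.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical access policy for one published record."""
    # Users who are allowed to modify the record.
    editors: set = field(default_factory=set)
    # Maps an audience name to the set of fields that audience may read.
    readable: dict = field(default_factory=dict)

    def can_edit(self, user):
        return user in self.editors

    def view(self, record, audience):
        """Return only the fields the given audience is allowed to see."""
        allowed = self.readable.get(audience, set())
        return {k: v for k, v in record.items() if k in allowed}


# Made-up sample data.
record = {"name": "J. Doe", "department": "HR", "salary": 50000}
policy = Policy(
    editors={"hr_admin"},
    readable={
        "hr": {"name", "department", "salary"},
        "all_staff": {"name", "department"},
    },
)

staff_view = policy.view(record, "all_staff")
print(staff_view)
```

An audience with no entry in the policy sees nothing, so information is hidden by default and shown only where it has been explicitly granted.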

How DataBindery solves these problems

Visit our features list for a full explanation of how DataBindery uses a novel combination of linked data, version control, access controls, and NoSQL search to provide the tools that Data Curators need to publish their data.