Who publishes data?
If you’ve ever shared a spreadsheet or uploaded a photo and added a caption to it, then you have published data. Anyone who shares information online is publishing data. The more valuable the data are to you, and the more complicated or nuanced your messages, the more involved this process becomes.
The World Bank publishes data about international development. Public libraries and museums publish data about their collections and catalogs. Manufacturers publish data about their products. Police departments publish data about crimes. Within a large company, one group such as HR will publish data for other groups to consume.
How do we publish data now?
Problems with the existing solutions
The problems with using spreadsheets, custom databases, or Semantic Web technologies to publish data fall roughly into three categories: structural, behavioral, and security.
We use data to represent people, places, things, and concepts. Those representations manifest as a mix of structured information (e.g. names, dates, locations, numbers) and file content (images, videos, text files, audio). The natural way to represent this information is as a network of things: a network of entities that have any number of characteristics and any number of relationships to other entities.
Spreadsheets and relational databases prevent this kind of representation because they force you to map information into either grids of rows, columns, and formulae (spreadsheets) or tables, rows, and foreign keys (relational databases). By contrast, Semantic Web technologies embrace the notion of data as a network, but they tend to become bogged down in the idiosyncrasies of formally representing networks of things, which prevents fluid, natural expression and exchange of information.
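To make the idea of data as a network of entities concrete, here is a minimal Python sketch. It is only an illustration, not how DataBindery or any particular tool stores data: the `Entity` class, the `relate` and `neighbors` helpers, and all of the sample values (the museum, the painting, the artist) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A person, place, thing, or concept in the network (hypothetical model)."""
    id: str
    properties: dict = field(default_factory=dict)  # structured information
    files: list = field(default_factory=list)       # attached file content (paths or URLs)
    relations: list = field(default_factory=list)   # (relation name, target entity id) pairs

    def relate(self, name, other):
        """Record a named relationship from this entity to another one."""
        self.relations.append((name, other.id))


def neighbors(entity, relation):
    """Return the ids of all entities linked from `entity` by `relation`."""
    return [target for name, target in entity.relations if name == relation]


# Made-up sample data: each entity carries whatever characteristics it needs.
museum = Entity("museum-1", {"name": "City Museum", "location": "Springfield"})
artist = Entity("person-1", {"name": "A. Painter"})
painting = Entity(
    "painting-1",
    {"title": "Harbor at Dawn", "date": "1891"},
    files=["harbor.jpg"],
)

# Relationships are added freely, without a fixed table schema.
painting.relate("created_by", artist)
painting.relate("held_by", museum)
```

Each entity can have any number of properties, files, and relationships, so there is no need to decide up front on a grid of columns or a set of foreign keys.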
Most collections of data change over time, and multiple contributors make those changes. In many cases there are also divergent perspectives about what is valid, accurate, or relevant in a given collection of data. To curate data, you therefore need three things: version control, which tracks what changes happened and when; provenance tracking, which records where information came from; and, ideally, forking, which lets multiple divergent versions of the same data exist at the same time without confusing provenance.
None of the existing ways of publishing data meets these behavioral needs natively. You have to work around the gap in how you use the technologies: naming spreadsheet versions, creating documents or databases to track edits, and so on.
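The three behavioral needs described above (version control, provenance, and forking) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not DataBindery's implementation: the `Version` class, its `commit` and `history` methods, and the sample authors and sources are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Version:
    """One snapshot of a record, linked to the snapshot it was derived from."""
    data: dict
    author: str                          # provenance: who made the change
    source: str                          # provenance: where the information came from
    parent: Optional["Version"] = None   # version control: the previous snapshot
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)  # when the change happened
    )

    def commit(self, changes, author, source):
        """Return a new version with `changes` applied; the original is untouched."""
        return Version({**self.data, **changes}, author, source, parent=self)

    def history(self):
        """List (author, source) pairs from this version back to the root."""
        node, trail = self, []
        while node is not None:
            trail.append((node.author, node.source))
            node = node.parent
        return trail


# Made-up sample data: two contributors disagree about a date.
root = Version({"title": "Harbor at Dawn", "date": "1891"}, "alice", "museum catalog")

# Forking: both versions derive from the same parent and coexist.
fork_a = root.commit({"date": "1892"}, "bob", "auction record")
fork_b = root.commit({"date": "circa 1890"}, "carol", "exhibition notes")
```

Because every commit produces a new snapshot that points at its parent, divergent versions can live side by side, and each one keeps its own trail of who changed what and on what basis.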
How DataBindery solves these problems
Visit our features list for a full explanation of how DataBindery uses a novel combination of linked data, version control, access controls, and NoSQL search to provide the tools that data curators need to publish their data.