Provenance

Provenance is a kind of metadata that tracks the steps by which data was derived, and it can add significant value in data-intensive scenarios [1]. Data provenance describes the derivation history of a data product, starting from its original sources. It is a collective term for all aspects related to the traceability, responsibility, auditability, accountability and accuracy of data. Provenance gives an important indication of the reliability of data and is therefore particularly important for the re-use of Linked Data. Linking and combining different data sets, which may even involve editing them, strongly affects the reliability of the resulting data sets. The W3C has standardized the PROV vocabulary, which can be used to express provenance metadata in Linked Data.

The W3C PROV standard defines provenance as information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. The goal of PROV is to enable the wide publication and interchange of provenance on the Web and in other information systems. PROV makes it possible to represent and interchange provenance information using widely available formats such as RDF and XML. In addition, it provides definitions for accessing provenance information, validating it, and mapping it to Dublin Core.
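
As an illustration of the entity, activity and agent model described above, the following minimal Python sketch records how one data set was derived from another and serializes the statements as N-Triples, a line-based RDF format. The property and class names come from the PROV namespace (http://www.w3.org/ns/prov#); the example.org identifiers, the function names and the merge scenario are assumptions made purely for illustration, not part of any standard or tool.

```python
# Minimal sketch: PROV statements as plain triples, serialized as N-Triples.
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace, for illustration only


def build_provenance():
    """Return triples describing a merged data set derived from one source."""
    merged = EX + "dataset/merged"   # hypothetical derived data set
    source = EX + "dataset/source"   # hypothetical original data set
    merge = EX + "activity/merge"    # hypothetical merge step
    editor = EX + "agent/editor"     # hypothetical responsible person
    return [
        (merged, RDF_TYPE, PROV + "Entity"),
        (source, RDF_TYPE, PROV + "Entity"),
        (merge, RDF_TYPE, PROV + "Activity"),
        (editor, RDF_TYPE, PROV + "Agent"),
        (merge, PROV + "used", source),
        (merged, PROV + "wasGeneratedBy", merge),
        (merged, PROV + "wasDerivedFrom", source),
        (merge, PROV + "wasAssociatedWith", editor),
        (merged, PROV + "wasAttributedTo", editor),
    ]


def to_ntriples(triples):
    """Serialize IRI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(build_provenance()))
```

In practice an RDF library would normally handle serialization and literal values such as timestamps; plain strings are used here only to keep the sketch self-contained.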

Dublin Core defines provenance as “A statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity, and interpretation”. The W3C provides a collection of literature about provenance, structured along three dimensions (content, management and use). The Open Provenance Vision is a set of architectural guidelines to support provenance interoperability, consisting of a controlled vocabulary, serialization formats and APIs.

The simplest way to use PROV is through one of the many applications, such as ProvStore, that support it. Questions that one needs to answer when describing provenance include the following:

  • Who created that content (author/attribution)?
  • Was the content ever manipulated and, if so, by what processes or entities?
  • Who is providing that content (repository)?
  • What is the timeliness of that content?
  • Can any of the answers to these questions be verified (for example by e-signatures)?
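
The questions above can be turned into a simple completeness check. The sketch below is one possible way to do this, assuming a provenance record is held as a Python dictionary keyed by property name. The pairing of each question with PROV and Dublin Core terms is an illustrative assumption rather than an official mapping, and `ex:signature` is a hypothetical placeholder, since the text does not name a property for verification.

```python
# Hypothetical mapping from each question to properties that could answer it.
QUESTION_PROPERTIES = {
    "Who created the content?": ["prov:wasAttributedTo", "dcterms:creator"],
    "Was the content manipulated, and by what?": ["prov:wasDerivedFrom", "prov:wasGeneratedBy"],
    "Who is providing the content?": ["dcterms:publisher"],
    "How timely is the content?": ["prov:generatedAtTime", "dcterms:modified"],
    "Can the answers be verified?": ["ex:signature"],  # placeholder, not a standard term
}


def unanswered_questions(record):
    """Return the questions for which the record has no non-empty property."""
    return [
        question
        for question, properties in QUESTION_PROPERTIES.items()
        if not any(record.get(prop) for prop in properties)
    ]


if __name__ == "__main__":
    # Hypothetical record for illustration only.
    record = {
        "prov:wasAttributedTo": "http://example.org/agent/editor",
        "dcterms:publisher": "http://example.org/repository",
        "prov:generatedAtTime": "2015-01-01T00:00:00Z",
    }
    for question in unanswered_questions(record):
        print("Missing:", question)
```

For the sample record, the check reports the manipulation and verification questions as unanswered, which is the kind of gap a publisher would want to close before releasing a data set.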


Metadata, and especially provenance, are essential when publishing datasets to ensure re-usability and value creation. Once the metadata has been defined and added to the dataset, the dataset can be published.

[1] Simmhan, Yogesh L., Beth Plale, and Dennis Gannon. "A survey of data provenance techniques." Computer Science Department, Indiana University, Bloomington IN 47405 (2005).
