The suitability of formats for Linked Open Data


Authors

Thijs Brentjens (Geonovum)

Marcel Reuvers (Geonovum)

Clemens Portele (Interactive Instruments)

 

When publishing or using Linked Open Data, one encounters several technical aspects. An important one is the format in which the data is published. The W3C definition of linked data has a strong preference for RDF for encoding information about things. However, for many users it is easier to publish and/or process other formats. This paper deals with a wide range of formats and encodings, such as PDF, JSON, RDF, CSV, GML and metadata formats, and their suitability as a format for linked open data. Suitability is expressed in terms of machine and human readability, the ability to structure data and the openness of a format. In addition, a format must allow linking in both directions: from the data to other data, and from other data to the data. This paper covers these aspects to provide an overview of the suitability of commonly used encodings and formats for Linked Open Data.

 

The W3C definition of linked data has a strong preference for RDF for encoding information about things. However, for many users it is easier to process other formats, e.g. CSV, JSON or XML-based grammars. Therefore, it is usually advisable to publish data in other formats, too. This technical paper gives an overview of relevant formats and general recommendations about their applicability in the context of Linked Open Data, with special attention to their capability to represent links.

 

Five star model

Over time, openness of the data was identified as a key element. In 2010 Berners-Lee introduced a five star rating for Linked Open Data [Berners-Lee]:

 

  • 1 star: available on the web (whatever format), but with an open license (Open Data);
  • 2 stars: all of the above, plus available as machine-readable structured data (e.g. Excel instead of an image scan of a table);
  • 3 stars: all of the above, plus in a non-proprietary format (e.g. CSV instead of Excel);
  • 4 stars: all of the above, plus use of open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff;
  • 5 stars: all of the above, plus links from your data to other people’s data to provide context.

 

This five star model describes some requirements and recommendations on data formats and APIs for linked open data. The rest of this chapter deals with these recommendations, to facilitate the choice of data formats when publishing data as linked open data.

 

Suitability of a format

W3C and the star model of Tim Berners-Lee mention RDF as the framework to use for Linked Data. However, formats other than the most used RDF encodings might be easier to publish data in and might be suitable as well. For example, URIs can be used in several formats to name things, and to create links between objects.

 

In the context of linked data, there are some considerations when choosing a data format (or multiple formats) for publishing. Looking at the star model and considering the fact that data is published on the web, the format should ideally (at least):

  • be machine readable;
  • allow for publishing structured data;
  • be non-proprietary, so humans and machines can read the data with the software of their choice;
  • be easy to read or process, using common and easily available (web) tools, browsers and technologies;
  • allow for expressing links: both the link itself and information on the link (semantics), e.g. the type or role of a link (see http://www.iana.org/assignments/link-relations/link-relations.xml) and a human readable title;
  • allow for linking to the data and/or objects, so a format should allow identifying objects / properties, preferably directly with URIs.

 

This paper explores some commonly used formats for publishing data as linked data.

 

General purpose data formats

Structured data is often published on the web using XML formats, ATOM/RSS, spreadsheets or JSON. For the Linked Open Data pilot it is worth discussing these and other commonly used formats, as well as RDF in both its XML and its Turtle (TTL) encoding.

 

 

Raw data in tabular formats: spreadsheets and CSV

If data is available in tables, spreadsheets and CSV (Comma Separated Values) are often used to publish that (raw) data on the web. For many data publishers, uploading a Microsoft Excel spreadsheet to some website is a first and easy step. While this is easy to do and spreadsheets are flexible and powerful, the main disadvantage is that the format is proprietary and requires special (heavy) software to read it, namely office software.

 

If just the raw data is published as CSV, many tools can read or process the data, varying from plain text editors, to lightweight software libraries, to end-user applications. A disadvantage (albeit for the slightly more advanced use cases) is that it is hard to validate spreadsheets and CSV data.

 

Using links is basic: links that point to other resources could be stored as values of ‘cells’ too, but this is limited to providing the link itself. There is no standard way of providing semantics on the link. Linking to the data in the document is limited to linking to the location of the entire document. Linking to individual objects (rows) or properties (cells) is not supported by these formats.
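
To illustrate the point about links as cell values, here is a minimal Python sketch using only the standard library. The sample data and the column names (id, name, website) are assumptions made for this illustration; they do not come from any real dataset.

```python
import csv
import io

# Hypothetical sample data: the "website" column holds a link as a plain cell value.
CSV_TEXT = """id,name,website
1,Geonovum,http://www.geonovum.nl
2,W3C,http://www.w3.org
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(CSV_TEXT)
for row in rows:
    # The link is just a string; nothing in the file states what kind of
    # link it is, and there is no way to address this row from outside.
    print(row["name"], "->", row["website"])
```

A consumer has to know from documentation or convention that the third column contains links; the format itself carries no such information.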

 

XML

There are many XML-based data formats around. XML is designed to be machine readable and to allow publishing of data in a structured way. The structured character of XML can be seen in its syntactic rules and in XML schemas. Domain specific languages make use of this, for example KML and GML for geospatial data.

 

There are lots of software packages and libraries to deal with XML. Often, some domain specific languages are supported. Browsers and text editors can open (non-binary) XML documents directly, although making it easily readable for humans requires some processing or understanding of the format. XSLT is a language to transform XML data to other formats, for example to present it as HTML in a web browser.

 

A powerful way of describing links in XML is XLink. XLink defines XML attributes for describing a link and (semantic) information on that link, to describe what the link means and how to deal with it, for example the attributes xlink:href, xlink:type and xlink:role. However, support for XLink in XML grammars and tools is limited.
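
As a hedged illustration of how XLink attributes can be read, the following Python sketch collects every element that carries an xlink:href. The document, its element names and the example.org URLs are made up for this example; only the XLink namespace and attribute names are taken from the XLink specification.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical document; element names and URLs are made up for illustration.
XML_TEXT = """<building xmlns:xlink="http://www.w3.org/1999/xlink">
  <owner xlink:type="simple"
         xlink:href="http://example.org/id/organisation/42"
         xlink:role="http://example.org/def/role/owner"
         xlink:title="Owner of the building"/>
</building>"""


def xlink(name):
    """Return the namespace-qualified attribute name used by ElementTree."""
    return "{%s}%s" % (XLINK_NS, name)


def extract_xlinks(xml_text):
    """Collect the link and its semantics for every element with xlink:href."""
    links = []
    for element in ET.fromstring(xml_text).iter():
        href = element.get(xlink("href"))
        if href is not None:
            links.append({
                "element": element.tag,
                "href": href,
                "type": element.get(xlink("type")),
                "role": element.get(xlink("role")),
                "title": element.get(xlink("title")),
            })
    return links


links = extract_xlinks(XML_TEXT)
for link in links:
    print(link)
```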

 

If an XML document is available on the web, linking to a fragment of the data in the document can be done using the standard XML ID attribute. This enables linking to objects, encoded in XML.
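
A minimal sketch of how a client might resolve such a fragment link, assuming the document marks its objects with the xml:id attribute. The URL, the element names and the identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document that would be served at http://example.org/data/roads.xml
XML_TEXT = """<roads>
  <road xml:id="road-1"><name>Main Street</name></road>
  <road xml:id="road-2"><name>Station Road</name></road>
</roads>"""


def resolve_fragment(link, xml_text):
    """Return the element addressed by the fragment of `link`, or None."""
    _, fragment = urldefrag(link)
    if not fragment:
        return None
    for element in ET.fromstring(xml_text).iter():
        if element.get(XML_ID) == fragment:
            return element
    return None


road = resolve_fragment("http://example.org/data/roads.xml#road-2", XML_TEXT)
print(road.find("name").text)
```

In a real client the document would first be retrieved from the part of the URL before the "#"; that step is left out here to keep the sketch self-contained.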

 

Publishing data with ATOM

ATOM is an XML format to syndicate resources on the web. This is also known as a web feed. These feeds contain descriptive elements of resources and links to them and other relevant information. ATOM is used, for example, for syndication of news or blog posts. The data the feed points to could be in another format.

 

Support for ATOM feeds in software libraries is available and many web browsers have native support for ATOM. They can read (parts of) an ATOM-document and display them in a human-friendly way.

 

ATOM has extensive capabilities for (typed) links. These links can be used to point to other resources, metadata or other feeds. The semantics of the links can be expressed, for example: the media type the link points to, the relation type and a human-friendly title.
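
The typed links described above can be read with a few lines of Python. This is a sketch under assumptions: the feed, its entry and all example.org URLs are invented for the illustration; the namespace and the attribute names rel, type, title and href come from the ATOM specification.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed with one entry that carries two typed links.
FEED_TEXT = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example datasets</title>
  <id>http://example.org/feeds/datasets</id>
  <updated>2013-06-07T00:00:00Z</updated>
  <entry>
    <title>Roads</title>
    <id>http://example.org/id/dataset/roads</id>
    <updated>2013-06-07T00:00:00Z</updated>
    <link rel="alternate" type="text/html" title="Description of the dataset"
          href="http://example.org/doc/dataset/roads"/>
    <link rel="enclosure" type="text/csv" title="Download as CSV"
          href="http://example.org/data/roads.csv"/>
  </entry>
</feed>"""


def entry_links(feed_text):
    """Map each entry identifier to its (rel, type, title, href) links."""
    result = {}
    for entry in ET.fromstring(feed_text).findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        result[entry_id] = [
            # A link without a rel attribute is treated as "alternate".
            (link.get("rel", "alternate"), link.get("type"),
             link.get("title"), link.get("href"))
            for link in entry.findall(ATOM + "link")
        ]
    return result


atom_links = entry_links(FEED_TEXT)
for entry_id, found in atom_links.items():
    print(entry_id, found)
```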

 

A feed can be linked to by using its identifier, whose content is an IRI (an internationalized form of a URI). The entries in an ATOM feed have an identifier as well. Since this identifier is an IRI/URI, it is possible to link to an entry. These identifiers are not necessarily resolvable (with HTTP).

 

RDF: XML format and Turtle

In the context of Linked Data, RDF is not to be left out. In itself RDF is not a format, but a framework to model information. RDF uses triples of a subject, predicate and object to model information on the web. These triples may use URIs to identify all three components.

 

[Figure: circle, arrow, circle, representing subject, predicate and object]

 

Figure 1: RDF triple

 

In fact, RDF heavily relies on URIs for referring to and from other information. Also, the characterization of the predicate is referenced using URIs. RDF uses URI-based vocabularies to do so. In other words: using RDF allows for linking to and from other data. In this respect, RDF is a very suitable format for publishing linked data on the web.
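
The triple model can be sketched without any RDF library by treating a triple as a tuple of three URIs. In the Python sketch below the example.org URIs are hypothetical; the two predicates are terms from the Dublin Core terms vocabulary. Writing each URI between angle brackets and ending the statement with a dot gives lines that are valid Turtle.

```python
# Hypothetical subject and object URIs; the predicates are Dublin Core terms.
triples = [
    ("http://example.org/id/dataset/roads",
     "http://purl.org/dc/terms/publisher",
     "http://example.org/id/organisation/42"),
    ("http://example.org/id/dataset/roads",
     "http://purl.org/dc/terms/spatial",
     "http://example.org/id/municipality/amersfoort"),
]


def to_lines(triples):
    """Serialize URI-only triples, one statement per line."""
    return ["<%s> <%s> <%s> ." % triple for triple in triples]


def links_from(subject, triples):
    """All (predicate, object) pairs that the given subject links to."""
    return [(p, o) for s, p, o in triples if s == subject]


def links_to(obj, triples):
    """All (subject, predicate) pairs that link to the given object."""
    return [(s, p) for s, p, o in triples if o == obj]


for line in to_lines(triples):
    print(line)
```

Because subject, predicate and object are all URIs, the same data supports following links outwards (links_from) and finding who points at a resource (links_to).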

 

SKOS (Simple Knowledge Organization System) provides a standardized way to represent knowledge organization systems with RDF. SKOS aims to facilitate the sharing and linking of data. SKOS defines a data model for this and uses RDF.

 

There are several RDF encodings, most notably an RDF/XML (http://www.w3.org/TR/rdf-syntax-grammar/) encoding and Turtle (http://www.w3.org/TR/turtle/).

 

There is support in tooling for both RDF/XML and Turtle, mostly for creating RDF and for processing it using software libraries. Using common XML-tools and technologies, RDF/XML can be processed. Turtle requires some specific ‘parsers’. Support for RDF in web browsers to present it to users is very limited, mostly to default XML displaying, which makes it difficult to use for end-users. Support in clients (both general and spatially enabled clients) is very limited. Compared to other formats like CSV and ATOM and some geospatial formats as will be discussed later, RDF is harder to use at the moment.

 

For validation and testing, some tooling is available. For example, the W3C RDF validator checks RDF/XML documents: http://www.w3.org/RDF/Validator/.

 

JSON

Many websites and APIs use JSON to represent data. JSON is a text format that can be easily read (parsed) by machines. It is often used in web applications and preferred over XML (for example), since it uses an encoding similar to JavaScript, that web browsers can parse directly. Many libraries offer support for reading / processing and writing JSON data. Just as with XML, plain text editors and web browsers can display the data in raw form, but interpreting the data requires some knowledge of the format. There are some plug-ins and tools that display JSON in a more user-friendly way, but in general support in clients that end-users use is limited.

 

JSON does not have a specific type to express links or triples. This is being discussed in the web community (for example http://bit.ly/rxDIiE) and some draft mechanisms and formats have been defined (like JSON-LD or HAL), but none of them prevails at the moment. So there is currently no widely accepted or standardized way to express links and link semantics in JSON. Identifying objects in a JSON document is not standardized either. JSON as such is a suitable format to publish data in, so that others (applications) can consume it, but for linked open data JSON is less suitable, at least for now. This will likely change due to the popularity of JSON; for example, Facebook provides access to their linked data via their Graph API in both JSON and Turtle.
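
The absence of a link type in plain JSON can be shown with a short Python sketch. The document and its keys are hypothetical, and the prefix check in guess_links is an ad hoc heuristic invented for this illustration, not a standard mechanism.

```python
import json

# Hypothetical document: "publisher" is meant as a link, "comment" is not.
JSON_TEXT = """{
  "name": "Roads",
  "publisher": "http://example.org/id/organisation/42",
  "comment": "See http://example.org for details"
}"""


def guess_links(document):
    """Heuristically pick string values that look like HTTP(S) URIs."""
    return {
        key: value
        for key, value in document.items()
        if isinstance(value, str) and value.startswith(("http://", "https://"))
    }


document = json.loads(JSON_TEXT)
# After parsing, a link is an ordinary string; the parser gives no hint
# that it is a link, let alone what the link means.
print(type(document["publisher"]).__name__)
print(guess_links(document))
```

A consumer is left to guess, or to rely on out-of-band documentation, which is exactly the gap that conventions for links in JSON try to close.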

 

Documents

Documents as intended in this section are generally not used to publish raw, structured data, for further processing (by machines), but provide a more human-friendly presentation of information.


 

(X)HTML

HTML is the markup language of the web. Needless to say, HTML is one of the most common formats on the web for publishing (human readable) information. For publishing raw data, HTML is generally less suitable. HTML itself can link to other documents and to parts of documents, for example to a specific section in a document using anchors (#). Expressing some link semantics in XHTML, like the media type and role, is possible.
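
A minimal sketch of how a machine can read links and their semantics from HTML, using the parser from the Python standard library. The page content, file names and URLs are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page with a typed link in the head, an anchor link to a
# section of the same document and a link with a relation type.
HTML_TEXT = """<html><head>
<link rel="alternate" type="application/rdf+xml" href="roads.rdf">
</head><body>
<p>See the <a href="#section-2">second section</a> and the
<a rel="license" href="http://example.org/license">license</a>.</p>
<h2 id="section-2">Details</h2>
</body></html>"""


class LinkCollector(HTMLParser):
    """Collect href, rel and type of every <a> and <link> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            attributes = dict(attrs)
            if "href" in attributes:
                self.links.append({
                    "href": attributes["href"],
                    "rel": attributes.get("rel"),
                    "type": attributes.get("type"),
                })


collector = LinkCollector()
collector.feed(HTML_TEXT)
for link in collector.links:
    print(link)
```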

 

If used properly, HTML can be processed by machines quite well. Search engines for example parse HTML.

All kinds of tools and software libraries support HTML. For publishing documents in the context of linked data, HTML is suitable.

 

When using XHTML, one can add metadata about the document with RDF, and provide links to other resources, for example to provide information about the author, the subjects of the document and the sources used. This is discussed at the web page http://www.w3.org/MarkUp/2004/02/xhtml-rdf.html.

 

PDF

For documents on the web, PDF is a popular format. PDF started as a proprietary format, but has been an open standard since 2008. Most users are able to open a PDF document directly, for example in their browser or using a viewer / reader. PDF documents are often used to publish a read-only document. The formatting of the document is preserved when opening. PDF is not designed to provide raw, structured data.

 

Creating or saving documents as PDF can be done with all kinds of software and software libraries. Some libraries support the processing of PDF as well.

 

Linking to a PDF document is easy, but it is not possible to link to fragments of a PDF document. PDF documents may contain links to resources on the web. Link semantics are hard to describe. PDF is a suitable format for publishing a document on the web for humans, but for publishing raw data and if advanced linking capabilities are required it is better not to use it.

 

Microsoft Word

Publishing a document in Microsoft Word is not preferred, since this format requires additional software (office software or plug-ins) and is a proprietary format. It is harder to process data from the documents automatically. The linking capabilities are comparable to PDF.

 

Images and graphics

 

Image formats

In the context of linked data, image formats like PNG, JPEG and GIF can be treated similarly. Linking to them is done in the regular way, and so is displaying them: web browsers and many other tools can deal with these formats. For further processing of the data they are less suitable, because the raw data is hard to extract from the images. Linking to data in the images and linking to other information from the image is not directly possible. So for publishing data that requires some linking capabilities, these formats are not suitable.

 

SVG

SVG is an XML format for 2D vector and raster graphics. Linking to an SVG document is done the same as with images: just point to the URL of the SVG-file. However, SVG also supports fragment identifiers to point to parts of the SVG file or objects. SVG objects themselves may contain an HTML-link. For presenting data with links and pointing to graphical objects (in a file), SVG could be a suitable format.
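
The two linking directions of SVG can be sketched in Python as follows. The drawing, the identifiers and the URLs are hypothetical, and the sketch assumes that links are written with xlink:href on an SVG a element, as in SVG 1.1.

```python
import xml.etree.ElementTree as ET

SVG = "{http://www.w3.org/2000/svg}"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical drawing: one object wrapped in a link, one plain object.
SVG_TEXT = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink" width="100" height="100">
  <a xlink:href="http://example.org/id/building/1">
    <rect id="building-1" x="10" y="10" width="30" height="30"/>
  </a>
  <circle id="tree-7" cx="70" cy="70" r="10"/>
</svg>"""


def describe(svg_text, base_url):
    """Return (URLs that address objects in the file, links leaving the file)."""
    root = ET.fromstring(svg_text)
    # Every element with an id can be addressed with a fragment identifier.
    targets = [base_url + "#" + el.get("id") for el in root.iter() if el.get("id")]
    # Every <a> element links from a graphical object to another resource.
    outgoing = [a.get(XLINK_HREF) for a in root.iter(SVG + "a")]
    return targets, outgoing


targets, outgoing = describe(SVG_TEXT, "http://example.org/map.svg")
print(targets)
print(outgoing)
```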

 


Some browsers have native support for SVG, but others require a plug-in. As such, publishing data in SVG is less suitable in the context of Linked Open Data.

 

Geographic information

This section focuses on data formats for geographic information. When discussing client capabilities, these are considered to have capabilities to process geographic data.


GML

Looking at the considerations for suitability for publishing linked (open) data, GML fulfills many of them. The format is an open standard, it's readable (since it is XML) and provides the ability to link to other information.

 

GML's model bears similarities with RDF's model. GML uses XLink, so that it is possible to reference remote features and properties in a GML document. In other words: it is possible to link to information in other GML documents. Describing link semantics like role and title is possible as well. GML thus offers extensive linking capabilities.
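
As a sketch of how a client might handle these references, the Python code below separates references to objects in the same document from references to objects in other documents. The application schema namespace (http://example.org/schema), the property names and the URLs are assumptions for this illustration; the GML 3.2 and XLink namespaces are the standard ones.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XLINK_TITLE = "{http://www.w3.org/1999/xlink}title"

# Hypothetical feature with one remote and one local reference.
GML_TEXT = """<ex:Building xmlns:ex="http://example.org/schema"
    xmlns:gml="http://www.opengis.net/gml/3.2"
    xmlns:xlink="http://www.w3.org/1999/xlink" gml:id="building-1">
  <ex:address xlink:href="http://example.org/data/addresses.gml#address-9"
              xlink:title="Main address"/>
  <ex:partOf xlink:href="#block-3"/>
</ex:Building>"""


def classify_references(gml_text):
    """List every XLink reference and whether it points to another document."""
    references = []
    for element in ET.fromstring(gml_text).iter():
        href = element.get(XLINK_HREF)
        if href is None:
            continue
        document, fragment = urldefrag(href)
        references.append({
            "property": element.tag,
            "remote": bool(document),  # empty document part: same file
            "document": document,
            "object_id": fragment,     # matches a gml:id in the target
            "title": element.get(XLINK_TITLE),
        })
    return references


references = classify_references(GML_TEXT)
for reference in references:
    print(reference)
```

Resolving a remote reference would additionally require retrieving the target document, which is the step that many clients do not support.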

 

Support for GML in clients differs and is mainly found in the geospatial domain. Many desktop clients support GML to some extent, for example reading and writing Simple Features GML. Native support for complex GML and for resolving links is limited in clients. Geospatial software libraries and conversion tooling support GML as well, often simple features only, sometimes more. Some libraries support resolving XLinks. Support for complex GML is less widespread.

 

OGC offers a GML validator as part of their compliance testing suite, currently in beta testing, at http://bit.ly/ZZSNkw.

 

KML

Where GML is concerned with the data only, KML focuses on visualization of geographic information and interaction with the users. KML is designed to be used in geographic browsers like Google Earth and has been adopted by the OGC. Besides Google Earth, other freely available clients like Google Maps, as well as geospatial software and software libraries, offer (sometimes limited) support for KML.

 

KML is an XML format and has some resemblance to GML. KML is not as rich in expressing geospatial data as GML is. Objects in KML can have an XML ID, so linking to a specific object in a KML file is possible. Linking from a KML document to other information is possible (e.g. in the description of an object or from a document to another document), but the linking capabilities are not as strong as in GML.

 

For publishing data to end-users, KML is a suitable format. If more advanced linking capabilities are required, KML is not the best choice.

 

WKT/WKB geometry

To represent the geometry of geographic data in a text encoding, the Open Geospatial Consortium has defined Well-known text (WKT). Several databases support this format and its binary equivalent (Well-known binary, WKB) to transfer and store geometry. Some client software supports WKT as well. WKT and WKB are only intended for representing geometry, not (entire) objects or features. Therefore linking capabilities are not relevant for WKT/WKB geometry.
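
To show that WKT carries geometry only, here is a deliberately small Python sketch that reads a two-dimensional WKT point. It is an illustration under assumptions, not a WKT parser: it handles POINT with two plain decimal coordinates only, and the coordinates in the example are made up.

```python
import re

# Matches e.g. "POINT (5.387 52.155)"; other geometry types are not handled.
POINT_PATTERN = re.compile(
    r"\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*",
    re.IGNORECASE,
)


def parse_point(wkt):
    """Return the (x, y) coordinates of a 2D WKT point."""
    match = POINT_PATTERN.fullmatch(wkt)
    if match is None:
        raise ValueError("not a 2D WKT point: %r" % wkt)
    return float(match.group(1)), float(match.group(2))


x, y = parse_point("POINT (5.387 52.155)")
# The result is a pair of numbers: no identifier, no attributes, no links.
print(x, y)
```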

 

GeoRSS

GeoRSS is an extension to add location to regular ATOM (and RSS) feeds and entries. The linking capabilities of GeoRSS are therefore the same as with ATOM feeds, see the ATOM section before. GeoRSS support is offered by several software libraries, clients and freely available online mapping platforms, like Google Maps.

 

Shapefile

Although much used and with a publicly available description, the shapefile format formally is not an open format. In practice there is lots of support for it in GIS clients and software libraries. However, if no GIS tooling is available, data in a Shapefile is not easy to read / process, because a shapefile in fact consists of several mostly binary files.

 

Links could be provided as attribute-values of an object, but there is no native support for links and semantics of links. Linking to the file is more difficult. It is not possible to use a single URI to refer to a shapefile, unless it is a compressed / ZIP-ed file, which can't be read directly.

 

GeoJSON

GeoJSON is a format for encoding a variety of geographic data structures in JSON. GeoJSON focuses on the spatial properties of the data. See the section on JSON for general characteristics of JSON, like the linking capabilities.
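
A minimal Python sketch of reading a GeoJSON feature. The feature, its coordinates, its property names and the example.org URL are invented for this illustration.

```python
import json

# Hypothetical feature; coordinates and properties are made up.
GEOJSON_TEXT = """{
  "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [5.387, 52.155]},
  "properties": {
    "name": "Example office",
    "website": "http://example.org/id/office/1"
  }
}"""


def summarise(feature):
    """Return the geometry type, the coordinates and the properties."""
    geometry = feature["geometry"]
    return geometry["type"], geometry["coordinates"], feature["properties"]


feature = json.loads(GEOJSON_TEXT)
geometry_type, coordinates, properties = summarise(feature)
print(geometry_type, coordinates)
# The spatial part is structured; the link in "website" is an ordinary
# string property, just as in plain JSON.
print(properties["website"])
```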

 

Several software libraries and APIs (like Google Maps, Bing Maps) offer direct support for reading and/or writing GeoJSON. Some geospatial ETL software and desktop clients are able to consume GeoJSON.

 

GeoServices JSON

GeoServices JSON is a specification of Esri, for formatting all kinds of objects in JSON, including geospatial features. It is part of the GeoServices REST API. The GeoServices REST API currently is supported by Esri's ArcGIS software including the online platform ArcGIS Online as well as other geospatial client and server software.

 

Similar to GeoJSON, GeoServices JSON has no additional linking capabilities, so for the linking capabilities see the section on JSON.

 

Metadata

Metadata are used to provide information on data and/or the structure of data. For linked data and the formats dealt with in this document, the following sections describe relevant standards / specifications for providing metadata.

 

Dublin Core

The Dublin Core metadata element set defines general metadata elements, for example title, subject, description, publisher, date and rights. Dublin Core metadata can be used for multiple purposes. There are syntaxes for RDF, HTML, an XML format and plain text. Dublin Core is often embedded in ATOM and RDF documents to annotate them with metadata.
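
As an illustration of the XML syntax, the Python sketch below reads Dublin Core elements from a small record. The wrapper element (metadata) and the values are hypothetical; the namespace is the one of the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical record; the wrapper element and the values are made up.
RECORD_TEXT = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Roads</dc:title>
  <dc:publisher>Example organisation</dc:publisher>
  <dc:date>2013-06-07</dc:date>
  <dc:rights>http://example.org/license</dc:rights>
</metadata>"""


def dublin_core(record_text):
    """Return the Dublin Core elements of a record as a dictionary."""
    return {
        child.tag[len(DC):]: child.text
        for child in ET.fromstring(record_text)
        if child.tag.startswith(DC)
    }


record = dublin_core(RECORD_TEXT)
print(record)
```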

 

Standardization organizations like IETF and ISO use Dublin Core. In the Netherlands OWMS (the Overheid.nl Web Metadata Standard) is based on Dublin Core, for metadata on information of Dutch governments on the internet.

 

VoID

The Vocabulary of Interlinked Datasets (VoID) is concerned with metadata about RDF datasets (http://www.w3.org/TR/void/). It is an RDF Schema vocabulary that provides terms and patterns for describing RDF datasets, and is intended as a bridge between the publishers and users of RDF data. VoID descriptions can be used in many situations, ranging from data discovery to cataloging and archiving of datasets, but most importantly they help users find the right data for their tasks.

 

VoID covers four areas of metadata:

  • General metadata, following the Dublin Core model;
  • Access metadata describes how RDF data can be accessed using various protocols, e.g. where to find a SPARQL endpoint or data dumps;
  • Structural metadata describes the structure and schema of datasets and is useful for tasks such as querying and data integration, e.g. metadata on the URI pattern, examples, used vocabularies, statistics;
  • Descriptions of links between datasets are helpful for understanding how multiple datasets are related and can be used together.

VoID leverages Dublin Core and FOAF vocabularies and is used by several providers of RDF datasets.
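
The four areas can be made concrete with a rough Python sketch that lists a few statements per area. The dataset, linkset and endpoint URIs under example.org are hypothetical, and the grouping into areas is added here for illustration only; the predicates are terms from the VoID and Dublin Core terms vocabularies.

```python
VOID = "http://rdfs.org/ns/void#"
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical dataset and linkset URIs.
DATASET = "http://example.org/id/dataset/roads"
LINKSET = "http://example.org/id/linkset/roads-municipalities"

# Each statement: (area, subject, predicate, value).
description = [
    ("general", DATASET, DCTERMS + "title", "Roads"),
    ("access", DATASET, VOID + "sparqlEndpoint", "http://example.org/sparql"),
    ("access", DATASET, VOID + "dataDump", "http://example.org/dumps/roads.ttl"),
    ("structural", DATASET, VOID + "uriSpace", "http://example.org/id/road/"),
    ("structural", DATASET, VOID + "vocabulary", DCTERMS),
    ("links", LINKSET, VOID + "target", DATASET),
    ("links", LINKSET, VOID + "target", "http://example.org/id/dataset/municipalities"),
    ("links", LINKSET, VOID + "linkPredicate", DCTERMS + "spatial"),
]


def by_area(description):
    """Group the statements of a description by metadata area."""
    grouped = {}
    for area, subject, predicate, value in description:
        grouped.setdefault(area, []).append((subject, predicate, value))
    return grouped


grouped = by_area(description)
for area, statements in grouped.items():
    print(area, len(statements))
```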

 

Microdata, Microformats, RDFa

Metadata can be provided as separate, standalone documents (accompanying or referring to the document / resource somehow) or in documents themselves. For (X)HTML and XML documents a commonly used approach is tagging (annotating) the documents with metadata elements. Microdata, microformats and RDFa are specifications to accomplish this.

 

They are often considered relatively simple mechanisms to provide metadata and better semantics to documents. They define the usage of certain attributes, with standardized values and elements in (X)HTML and XML, to embed (meta)data on commonly published things like people, addresses or blog posts directly in the document.
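
A minimal sketch of how such annotations can be extracted, here for Microdata and using the HTML parser from the Python standard library. The page fragment, the person and the URL are hypothetical, and the collector handles flat, non-nested properties only.

```python
from html.parser import HTMLParser

# Hypothetical fragment annotated with Microdata attributes.
HTML_TEXT = """<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Example</span>
  <a itemprop="url" href="http://example.org/people/jane">Home page</a>
</div>"""


class ItempropCollector(HTMLParser):
    """Collect itemprop values: the href for links, otherwise the text."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("itemprop")
        if prop is None:
            return
        if "href" in attributes:
            # For link elements the property value is the link target.
            self.properties[prop] = attributes["href"]
        else:
            self._current = prop

    def handle_data(self, data):
        if self._current and data.strip():
            self.properties[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


collector = ItempropCollector()
collector.feed(HTML_TEXT)
print(collector.properties)
```

The visible content of the page stays the same; the attributes only tell a machine which parts are a name and a URL.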

 

The contents of the document generally don't change. The formats only add extra attributes and elements to annotate parts of the document as being (meta)data, making them more machine-readable. Search engines often index this information (see for example http://support.google.com/webmasters/bin/answer.py?hl=en&answer=99170) and browsers can extract it to provide extra or better functionality.

 

ISO 19115

ISO 19115 defines the schema required for describing geographic information and services (source: ISO 19115:2003). It provides information about the identification, the extent, the quality, the spatial and temporal schema, spatial reference, and distribution of digital geographic data.

ISO 19115 metadata is used to describe datasets for discovery, for evaluation or in order to access the geographic data.

 

There is an XML encoding for ISO 19115. Metadata on geographic data are often provided as standalone documents for datasets (for example available in a catalogue service / metadata repository) or embedded in a GML feature for feature-level metadata.

 

References

Berners-Lee, Tim. Linked Data, 2006. Retrieved June 7, 2013, from http://www.w3.org/DesignIssues/LinkedData.html