Accessing data with HTTP, URIs and links


Authors

Thijs Brentjens (Geonovum)

Marcel Reuvers (Geonovum)

Clemens Portele (Interactive Instruments)

HTTP URIs are important building blocks of linked open data on the web. An HTTP URI serves both as an identifier and as a reference to a description of the object it identifies. This paper discusses HTTP and URIs in the context of Linked Open Data from a technical point of view. Topics include HTTP's mechanisms for redirecting a client to information (using HTTP 303) and for requesting different encodings of that information using content negotiation. Web-based APIs (Application Programming Interfaces) enable users to work with data sets, for example to query data and retrieve parts of them; the paper discusses some of these APIs. A link alone does not tell humans or machines what exactly it points to, for example whether it is useful to follow that specific link. Semantics about the link help to understand it better.

URIs and HTTP play an essential role in Linked Open Data on the web. URIs can identify objects globally and allow data on the internet to be linked. This paper gives an overview of the underlying concepts and explains how they are used in practice on the web today. After discussing HTTP and URIs as identifiers, it deals with resolving URIs and retrieving data using APIs. Because users need to understand what URIs and links refer to, the paper concludes with a description of how to provide link semantics.

HTTP URIs to identify things on the web

URIs are important building blocks of linked open data, as expressed in the first two principles of Linked Data:

Use URIs as names for things
Use HTTP URIs so that people can look up those names.

The online book ‘Linked Data: Evolving the Web into a Global Data Space’ [Tom Heath and Christian Bizer, 2011] states on this:

‘[…] Linked Data uses only HTTP URIs, avoiding other URI schemes such as URNs and DOIs. HTTP URIs make good names for two reasons:

They provide a simple way to create globally unique names in a decentralized fashion, as every owner of a domain name, or delegate of the domain name owner, may create new URI references.
They serve not just as a name but also as a means of accessing information describing the identified entity.’

Publishing URIs allows others to look up information on objects and to link their own data to objects from other collections. This chapter takes a closer look at HTTP and URIs. From here on, ‘URI’ means HTTP URI unless stated otherwise.

In the Pilot Linked Open Data, a draft URI strategy for e-government data in the Netherlands is being developed. That URI strategy gives more detail on the preferred URI pattern for identifying things.

Identifiers

In order to link to something on the web, it must be identifiable. [Tom Heath and Christian Bizer, 2011] describe this as:

‘To publish data on the Web, the items in a domain of interest must first be identified. These are the things whose properties and relationships will be described in the data, and may include Web documents as well as real-world entities and abstract concepts. As Linked Data builds directly on Web architecture, the Web architecture term resource is used to refer to these things of interest, which are, in turn, identified by HTTP URIs.’

Different organisations, and different datasets within a single organisation, may describe the same real-world thing, each within its own context. For example, organisation A and organisation B may both register a building, with B recording more information on buildings than A. Both refer to the same building and could use the same URI. By using the same object identifiers in both registrations, these objects can be linked to each other. For Linked Open Data these identifiers take the form of a URI.

Closer look at URIs

A URI is a string of characters that identifies a name or a resource on the web [Uniform resource identifier]. A URI can be classified as a name (Uniform Resource Name, URN), a locator (Uniform Resource Locator, URL), or both.

HTTP URIs for an object generally consist of:

the scheme, for HTTP URIs this is ‘http’;
an authority, in practice this is a domain name and optionally a port;
a path to the object;
(optionally) a query using parameters after a question mark;
(optionally) a fragment identifier, pointing to a specific part (fragment) of the resource.

Written as a URI, the above would result in a URI like:

http://{authority}/{path}?{query}#{fragment identifier}
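
The decomposition above can be checked with a few lines of Python. This is a minimal sketch using the standard library; the query ‘lang=en’ and the fragment ‘geometry’ are made up for illustration and do not come from any real service.

```python
from urllib.parse import urlsplit

# Hypothetical URI that follows the example.com pattern used in this paper.
# The query and fragment are invented to show every component.
uri = "http://example.com/id/roads/a12?lang=en#geometry"

parts = urlsplit(uri)
components = {
    "scheme": parts.scheme,      # the part before '://'
    "authority": parts.netloc,   # domain name and optional port
    "path": parts.path,          # path to the object
    "query": parts.query,        # parameters after the question mark
    "fragment": parts.fragment,  # part after the hash sign
}

for name, value in components.items():
    print(f"{name}: {value}")
```

Note that a client resolves the fragment itself after retrieving the resource, which is why the fragment plays no role in the request examples later in this paper.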

When the fragment identifier is used to identify an object, the URI is called a hash URI. Hash URIs are not recommended when the object is part of a large collection: the fragment is not sent to the server, so the client retrieves the entire collection even though only a small part is needed. (For a discussion on hash URIs versus so-called 303 URIs, see for example the section ‘Hash versus 303’ in http://bit.ly/19kE2vk .) It is therefore recommended to use a separate URI, without a fragment, for each object. This type of URI is called a 303 URI, which is explained later. This results in URIs like http://example.com/id/roads/a12 to identify the road with number A12.

URI as a reference to a description

More than an identifier

The scheme ‘http’ suggests that the URI can be resolved to a document. Technically, it is not necessary for a URI to be resolvable. However, it is often expected by users that if a URI is simply opened in a web browser, some information is shown. URIs thus offer both a mechanism to identify a resource on the web and a means to get information on this object.

For example, if http://example.com/id/roads/a12 is the identifier of a road, then opening this URI in a web browser results in a description of that road, e.g. in XML or HTML. Techniques to resolve the URI are the subject of the next section.

Multiple representations

There can be more than one description of a resource: for example, descriptions in different formats (also called media types), or other abstractions of the same real-world phenomenon for specific applications. So besides identifying the thing, there is a need to distinguish its different descriptions.

For example, a description of a geospatial object could be available in RDF-XML, in GML (for direct use in geospatial software) and as an HTML page that provides a human-friendly description. Each of these documents is a different representation of the same object. The question arises how to ask for a specific media type.

Media types and content negotiation

HTTP offers a way to ask for a specific media type while using the same URI. The client sends extra information along with the request in an HTTP header field, the Accept header. The server can use this to return the information in the appropriate media type. For example, if a software client such as a browser asks for text/html, the server is expected to return an HTML document. If the client requests application/xml for the same URI, the response should be encoded as XML. This mechanism is called content negotiation.

For example, to ask for the HTML representation of an object identified by http://example.com/id/roads/a12 , the HTTP GET request could look like:

GET /id/roads/a12 HTTP/1.1
Host: example.com
Accept: text/html

And for a JSON representation it could be:

GET /id/roads/a12 HTTP/1.1
Host: example.com
Accept: application/json
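
The two requests above could be constructed in Python roughly as follows. This is a sketch using the standard library only; the helper name build_request is an assumption made for this example, and the requests are built but not sent, since example.com does not actually serve this resource.

```python
from urllib.request import Request


def build_request(uri: str, media_type: str) -> Request:
    """Build (but do not send) a GET request that asks for one media type."""
    return Request(uri, headers={"Accept": media_type}, method="GET")


# Same URI, two different Accept headers.
html_request = build_request("http://example.com/id/roads/a12", "text/html")
json_request = build_request("http://example.com/id/roads/a12", "application/json")

for request in (html_request, json_request):
    print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Against a real server, passing such a request to urllib.request.urlopen would send it and return the response.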

Since a server is not obligated to support all requested media types, it is advisable for clients to ask for commonly supported media types.
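
How a server might choose between the media types it supports can be sketched as follows. This is a simplified illustration, not a complete implementation of HTTP content negotiation: it honours q-values and the ‘*/*’ wildcard but ignores partial wildcards such as ‘text/*’. The function names are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media type, quality) pairs, best first."""
    preferences = []
    for item in header.split(","):
        fields = [field.strip() for field in item.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # default quality when no q parameter is given
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        preferences.append((media_type, quality))
    # sorted() is stable, so equal q-values keep the order the client used.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


def choose_media_type(accept_header: str, supported: List[str]) -> Optional[str]:
    """Return the supported media type the client prefers, or None."""
    for media_type, quality in parse_accept(accept_header):
        if quality <= 0:
            continue
        if media_type == "*/*":
            return supported[0]  # the server's own default
        if media_type in supported:
            return media_type
    return None


supported = ["text/html", "application/rdf+xml", "application/json"]
print(choose_media_type("application/json;q=0.5, application/rdf+xml", supported))
print(choose_media_type("image/png", supported))
```

When nothing matches (the None case), a server could answer with HTTP 406 (Not Acceptable) or fall back to a default representation; which of the two is preferable depends on the service.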

Media types of commonly used formats are listed in table 1. For more media types and formats, see http://www.iana.org/assignments/media-types. Not all formats have officially registered media types. As part of the implementation of the European INSPIRE directive, which establishes a European Union spatial data infrastructure, the need for a register of media types for geographic information has been identified. This register is at http://inspire.ec.europa.eu/media-types.

| Format | Media type | Remarks |
|---|---|---|
| ATOM | application/atom+xml | |
| CSV | text/csv | |
| Esri Shapefile | (none) | Shapefiles consist of multiple files, so there is not one media type. INSPIRE uses ‘application/x-shapefile’ for a zip archive that contains at least the shp, shx and dbf files. |
| GeoJSON | application/json | The dedicated media type application/geo+json has since been registered (RFC 7946). |
| GeoRSS | application/rss+xml or application/atom+xml | GeoRSS has no media type of its own; since it extends RSS and ATOM, their media types are used. |
| GeoServices JSON | application/json | |
| GIF | image/gif | |
| GML | application/gml+xml | |
| HTML | text/html | |
| JPEG | image/jpeg | |
| JSON | application/json | |
| KML | application/vnd.google-earth.kml+xml | This is for uncompressed KML. |
| KMZ | application/vnd.google-earth.kmz | Compressed KML. |
| Microsoft Excel | application/vnd.ms-excel | |
| Microsoft Word | application/msword | |
| PDF | application/pdf | |
| PNG | image/png | |
| RDF-Turtle | text/turtle | |
| RDF-XML | application/rdf+xml | |
| SVG | image/svg+xml | |
| XHTML | application/xhtml+xml | |
| XML | application/xml | XML is often used to define other formats that have their own media types. |

Table 1: Media types for commonly used data encodings for documents, alphanumerical data and geographic data

Strictly speaking, content negotiation is sufficient to ask for a certain media type. However, it is also possible to make the media type part of the URI, which helps when putting a URL in the address bar of a browser or when clicking on a link. This has the advantage that it is clear up front which format to expect. This approach is common practice on the web. It results in several URIs, one for each representation. Typically this is done using suffixes (e.g. ‘.rdf’, ‘.html’) or query parameters (e.g. ‘f=json’, ‘outputFormat=application/xml’).

For an HTML representation the URI of the object identified by http://example.com/id/roads/a12 could be http://example.com/id/roads/a12.html and for a JSON representation http://example.com/id/roads/a12.json.
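
The suffix approach can be sketched as a small lookup. The suffixes chosen here are assumptions for illustration (a publisher is free to pick others); the media types are those of table 1.

```python
from typing import Optional

# Illustrative suffixes (an assumption); media types as listed in table 1.
SUFFIX_MEDIA_TYPES = {
    ".html": "text/html",
    ".json": "application/json",
    ".rdf": "application/rdf+xml",
    ".ttl": "text/turtle",
    ".gml": "application/gml+xml",
}


def representation_uri(identifier_uri: str, suffix: str) -> str:
    """Derive the URI of one representation from the identifier URI."""
    if suffix not in SUFFIX_MEDIA_TYPES:
        raise ValueError(f"unsupported suffix: {suffix}")
    return identifier_uri + suffix


def media_type_for(uri: str) -> Optional[str]:
    """Return the media type implied by the suffix of a URI, if any."""
    for suffix, media_type in SUFFIX_MEDIA_TYPES.items():
        if uri.endswith(suffix):
            return media_type
    return None


print(representation_uri("http://example.com/id/roads/a12", ".html"))
print(media_type_for("http://example.com/id/roads/a12.json"))
```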

Technology for resolving HTTP URIs

HTTP Basics

The most relevant HTTP methods for Linked Data are HTTP GET and HTTP HEAD. HTTP GET retrieves the resource at the URI, for example a document. HTTP HEAD returns only the response headers, so a client can discover whether a resource is available at that URI or at another location without retrieving it. Other HTTP methods, like POST and PUT, are intended for modifying data, which is out of scope for the pilot and this document.

If a client sends an HTTP GET request, a (web) server is expected to get that resource and return it to the client. When a user enters a URI in a browser, the browser normally sends an HTTP GET request. The server uses status codes to tell the client what the response is. For example, if the resource is found, the HTTP code 200 is returned. If the resource is not found, the HTTP code 404 is returned. For all HTTP response codes, see the HTTP specification.

HTTP 303 redirects and 303 URIs

If the resource is not available at the requested location, but the web server knows where to find it, the server responds to an HTTP GET or HEAD request with the location of the resource. This mechanism is also known as redirection. The HTTP code 303 is the appropriate code for redirects in the context of linked data. Note that for end users this redirection mechanism isn't visible in most cases. Web browsers resolve redirects automatically for the end user.

For linked data, 303 redirects are used to redirect a URI, which identifies an object, to a document about the object. URIs using 303 redirects are called 303 URIs [Tom Heath and Christian Bizer, 2011]. Conceptually this redirection is consistent. A web server can't return the (real-world) object that is identified by a URI, but it can say where to find information on that object. So if a URI is resolved, the response is a location of a description of that object.
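
The decision a server makes for a 303 URI can be modelled without any networking. In the sketch below, the ‘/doc/’ path for documents, the suffixes and the set of known objects are assumptions made for this example; only the ‘/id/’ pattern comes from the examples in this paper.

```python
from http import HTTPStatus

BASE = "http://example.com"

# Assumptions for this sketch: documents live under /doc/ and carry a suffix.
DOCUMENT_SUFFIXES = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "application/json": ".json",
}
KNOWN_OBJECTS = {"/id/roads/a12"}


def respond(path: str, accept: str = "text/html"):
    """Return (status, headers) for a GET or HEAD request on an identifier URI."""
    if path not in KNOWN_OBJECTS:
        return HTTPStatus.NOT_FOUND, {}
    # Unknown media types fall back to the HTML document in this sketch.
    suffix = DOCUMENT_SUFFIXES.get(accept, ".html")
    location = BASE + path.replace("/id/", "/doc/", 1) + suffix
    # The object itself cannot be returned, so point at a document about it.
    return HTTPStatus.SEE_OTHER, {"Location": location}


status, headers = respond("/id/roads/a12", "application/rdf+xml")
print(int(status), headers["Location"])
```

A client that follows the Location header then retrieves the document with an ordinary GET request, which is answered with HTTP 200.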

Testing URIs

The validator at http://validator.linkeddata.org/vapour tests whether URIs (of resources) are resolvable, use the 303 mechanism and support content negotiation, and it provides links to other validators. This validator helps to understand these mechanisms.

APIs for accessing resources

Overview

HTTP provides general methods to request and manipulate documents on the internet. For some applications this is sufficient, but (standardized) APIs can be very helpful for building applications, for example to perform queries on collections of resources. APIs are typically closely linked with a format or family of formats.

If an API supports retrieving an individual resource through a single HTTP GET request, URIs can be mapped to the API directly. Publishing data for linking may then require no more than a mapping between the URI pattern of the data and the API requests. Others can then follow these links to inspect or use the data directly.

SPARQL

The most important query interface on RDF representations of data is SPARQL. This language, an official W3C recommendation, syntactically resembles SQL. With SPARQL a client matches triple patterns to retrieve data from collections in triple stores. SPARQL also allows federated queries, which combine results from multiple SPARQL endpoints. Since SPARQL is designed for RDF and RDF relies on links, accessing linked data through SPARQL endpoints can be very useful.
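
As an illustration, a SPARQL query can be sent to an endpoint as the ‘query’ parameter of an HTTP GET request. The sketch below only builds the request URL; the endpoint http://example.com/sparql and the class http://example.com/def/Road are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and vocabulary, for illustration only.
ENDPOINT = "http://example.com/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?road ?label
WHERE {
  ?road a <http://example.com/def/Road> ;
        rdfs:label ?label .
}
LIMIT 10
"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY.strip())
```

The WHERE clause shows the triple patterns mentioned above: each pattern consists of a subject, a predicate and an object, where variables start with a question mark.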

Many semantic web tools and software libraries offer support for SPARQL.

The OGC has standardized an extension to SPARQL to support spatial predicates, GeoSPARQL.

WFS

The most important query interface on GML representations of data is the Web Feature Service (WFS). WFS is an OGC [OGC 09-025r1] and ISO standard for retrieving and manipulating geographic information, in particular GML, over the internet. To access geographic information, the WFS specification describes operations that perform queries on collections of geographic objects. Both desktop clients and software libraries in the geospatial domain support WFS.

One of the WFS operations fetches a GML object using an HTTP GET request. If URIs are mapped to these requests, for example by a 303 redirect from a URI to a WFS request, a GML representation of an object can be accessed through its URI: the URI redirects to the WFS request, which in turn returns the GML representation.
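
Such a mapping from an identifier URI to a WFS request might look as follows. This sketch assumes a WFS 2.0 server that offers the GetFeatureById stored query; the endpoint http://example.com/wfs and the rule that builds the feature identifier ‘roads.a12’ from the URI path are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

WFS_ENDPOINT = "http://example.com/wfs"  # hypothetical WFS location
GET_FEATURE_BY_ID = "urn:ogc:def:query:OGC-WFS::GetFeatureById"


def wfs_request_for(identifier_uri: str) -> str:
    """Map .../id/{collection}/{local id} to a WFS 2.0 GetFeature request URL."""
    collection, local_id = identifier_uri.rstrip("/").split("/")[-2:]
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "storedquery_id": GET_FEATURE_BY_ID,
        # Assumption: feature identifiers have the form {collection}.{local id}.
        "id": f"{collection}.{local_id}",
    }
    return WFS_ENDPOINT + "?" + urlencode(params)


url = wfs_request_for("http://example.com/id/roads/a12")
print(url)
print(parse_qs(urlsplit(url).query)["id"])
```

A server handling the identifier URI could return this URL in the Location header of a 303 response.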

WFS also defines mechanisms to deal with XLinks, to resolve local and remote references. This way a WFS server can resolve data that links to other data in one go, and queries may include predicates on linked objects. For accessing (linked) geospatial data this can be a powerful and useful mechanism.

GeoServices REST API

The most important query interface on GeoServices JSON representations of data is the GeoServices REST API. The GeoServices REST API was originally developed by Esri for their ArcGIS software, to access geospatial data, in particular in GeoServices JSON format. The GeoServices REST API offers, among others, methods to query resource collections using HTTP.

The GeoServices REST API provides structured patterns for URIs to request resources, but does not offer any extra linking capabilities, or server side mechanisms to resolve links to (other) resources.
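
A query on a resource collection through this API could be built as sketched below. The service location, the layer number and the field name ROAD_NUMBER are hypothetical; the sketch assumes the common pattern in which a layer of a feature service has a ‘query’ resource that accepts ‘where’, ‘outFields’ and ‘f’ parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical service location; only the URI pattern matters here.
SERVICE = "http://example.com/arcgis/rest/services/Roads/FeatureServer"


def query_url(layer: int, where: str) -> str:
    """Build a query URL for one layer that asks for a JSON response."""
    params = {"where": where, "outFields": "*", "f": "json"}
    return f"{SERVICE}/{layer}/query?" + urlencode(params)


url = query_url(0, "ROAD_NUMBER = 'A12'")  # ROAD_NUMBER is a made-up field
print(url)
```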

Linking data: provide context to data

To fully leverage linked data, it is important to link the data itself to other data: to create links from the dataset published and express what the link is about. This provides context to the data published and helps others understand the data. This section discusses what information on a link is important, to provide context.

Links to discover more

The fourth principle of Linked Data as defined by Tim Berners-Lee is: ‘Include links to other URIs so that they can discover more things.’ Linking to other information is useful to provide context. A link to another resource can be created for several reasons, for example:


to define a relation with a resource (‘this object is part of object X’ or ‘this object extends object X’), or to add properties to a resource;
to point to a definition or type (‘this object is of type A, which is defined here’);
to point to associated information;
to refer to documentation in another format or language (‘for an HTML description in English, see here’).

Links help people and computers to better understand and process the data. Links provide context and allow more, related information to be discovered.

Providing just the URI as a link might not be enough for others to understand and use the link. Is the link, for example, an object identifier, a definition or background information? For others to discover more things in the context of linked data, it is important to provide information on the semantics of the link. This helps a computer to process data, e.g. to automatically provide more information on a resource, and helps a human to understand what the link is about, e.g. by displaying a human-readable title for a link, providing titles in multiple languages, or pointing to the definition used to describe the thing.

Information about links

When following links, it can be important to know up front something about what the link points to. Vocabularies and descriptions that express commonly used relation types and concepts of links are available on the web. Examples are FOAF and SKOS from the RDF world (RDF makes use of vocabularies, and reusing vocabularies published by others is common practice) and, for the web in general, the relation types defined by IANA (http://bit.ly/hLkROI), as used by for example ATOM.

In addition, some information on the resource that the link points to can help both machines and humans to process the link further. For example:

the media type of the referred resource, so that a computer might process the document accordingly;

the language of the referred resource (e.g. English, Dutch), so that a user knows which language to expect.

Not all formats have good support for links and for describing links. Formats like CSV and Shapefile have no built-in linking capabilities, but they can still be useful to some extent. Linking to an entire file is possible. And if each row carries a unique code, it is still possible to use links to objects. For example, if the statistics office provides a URI for each statistical unit based on its NUTS code, then CSV data that includes NUTS codes effectively links to the statistical units, although additional knowledge is required to understand the links.
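
The ‘additional knowledge’ in the NUTS example is essentially the URI pattern. The sketch below makes that knowledge explicit; the URI template, the column names and the sample values are made up for illustration and are not taken from any statistics office.

```python
import csv
import io

# Hypothetical URI pattern for statistical units.
NUTS_URI_TEMPLATE = "http://example.com/id/nuts/{code}"

# Made-up sample rows; only the NUTS codes matter for the example.
CSV_DATA = """nuts_code,value
NL31,1
NL32,2
"""


def add_links(csv_text: str) -> list:
    """Read CSV rows and add an explicit link for each NUTS code."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["nuts_uri"] = NUTS_URI_TEMPLATE.format(code=row["nuts_code"])
        rows.append(row)
    return rows


rows = add_links(CSV_DATA)
for row in rows:
    print(row["nuts_code"], row["nuts_uri"])
```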

Other formats and frameworks can express the semantics of links more strongly. In RDF this is the URI of the relation type: if a link in an information resource about a person is classified as ‘http://xmlns.com/foaf/0.1/knows’, the link expresses that this person knows another person, identified by the URI. In GML the name of the property element expresses the type, e.g. <app:owner xlink:href=‘…’/> expresses that the value of the ‘app:owner’ property can be found at the location specified by xlink:href. In ATOM this is the value of the rel attribute, e.g. <link href=‘…’ rel=‘previous’/> expresses that the previous resource in a series can be found at the location specified by href.
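
For ATOM, the link information discussed in this section (relation type, media type, language and title) can be read with a few lines of Python. The entry below is a made-up example; the document URIs under ‘/doc/’ are assumptions.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Made-up ATOM entry with two links to representations of the same object.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <link href="http://example.com/doc/roads/a12.html" rel="alternate"
        type="text/html" hreflang="en" title="Road A12"/>
  <link href="http://example.com/doc/roads/a12.rdf"
        type="application/rdf+xml"/>
</entry>"""


def describe_links(entry_xml: str) -> list:
    """Collect what each link says about its target before following it."""
    entry = ET.fromstring(entry_xml)
    links = []
    for link in entry.iter(ATOM_NS + "link"):
        links.append({
            "href": link.get("href"),
            # In ATOM a link without a rel attribute counts as 'alternate'.
            "rel": link.get("rel", "alternate"),
            "type": link.get("type"),
            "hreflang": link.get("hreflang"),
            "title": link.get("title"),
        })
    return links


links = describe_links(ENTRY)
for link in links:
    print(link["rel"], link["type"], link["href"])
```

A client can use the ‘type’ value to pick the representation it can process and the ‘hreflang’ and ‘title’ values to present the link to a user.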

References


OGC 09-025r1 (2010). OpenGIS Web Feature Service 2.0 Interface Standard. Open Geospatial Consortium.

Tom Heath and Christian Bizer (2011). Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.

Uniform resource identifier. Retrieved June 7, 2013, from http://en.wikipedia.org/wiki/Uniform_resource_identifier
