I (--Pieter (overleg) Jan 29 15:10 (CET)) have updated the content of this wiki page with the topics we have discussed during the LOD2 afternoon session at the PiLOD event at the VU in Amsterdam.
Section 2 has been updated with the names of the participants of the session and section 3 has been updated with the feedback we got on our questions. Section 5 has been added with the other questions that were asked during the session.
Feel free to add or change any content on this wiki page if you feel that any relevant information is missing or wrong.
We are currently setting up a PiLOD Data Platform for research, test and development purposes to support the themes and cases we are working on within the PiLOD context.
Additionally we would like to offer a publishing location for linked datasets that cannot be published via other data portals, the so-called orphan linked data sets. Data owners can then publish and maintain their linked data sets themselves with the help of linked data experts.
We have investigated our preferences in tooling and formats when it comes to open, linked and big data and we would like to work with e.g.:
See the next overview with clusters of functionalities, formats and tools (in Dutch).
Figure 1: PiLOD Data Platform Overview (source: Freshheads)
A second overview we have produced is a Front End - Back End schema (in Dutch), which gives another view on this topic, looking at the different data sources, the required functionalities of the platform to support full data cycles, and the applications developers would like to create in the Front End.
Figure 2: PiLOD Data Platform Schema (source: Gerard Persoon)
For the linked data part, our community has experience with Google Refine, D2R, TopBraid Composer, AllegroGraph, Virtuoso, Silk and other tools, and we are looking for ways to broaden and deepen our “LOD literacy” with the experiences that are available in other data communities.
An important requirement in the selection of any tooling is that the software is fully open source and that it has an active and credible community, which uses and maintains the tooling, also in relation to other tooling that could be part of a Data Platform stack.
We are now building our PiLOD Data Platform from scratch, but we are very keen to find out what we can do to reuse best practices and insights from existing data platforms, software stacks and data communities that could save us time, energy and money.
Hence our interest in the LOD2 stack that could possibly give us a jumpstart for our linked data activities and tooling requirements if we could reuse the parts of the LOD2 stack that would fit our Platform requirements. LOD2 has a lot of usable material available.
The following text has been updated with the information we got from the session:
The following LOD2 experts were present at the session:
And their organization's websites:
And the following PiLOD members:
Our questions regarding LOD2 fall into the following categories:
Q: What can we learn from the LOD2 experiences in setting up a stack and platform that could save us time, energy and money?
A: We did talk about this as a separate topic during the afternoon session.
Q: We would like to use Hadoop for raw data. How can we make that work with the LOD2 Stack?
A: Hadoop is not part of the LOD2 stack. The components of the LOD2 stack comply with the W3C LOD standards and Hadoop does not comply with those standards. However, Hadoop is used in the Elsevier LOD2 environment in relation to Virtuoso. You might also want to use the Unified Views ETL tool or a SPARQL mapper to bring Hadoop data into an LOD2 environment, and a MongoDB (JSON) export to Virtuoso can also be built. See also e.g.:
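As a minimal sketch of the MongoDB (JSON) export route mentioned above, the Python below converts flat JSON records into N-Triples that a store such as Virtuoso can load. The namespaces and field names are assumptions for illustration, not actual PiLOD or BAG URIs.

```python
import json

# Hypothetical namespaces; adjust these to your own URI strategy.
BASE = "http://data.pilod.nl/resource/"
VOCAB = "http://data.pilod.nl/vocab/"

def record_to_ntriples(record):
    """Turn one flat JSON record (e.g. from a MongoDB export) into N-Triples lines."""
    subject = f"<{BASE}{record['id']}>"
    lines = []
    for key, value in record.items():
        if key == "id":
            continue
        # This sketch emits literal values only; nested objects would need their own URIs.
        escaped = str(value).replace('\\', '\\\\').replace('"', '\\"')
        lines.append(f'{subject} <{VOCAB}{key}> "{escaped}" .')
    return lines

# Example export with two fields per record (field names are illustrative).
export = json.loads('[{"id": "bag-123", "woonplaats": "Amsterdam", "bouwjaar": "1921"}]')
triples = [line for rec in export for line in record_to_ntriples(rec)]
print("\n".join(triples))
```

The resulting N-Triples file could then be bulk-loaded into the triple store, or the same mapping could be expressed as a Unified Views pipeline step.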
Q: We would like to make the data in our triple stores more accessible via domain-specific APIs, so that we can use parameterized SPARQL queries in the front end. For instance, for the Base Registration of Addresses and Buildings (the BAG registry) we would like to have pre-formatted Domicile (Dutch: woonplaats), Residency (Dutch: verblijfplaats) and Premises (Dutch: pand) SPARQL queries. Does LOD2 have any experience with scenarios like these, and do you know of any best practices we could look at to investigate this topic in further detail?
A: LOD2 does not contain such constructs, but it can become a best practice to build such APIs on top of a triple store like Virtuoso. Also use context-sensitive mappings, so that you only show relevant details in the front end, given your target audience of users. The LOD2 Stack also does not contain the Linked Data API (LDA) or the ELDA variant, but it would be good to look at these kinds of solutions in further detail. See also:
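A hedged sketch of such a pre-formatted query behind a domain-specific API: a SPARQL template for the "buildings per domicile" case, filled only after the parameters are validated (which also guards against SPARQL injection). The class and property URIs are placeholders, not the actual BAG vocabulary.

```python
import re

# Template with placeholder vocabulary URIs; double braces escape literal SPARQL braces.
WOONPLAATS_QUERY = """
SELECT ?pand ?bouwjaar WHERE {{
  ?pand a <http://example.org/bag/Pand> ;
        <http://example.org/bag/woonplaatsNaam> "{naam}" ;
        <http://example.org/bag/bouwjaar> ?bouwjaar .
}} LIMIT {limit}
"""

def build_woonplaats_query(naam, limit=100):
    """Fill the template after validating the parameters."""
    if not re.fullmatch(r"[A-Za-z\- ']+", naam):
        raise ValueError("invalid place name")
    return WOONPLAATS_QUERY.format(naam=naam, limit=int(limit))

query = build_woonplaats_query("Amsterdam", limit=10)
```

A thin HTTP layer (one endpoint per pre-formatted query) would then expose this to front-end developers without requiring them to know SPARQL.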
Q: We would like to use Elasticsearch for searching all kinds of documents. How can we make that work with the LOD2 Stack? We would also like to have faceted search.
A: Search will be further improved within the LOD2 Stack with a Solr-based solution. You can also look at SIREn. See also:
Q: There is no index for RDF endpoints. Given an app that uses multiple data sources, for instance a tourism app for the city of Rotterdam, how can you make the app work properly so that it knows which RDF endpoints to access? What are the LOD2 experiences with this topic, and do you know of any best practices?
A: We did not discuss this topic during the session.
Q: How can you make two stacks work properly together? For instance the LOD2 Stack with the PiLOD Data Platform, which uses additional tooling. Would you need an index on top of the two stacks? And what else? Also, from a developer's point of view we would like to develop read-write interfaces using JSON-LD, possibly with tools like Apache Marmotta and/or Callimachus (these tools are on our list for further investigation), and we also look at the W3C LOD best practices and the activities of the Linked Data Platform (LDP) Working Group. LOD2 takes a data cycle as the anchoring mechanism to position the LOD2 software components in a logical structure, but what if you take an application lifecycle, with the related tools, as the starting point? Could we then make (or reuse, when available) an overview of how the LOD publishing, finding and consuming parts would meet each other in a new structure?
A: LOD2 is an open stack with mostly open source components that can be integrated with other tools.
Note: The administrator topics were not discussed during the session.
No base registrations: we are not the official publication portal for base registrations, although we might use copies or subsets of base registrations for development and testing purposes. We might give owners of orphan linked datasets a ‘home’ to publish their datasets in our environment (e.g. energy labels that are available as linked data from the VU, but that should be maintained by their owners).
Do we keep to the list of tools we have determined so far during our PiLOD Data Platform session in Tilburg, or would we like to extend this list with tooling from the LOD2 Stack if it fits our requirements?
What login policy? Login and/or 2-factor authentication? What user/administrator credentials? What are the LOD2 experiences with security? Etc.
Who are allowed to install software packages? What are our policies regarding compiling software? Which administrators need root access? Where do we locate documentation on the platform?
What kind of administrators do we need for PiLOD? What minimal number of administrators do you need? What are the LOD2 experiences on this topic? Which PiLOD participants want to do administrator tasks?
GUI and/or Unix prompt? What is used by LOD2 Stack administrators for the LOD2 Stack implementations in e.g. Austria?
Q: To get more acquainted with the LOD2 Stack we would like to make use of a sandbox environment, so that we can make well-founded decisions on what tooling would fit our requirements and which LOD2 packages we would then need.
A: We did not discuss this topic during the session.
We have prioritized the LOD2 questions, given the available time during the afternoon session, as follows:
The administrator questions can be limited to best practices on how to set up platform administration in such a way that it is as efficient and as little time-consuming as possible.
During the afternoon session several questions were asked that we did not prepare.
Q: When we convert normalized data in relational databases to RDF, we get an explosion of URIs that we don't want. One object gets many URIs, given the details we would like to know about it, including the changes that have happened over time (from/to dates, time stamping). What best practices would you recommend?
A: There is no single answer to that, but not every record field in a database should get a URI. In general it is good practice to break open structures and link them to make them more accessible. In any case you create a machine-readable version of your database (documentation in RDF). This is a subject for further investigation. See also:
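One common way to limit the URI explosion described above is to mint a single stable URI per real-world object, plus one URI per time slice for the from/to history, rather than a URI per record field. A minimal sketch (the namespace and identifier scheme are hypothetical examples):

```python
# Hypothetical base namespace for minted identifiers.
BASE = "http://example.org/id/"

def object_uri(obj_type, obj_id):
    """Stable, version-independent URI for the real-world object."""
    return f"{BASE}{obj_type}/{obj_id}"

def version_uri(obj_type, obj_id, valid_from):
    """URI for one time slice of the object (from/to registration)."""
    return f"{object_uri(obj_type, obj_id)}/version/{valid_from}"

# One building keeps one stable URI; each registered change gets a version URI.
pand = object_uri("pand", "0363100012345678")
pand_2010 = version_uri("pand", "0363100012345678", "2010-01-01")
```

Field values then become literals (or links) attached to these two kinds of resources, so the number of URIs grows with objects and changes, not with database columns.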
The text in this section also refers to some of the remarks that Phil Archer made during his presentation.
Q: Does all data need to become linked data?
A: No, not all data needs to become linked data. Use linked data when it is needed for the complexity you are dealing with. In some cases linked data is the only possible solution to the problem we are trying to solve, and there is no discussion about the business case then. See also:
Q: We expect to have some very large triple stores for some of the cases we are working on. What about the scalability of the LOD2 Stack components?
A: The Virtuoso 7 triple store is built for scalability. It supports a clustered approach and uses various optimization techniques. See also:
Q: Does the LOD2 stack offer any data quality services?
A: No, but it is a good practice to improve the quality of data via crowdsourcing (a feedback loop). E.g. Point of Interest (POI) information in Austria with the wrong height information was corrected by users of that data. Let people query your data to find anomalies in e.g. DBpedia articles, and if they find any, let them report these to you and stimulate them to find more anomalies (if any) in related articles that might be of interest to them.
DBpedia has created a lot of sample queries from which people can easily start building their own SPARQL queries.
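As an illustration of the kind of starter query meant here, the sketch below prepares a simple request for the public DBpedia SPARQL endpoint. The query itself is an example written for this page, not one of DBpedia's own samples (the `dbo:` prefix is predefined on the DBpedia endpoint).

```python
from urllib.parse import urlencode

# A simple starter query: cities and their population totals.
query = """
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
} LIMIT 10
"""

# The public endpoint accepts the query and an output format as GET parameters.
params = urlencode({"query": query, "format": "application/sparql-results+json"})
url = "http://dbpedia.org/sparql?" + params
```

Fetching `url` (e.g. with `urllib.request.urlopen`) would return the results as SPARQL JSON, which is easy to consume from a front end.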
Q: Does the LOD2 Stack offer any link-extracting tools, like the ones we use for our legislation data to find references to articles?
A: No, that is not part of the LOD2 Stack, but PiLOD can contact John Sheridan in the UK, who is working on a similar project within an EU context.
Q: Are there any applications built with the LOD2 Stack similar to the "House Safe" (Dutch: Huiskluis)?
A: No, the cadastral data is not open in Austria. But in general the city of Vienna is the city to watch when it comes to linked open data. They are the most active city in Austria when it comes to publishing datasets and developing applications. See the Austrian Open Government Data portal for more details:
LOD2 delivers horizontal technology that can be used for a large number of vertical cases.
Q: What will happen after June 2014 when the LOD2 program ends?
A: The EU-LOD2 program is working with several partners to keep the LOD2 community alive and to assure continuity in support for the LOD2 Stack after June 2014 when the LOD2 program ends. In any case support for the LOD2 components will be delivered by the LOD2 consortium partners for their specific components.
The LOD2 Stack becomes the Linked Data Stack in 2014. See also:
Q: Where can we drop questions?
A: You can drop questions to the LOD2 team via the LOD2 site for e.g. new tools that you would like to add to LOD2 or you can contact your contacts at one of the LOD2 consortium partners directly for specific support questions.
Q: What support is available on the LOD2 Stack?
A: PUBLINK is a free LOD2 consultancy service, backed by the LOD2 consortium partners, to help organizations start working with the LOD2 Stack.
For us it would be very helpful to make use of this offer to get our platform to the next level of maturity with tools like Virtuoso 7, OntoWiki and the experiences in the LOD2 community with Hadoop in relation to the Elsevier LOD2 activities.
These tools are prominent in our discussions when we talk about our PiLOD Data Platform and when we talk about development activities.
JSON-LD is a lightweight Linked Data format. It is easy for humans to read and write. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale. JSON-LD is an ideal data format for programming environments, REST Web services, and unstructured databases such as CouchDB and MongoDB.
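A minimal example document showing how the "@context" turns ordinary-looking JSON into linked data (the names and URIs below are illustrative, borrowing schema.org terms):

```python
import json

# The "@context" maps plain JSON keys to URIs; "@id" names the resource itself.
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "homepage": {"@id": "http://schema.org/url", "@type": "@id"},
    },
    "@id": "http://example.org/people/pieter",
    "name": "Pieter",
    "homepage": "http://example.org/",
}

serialized = json.dumps(doc, indent=2)
```

Any JSON consumer can use `name` and `homepage` as ordinary keys, while a JSON-LD processor can expand them to full URIs and treat the document as RDF.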
Semantic MediaWiki (SMW) is an extension to MediaWiki that allows for annotating semantic data within wiki pages, thus turning a wiki that incorporates the extension into a semantic wiki. Data that has been encoded can be used in semantic searches, used for aggregation of pages, displayed in formats like maps, calendars and graphs, and exported to the outside world via formats like RDF and CSV.
Semantic Web Company (SWC) is a technology provider headquartered in Vienna (Austria). SWC supports organizations from all industrial sectors worldwide to improve their information management. Our core product PoolParty has outstanding capabilities to extract meaning from big data by making use of linked data technologies.
The LOD2 Stack has been renamed to the Linked Data Stack in 2014. The LOD2 Stack comprises a number of tools for managing the life-cycle of Linked Data. The life-cycle comprises in particular the stages: 1) extraction of RDF from text, XML and SQL, 2) querying and exploration using SPARQL, 3) authoring of Linked Data using a Semantic Wiki, 4) semi-automatic link discovery between Linked Data sources, and 5) knowledge-base enrichment and repair.
An application programming interface (API) is a set of definitions that allows a computer program to communicate with another program or component (usually in the form of libraries). APIs often form the separation between different layers of abstraction, so that applications can work at a high level of abstraction and delegate the less abstract work to other programs. This means that, for instance, a drawing program does not need to know how to control the printer, but instead calls a specialized piece of software in a library via a print API.
Linked data offers a set of best practices for publishing, sharing and linking data and information on the web. It is based on use of http URIs and semantic web standards such as RDF.
For some web developers the need to understand the RDF data model and associated serializations and query language (SPARQL) has proved a barrier to adoption of linked data. This project seeks to develop APIs, data formats and supporting tools to overcome this barrier. Including, but not limited to, accessing linked data via a developer-friendly JSON format.
Resource Description Framework (RDF) is a standard model for data interchange on the web. RDF has features that facilitate merging data even when the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all data consumers to be changed.
DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in some new interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.