Before moving on in the process of opening data, it is essential to check the quality of both the dataset as a whole and its individual data elements. Based on an earlier extensive literature review, we propose checking the following quality aspects, identified by Nousak and Phelps and by Knight and Burn, for each data instance:
The metrics in Table 3 can be used to measure the data elements on each of the data quality aspects. We provide one example metric for each data quality dimension; an extensive list of metrics can be found in Appendix A. The "Practical hints" column indicates how the quality of the data elements can be improved with respect to that aspect.
Dimension | Metrics | Practical hints |
---|---|---|
Validity | e.g. no syntax errors | Use validation tools, such as http://validator.linkeddata.org/vapour |
Completeness | e.g. rate of missing concepts/columns/elements | Cross-checking, external data acquisition, extension with statistical models, statistical smoothing techniques |
Consistency | e.g. usage of homogeneous data types | Check definitions for ambiguity, whether they are self-explanatory, and for variation in wording. Are elements being reused to overcome overlaps? Assess whether terms and definitions are in line with the business vocabulary used in practice, for instance by validating them with business people. |
Uniqueness | Bogus owl:InverseFunctionalProperty values | Detect uniqueness violations (by using algorithms or tools) and resolve consistency issues where needed. |
Timeliness | Stating the recency and frequency of data validation | Update and validate the data more frequently. |
Accuracy | Number of incidents or malfunctions, comparison with reality, the likelihood that the information extracted from the data is correct, number of outliers, number of semantically incorrect values. | Analysis of consistency and likelihood controls. Meta-data: degree of reliability |
Preciseness | The depth of knowledge encoded by the data. | SMART? Also consider semantic precision: do not use “name” when only surnames are meant. |
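
Three of the example metrics in the table lend themselves to a simple automated check. The following is a minimal sketch, assuming the data is available as a list of records in which `None` marks a missing value. The sample records and the function names `missing_rate`, `type_counts` and `duplicate_keys` are hypothetical and serve only as illustration. They are not part of the guideline or of any specific tool.

```python
from collections import Counter

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "surname": "Jansen", "year": 2014},
    {"id": 2, "surname": None, "year": "2015"},
    {"id": 2, "surname": "de Vries", "year": 2016},
    {"id": 4, "surname": "Bakker", "year": None},
]

def missing_rate(rows, column):
    """Completeness: share of rows in which the column has no value."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)

def type_counts(rows, column):
    """Consistency: how often each Python type occurs among present values."""
    return Counter(
        type(row[column]).__name__
        for row in rows
        if row.get(column) is not None
    )

def duplicate_keys(rows, key):
    """Uniqueness: key values that occur in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

for column in ("id", "surname", "year"):
    print(column, missing_rate(records, column), dict(type_counts(records, column)))
print("duplicate ids:", duplicate_keys(records, "id"))
```

A column with more than one entry in `type_counts` (here `year`, which mixes integers and strings) points to non-homogeneous data types. A non-empty result from `duplicate_keys` points to a uniqueness problem that the data owner may want to resolve.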
The data owner might decide to improve the quality of data elements that score low on one or more of the quality aspects, by following the improvement suggestions and/or by data cleansing. However, this is not required. Whether the quality of a dataset is high or low, it is always valuable to describe its actual data quality in the metadata, e.g. in terms of the data quality aspects described above. This allows users of the dataset to judge whether the quality is good enough for their purpose.
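
As an illustration of describing data quality in the metadata, the sketch below builds a small metadata record with one measured value per quality dimension and serialises it as JSON. The field names (`quality`, `assessed_on`, `dimensions`), the dataset title, the date and the values are all assumptions made for this example. They do not follow a particular metadata standard, so a publisher would map them onto whatever metadata vocabulary is already in use.

```python
import json

def quality_metadata(dataset_title, assessed_on, scores):
    """Return a metadata record describing measured quality per dimension.

    `scores` maps a dimension name to a (metric description, value) pair.
    The field names are illustrative, not a standard vocabulary.
    """
    return {
        "title": dataset_title,
        "quality": {
            "assessed_on": assessed_on,
            "dimensions": [
                {"dimension": name, "metric": metric, "value": value}
                for name, (metric, value) in sorted(scores.items())
            ],
        },
    }

record = quality_metadata(
    "Example dataset",  # hypothetical title
    "2015-06-01",       # hypothetical assessment date
    {
        "Completeness": ("rate of missing elements", 0.25),
        "Uniqueness": ("number of duplicate identifiers", 1),
    },
)
print(json.dumps(record, indent=2))
```

Publishing the measured values, instead of only a label such as "high" or "low", lets each user apply their own threshold when judging fitness for purpose.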