Data quality assessment

It is essential to check the quality of the dataset as a whole, as well as of the individual data elements, before moving on in the process of opening data. Based on an extensive literature review carried out earlier, we propose to check, for each data instance, the following quality aspects identified by Nousak and Phelps and by Knight and Burn:

  • Validity, the extent to which information is correct and reliable.
  • Completeness, the extent to which information is not missing (e.g. all required data elements are given).
  • Consistency, the extent to which information is presented in the same format, is compatible with previous data, and is free from variation and contradiction based on the condition of another data element.
  • Uniqueness, whether the data element is unique, meaning that there are no duplicate values.
  • Timeliness, the extent to which the information is sufficiently up-to-date.
  • Accuracy, whether the data element values are properly assigned and free of error; it describes the closeness between a value v and a value v′ considered as the correct representation of the reality that v aims to portray (see the sketch after this list).
  • Preciseness, whether the data element is used only for its intended purpose, i.e., the degree to which the data characteristics are well understood and correctly utilized.

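To make the notion of closeness in the accuracy aspect concrete: a simple string-similarity measure can quantify how far a stored value v lies from the correct value v′. A minimal sketch in Python, using only the standard library's difflib; the example values are hypothetical:

```python
from difflib import SequenceMatcher

def closeness(v: str, v_correct: str) -> float:
    """Similarity in [0, 1] between a stored value v and the value v'
    considered the correct representation of the reality v portrays."""
    return SequenceMatcher(None, v, v_correct).ratio()

# A misspelled value is close to, but not equal to, the correct one.
print(closeness("Jhon", "John"))  # 0.75: inaccurate, but nearly correct
print(closeness("John", "John"))  # 1.0: fully accurate
```
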

The metrics provided in Table 3 can be used to measure the data elements on each of the data quality aspects. We provide one example metric for each data quality dimension; an extensive list of metrics can be found in Appendix A. The practical hints column gives an indication of how the quality of the data elements can be improved with respect to that aspect.

Table 3. Example metrics and practical hints per data quality dimension.

| Dimension | Metrics | Practical hints |
|---|---|---|
| Validity | e.g. no syntax errors | Use validation tools such as http://validator.linkeddata.org/vapour |
| Completeness | e.g. rate of missing concepts/columns/elements | Cross-checking, external data acquisition, extension with statistical models, statistical smoothing techniques |
| Consistency | e.g. usage of homogeneous data types; are elements being reused to overcome overlaps? | Check definitions for ambiguity, whether they are self-explanatory, and for variation in wording; assess whether terms and definitions are in line with the business vocabulary used in practice, for instance by validating them with business people |
| Uniqueness | e.g. number of bogus owl:InverseFunctionalProperty values | Detect duplicates (using algorithms or tools) and resolve the issues where needed |
| Timeliness | e.g. stating the recency and frequency of data validation | Update and validate the data more frequently |
| Accuracy | e.g. number of incidents or malfunctions, comparison with reality, the likelihood that the information extracted from the data is correct, number of outliers, number of semantically incorrect values | Analyse consistency and apply likelihood controls; record the degree of reliability in the metadata |
| Preciseness | e.g. the depth of knowledge encoded by the data | Check semantic precision, e.g. do not use "name" when only surnames are meant |
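
To illustrate how such metrics can be computed in practice, the sketch below scores a small table on completeness, uniqueness and timeliness. It is a minimal Python example using only the standard library; the records, field names and the 30-day freshness window are illustrative assumptions, not prescribed by Table 3:

```python
from datetime import date

records = [
    {"id": 1, "name": "Main St 1", "updated": date(2024, 5, 1)},
    {"id": 2, "name": None,        "updated": date(2024, 5, 3)},
    {"id": 2, "name": "Main St 3", "updated": date(2023, 1, 9)},
]

# Completeness: rate of non-missing values for a given column.
missing = sum(1 for r in records if r["name"] is None)
completeness = 1 - missing / len(records)

# Uniqueness: share of records whose key value occurs exactly once.
ids = [r["id"] for r in records]
uniqueness = sum(1 for i in ids if ids.count(i) == 1) / len(records)

# Timeliness: share of records updated within an (assumed) 30-day window.
today = date(2024, 5, 15)
timely = sum(1 for r in records if (today - r["updated"]).days <= 30)
timeliness = timely / len(records)

print(f"completeness={completeness:.2f}, "
      f"uniqueness={uniqueness:.2f}, timeliness={timeliness:.2f}")
```

On this toy input the scores are 0.67, 0.33 and 0.67 respectively, which could then be reported in the metadata as discussed below.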

The data owner might decide to improve the quality of data elements that score low on one or more of the quality aspects by following the improvement suggestions and/or by data cleansing. However, this is not required. Whether the quality of a dataset is high or low, it is always valuable to describe the actual data quality of the dataset in the metadata, e.g. in terms of the data quality aspects described above. This allows users of the dataset to judge whether the quality is good enough for their purpose.
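
One lightweight way to publish such a description is to embed the quality scores in the dataset's metadata record, for instance as a JSON document; structured vocabularies such as the W3C Data Quality Vocabulary (DQV) exist for this purpose. The sketch below renders one possible shape in Python; all field names and values are illustrative assumptions, not a prescribed schema:

```python
import json

# Hypothetical quality section of a dataset's metadata record; the
# dimensions follow the aspects above, the scores come from the
# example computation, and every field name is illustrative.
metadata = {
    "dataset": "example-open-dataset",
    "quality": {
        "assessedOn": "2024-05-15",
        "completeness": {"score": 0.67, "metric": "rate of missing elements"},
        "uniqueness": {"score": 0.33, "metric": "share of non-duplicate keys"},
        "timeliness": {"score": 0.67, "metric": "records updated within 30 days"},
    },
}

print(json.dumps(metadata, indent=2))
```

With such a record attached to the dataset, users can judge up front whether, say, a completeness of 0.67 is acceptable for their purpose.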