Where needed, the quality of data elements can be improved by data cleansing. Datasets are like raw material: they must first be refined before they become useful. Data cleaning (also referred to as cleansing or scrubbing) is the process of fixing errors, transforming and homogenizing formats, resolving inconsistencies in data and metadata, removing duplicate and redundant information, adding missing information, and ensuring the information is up to date. One concrete example is removing white space and empty cells from a dataset and identifying missing data. In the data mining literature, considerable research has been done on data cleansing, especially in the field of anomaly detection. We will not dive into this field of research in this report, but only give some practical pointers: the tools for actually doing data cleansing. A wide range of cleansing tools (both commercial and open source) can be found on the web. These are a few examples:
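Before reaching for a dedicated tool, several of the steps mentioned above (stripping white space, homogenizing formats, removing duplicates, and identifying missing data) can be sketched in a few lines. The snippet below is a minimal illustration using the pandas library; the example dataset and its column names are hypothetical, not taken from this report.

```python
import pandas as pd

# Hypothetical example dataset with common quality issues:
# stray white space, inconsistent capitalization, a duplicate
# row, and a missing value.
df = pd.DataFrame({
    "name": [" Alice ", "Bob", "bob", None],
    "city": ["Amsterdam", "  Utrecht", "utrecht", "Den Haag"],
})

# Strip surrounding white space and homogenize the format (lower case).
for col in df.columns:
    df[col] = df[col].str.strip().str.lower()

# Remove rows that became exact duplicates after normalization.
df = df.drop_duplicates()

# Identify missing data per column so it can be filled in or flagged.
missing = df.isna().sum()
```

After normalization the two "Utrecht" rows collapse into one, and the count in `missing` flags the empty cell in the `name` column for follow-up.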