Where needed, the quality of data elements can be improved by data cleansing. Datasets are like raw material: they must first be refined before they become useful. Data cleaning (also referred to as cleansing or scrubbing) is the process of fixing errors, transforming and homogenizing formats, resolving inconsistencies in data and metadata, removing duplicate and redundant information, adding missing information, and ensuring the information is up to date. One concrete example is removing white space and empty cells from a dataset and identifying missing data. In the data mining literature, considerable research has been done on data cleansing, especially in the field of anomaly detection. We will not dive into this field of research in this report, but only give some practical pointers: the tools for actually doing data cleansing. A wide range of cleansing tools (both commercial and open source) can be found on the web. These are a few examples:
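Before reaching for a dedicated tool, several of the steps mentioned above (stripping white space, homogenizing formats, removing duplicates, and identifying missing data) can be sketched in a few lines. The snippet below is a minimal illustration using the pandas library; the example dataset and its column names are hypothetical, not taken from this report.

```python
import pandas as pd

# Hypothetical example dataset with common quality issues:
# stray white space, inconsistent capitalization, a duplicate
# row, and a missing value.
df = pd.DataFrame({
    "name": [" Alice ", "Bob", "bob", None],
    "city": ["Amsterdam", "  Utrecht", "utrecht", "Den Haag"],
})

# Strip surrounding white space and homogenize the format (lower case).
for col in df.columns:
    df[col] = df[col].str.strip().str.lower()

# Remove rows that became exact duplicates after normalization.
df = df.drop_duplicates()

# Identify missing data per column so it can be filled in or flagged.
missing = df.isna().sum()
```

After normalization the two "Utrecht" rows collapse into one, and the count in `missing` flags the empty cell in the `name` column for follow-up.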