In this step the data is extracted from the core information systems, filtered, anonymized, aggregated and documented.
The selected dataset combines data from several database tables in Liander’s information systems. One table contains the standardized annual usage of gas and electricity per household; other tables hold metadata about the connections, the installed meters and the customers. This data cannot be published as is, because the restrictions defined above must be applied first. Therefore, we create a new database (see table) with a copy of the required data using Structured Query Language (SQL). When copying the data we can already filter for small users with a SQL WHERE clause. Below is a snapshot of the resulting table. Note that the column names and the data itself are in Dutch. However, even if the data were in English, it would still contain all sorts of codes, abbreviations and special terminology. Documentation is required to understand the data.
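The copy-and-filter step can be sketched as follows. This is only an illustration using Python's built-in `sqlite3` module with an in-memory database: the table names (`connections`, `annual_usage`, `publication_source`), the `SEGMENT` column and its `'SMALL'` value are assumptions made for the example and do not reflect Liander's actual schema or its definition of a small user.

```python
import sqlite3

# In-memory stand-in for the core information systems (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE connections (EAN TEXT PRIMARY KEY, POSTCODE TEXT, PRODUCT TEXT, SEGMENT TEXT);
CREATE TABLE annual_usage (EAN TEXT, SJV_NORMAAL INTEGER);
INSERT INTO connections VALUES
  ('E1', '1011AB', 'ELK', 'SMALL'),
  ('E2', '1011AB', 'ELK', 'LARGE'),
  ('E3', '1011AC', 'GAS', 'SMALL');
INSERT INTO annual_usage VALUES ('E1', 3200), ('E2', 150000), ('E3', 1400);
""")

# Copy only the required columns into a new table and keep small users only.
# The WHERE clause is where the filter is applied; 'SMALL' is an assumed marker.
conn.execute("""
CREATE TABLE publication_source AS
SELECT c.EAN, c.POSTCODE, c.PRODUCT, u.SJV_NORMAAL
FROM connections AS c
JOIN annual_usage AS u ON u.EAN = c.EAN
WHERE c.SEGMENT = 'SMALL'
""")

rows = conn.execute(
    "SELECT EAN, POSTCODE, PRODUCT, SJV_NORMAAL FROM publication_source ORDER BY EAN"
).fetchall()
print(rows)
```

Filtering during the copy keeps large users out of the working database altogether, so later steps cannot publish them by accident.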
Now, the data still needs to be anonymized by aggregating and averaging the energy usage for both electricity (ELK) and gas (GAS) per postcode area and removing the EAN codes and house numbers that identify individual consumers. Because we aggregate, we cannot simply copy the values in each column. The service direction (RICHTING), for example, can have any of three values: LVR (Levering = consumption), TLV (Teruglevering = production) or CMB (Combination). In this case, we decided to replace it with the percentage of entries with value “LVR”. For the connection TYPE, we copy the value that occurs most often within an area and add a column indicating the percentage of households with this type of connection. The energy usage values, SJV_NORMAAL and DJV_LAAG, are added and then averaged over the postcode area. Finally, a new column is added with the number of connections within an area. All these operations should be clearly documented to enable users to interpret the data correctly. A snapshot of the resulting dataset is provided in the table below.
|Rijksweg A44||1000AA||1011AB||NIEUW VENNEP||NL||ELK||31||100||29||3x25||16245||38,71||16,13|
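The aggregation rules described above could be expressed roughly as in the sketch below. It is one reading of the text, not the actual implementation: grouping by postcode and product together, the output column names (`AANTAL`, `PCT_LVR`, `PCT_TYPE`, `AVG_USAGE`) and the sample records are all assumptions made for illustration. The EAN code and house number are simply not carried over into the output.

```python
from collections import Counter, defaultdict

def aggregate(records):
    """Aggregate connection records per (postcode, product) group."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["POSTCODE"], r["PRODUCT"])].append(r)

    result = []
    for (postcode, product), rows in sorted(groups.items()):
        n = len(rows)
        # Share of connections whose service direction is LVR (consumption).
        lvr = sum(1 for r in rows if r["RICHTING"] == "LVR")
        # Most frequent connection type and how many connections have it.
        common_type, type_count = Counter(r["TYPE"] for r in rows).most_common(1)[0]
        # Add both usage values per connection, then average over the group.
        total = sum(r["SJV_NORMAAL"] + r["DJV_LAAG"] for r in rows)
        result.append({
            "POSTCODE": postcode,
            "PRODUCT": product,
            "AANTAL": n,  # number of connections in the group
            "PCT_LVR": round(100 * lvr / n, 2),
            "TYPE": common_type,
            "PCT_TYPE": round(100 * type_count / n, 2),
            "AVG_USAGE": round(total / n, 2),
        })
    return result

# Made-up sample records; EAN and HUISNUMMER are present in the input only.
sample = [
    {"EAN": "E1", "HUISNUMMER": 1, "POSTCODE": "1011AB", "PRODUCT": "ELK",
     "RICHTING": "LVR", "TYPE": "3x25", "SJV_NORMAAL": 2000, "DJV_LAAG": 1000},
    {"EAN": "E2", "HUISNUMMER": 3, "POSTCODE": "1011AB", "PRODUCT": "ELK",
     "RICHTING": "LVR", "TYPE": "3x25", "SJV_NORMAAL": 1500, "DJV_LAAG": 500},
    {"EAN": "E3", "HUISNUMMER": 5, "POSTCODE": "1011AB", "PRODUCT": "ELK",
     "RICHTING": "TLV", "TYPE": "1x35", "SJV_NORMAAL": 600, "DJV_LAAG": 400},
    {"EAN": "E4", "HUISNUMMER": 7, "POSTCODE": "1011AB", "PRODUCT": "ELK",
     "RICHTING": "LVR", "TYPE": "3x25", "SJV_NORMAAL": 1200, "DJV_LAAG": 800},
]
summary = aggregate(sample)
print(summary)
```

When two connection types are equally frequent, `Counter.most_common` returns the one encountered first, which is worth stating in the documentation if ties can occur in practice.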
Finally, it is useful to export the data from the database table to a more open format, such as a comma-separated values (CSV) file.
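A minimal sketch of such an export, using Python's standard `csv` module, is shown below. The helper name `to_csv`, the column names and the sample row are hypothetical. The example writes to a string so it can be run as is; in practice the same writer can be pointed at a file opened with `newline=""`.

```python
import csv
import io

def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up aggregated row; the percentage uses a Dutch decimal comma.
rows = [{"POSTCODE": "1011AB", "PRODUCT": "ELK", "PCT_LVR": "75,5"}]
output = to_csv(rows, ["POSTCODE", "PRODUCT", "PCT_LVR"])
print(output)
```

Because the data uses decimal commas, any value containing a comma is quoted by the writer so that it is not mistaken for a field separator. Converting to decimal points, or choosing a different delimiter, are alternatives that should be documented if chosen.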