Example Step 2: Prepare the data

In this step the data is extracted from the core information systems, filtered, anonymized, aggregated and documented.

The selected dataset is a combination of data from different database tables in Liander’s information systems. There is a table containing the standardized annual usage of gas and electricity per household and tables with metadata about the connections, the installed meters and the customers. This data cannot be published as is, because we need to apply the restrictions defined above. Therefore, we create a new database (see table) with a copy of the required data using SQLStructured Query Language (SQL). When copying the data we can already filter for small users with a SQL WHERE-clause. Below is a snapshot of the resulting table. Note that the column names and the data itself are in Dutch. However, even if the data was in English, it contains all sorts of codes, abbreviations and special terminology. Documentation is required to understand the dataWeergave van een feit, begrip of aanwijzing, geschikt voor overdracht, interpretatie of verwerking door een persoon of apparaat.

EAN Postcode Huisnummer Straatnaam Woonplaats Land Product Richting Type SJV_Normaal SJV_Laag Type_Meter
55581503 7231JT 24 ’t Spiker WARNSVELD NL ELK CMB 3x25 3586 DUN
17866103 7231JT 24 ’t Spiker WARNSVELD NL GAS CMB G4 1574 DUN
8662423 7522AV 27 Minister Kuyperplein ENSCHEDE NL ELK LVR 3x25 1399 1499 CVN
15126093 7522AV 27 Minister Kuyperplein ENSCHEDE NL GAS LVR G4 2662 CVN

Now, the data still needs to be anonymized by aggregating and averaging the energy usage for both electricity (ELK) and gas (GAS) per postcode area and removing the EAN codes and house numbers that identify individual consumers. Because we aggregate we cannot simply copy the values in each column. The service direction (RICHTING), for example, can have any of three values: LVR (Levering = consumption), TLV (Teruglevering = production) or CMB (Combination). In this case, it is decided to replace this by the percentage of entries with value “LVR”. For the connection TYPE, we copy the value that occurs most within an area and add a column indicating the percentage of households with this type of connection. The energy usage values, SJV_NORMAAL and DJV_LAAG are added and then averaged over the postcode area. Finally, a new column is added with the number of connections within an area. All these operations should be clearly documented to enable users to interpret the data correctly. A snapshot of the resulting dataset is provided in the table below.

Straatnaam PostcodeVan PostcodeTot Woonplaats Land Product Aantal %Richting %Type Type SJV %Laag %SlimmeMeter
Rijksweg A44 1000AA 1011AB NIEUW VENNEP NL ELK 31 100 29 3x25 16245 38,71 16,13
De Ruyterkade 1011AC 1011AC AMSTERDAM NL ELK 32 100 31 3x25 11433 28,13 15,63
't Spiker 7231JS 7231JT WARNSVELD NL ELK 24 75 54 3x25 3764 41,67 0
't Spiker 7231JS 7231JT WARNSVELD NL GAS 20 100 100 G4 2615 0 0
't Spiker 7231JS 7231JT WARNSVELD NL ELK 16 100 88 1x25 2425 0 0
't Spiker 7231JS 7231JT WARNSVELD NL GAS 16 100 100 G4 1626 0 0

Finally, it is useful to export the data from the database table to a more open format, such as a comma separated (CSV) file.

Go back to example overview