Aug 8, 2019
Usually when you work with data from Public Data Services like INE (Spanish Statistical Office) you have to deal with a Excel files with a non usable format: metadata in the first rows, notes at the bottom, empty columns, empty rows, etc. You have to do a small data processing in order to start working.
Everyone knows the benefits of the CSV format, but precisely these metadata are there for one reason: unlike CSV where we only have the data itself, if we open this file within a few months this metadata will let you know exactly what data it contains or where it comes from. For this reason, I think that sometimes it is not a bad idea that the first phase of working with data is reading directly from an Excel format (instead of CSV).
For these cases I usually use R due to its reproducible workflow and of course because of the whole universe of packages, specially Tidyverse, an opinionated collection of R packages designed for data science.
Well, days ago I had to deal with a dataset that had a format with a problem I didn't face before, it had a double header like the ones shown in the following picture:
I spent hours searching the web looking for a solution but I didn't find any. I finally opted for create a function to clean this dataset.
My goal was to merge both rows and ended up with a column names as
Menores de 25 años_2019T2,
Menores de 25 años_2019T1 or
Menores de 25 años_2018T4, i.e. combine the two rows with an underscore. I will show step by step a case in which this function could be used. If you prefer, you can go directly to the full script.
First of all, I read a xlsx file with the amazing
read_xlsx from Tidyverse and skip the first 6 lines.
The function itself:
combineHeaders does is to store in the variable
types those column names of the first row in a vector (discarding those column names that are numbers, as R / RStudio puts by default).
Then iterates through all the column names in the first row (the original one from which we have extracted
types), creates a default index (variable
z) initialized to
1 and creates a variable called
type that points to the first element of
In this iteration, it assigns the first element of the
types vector (the value of the
type) to the first elements. If a column name is equal to the next
types element, it increases the value of
z by one and updates
Then I assign the vector result of the funcion as the new csv header and apply some dplyr's magic pipes for cleaning the data, renaming the first column and create a couple of variables from the first column.
Then I convert all the data frame to long format, divide the previous headers into two new variables using
separate and parse temporal variables to a date format:
And finally a plot using ggplot. The code
And the result: