Merge CSV headers in R

Aug 8, 2019

Usually when you work with data from Public Data Services like INE (Spanish Statistical Office) you have to deal with a Excel files with a non usable format: metadata in the first rows, notes at the bottom, empty columns, empty rows, etc. You have to do a small data processing in order to start working.

Everyone knows the benefits of the CSV format, but precisely these metadata are there for one reason: unlike CSV where we only have the data itself, if we open this file within a few months this metadata will let you know exactly what data it contains or where it comes from. For this reason, I think that sometimes it is not a bad idea that the first phase of working with data is reading directly from an Excel format (instead of CSV).

For these cases I usually use R due to its reproducible workflow and of course because of the whole universe of packages, specially Tidyverse, an opinionated collection of R packages designed for data science.

Well, days ago I had to deal with a dataset that had a format with a problem I didn't face before, it had a double header like the ones shown in the following picture:


Capture from the Spanish Statistical Office site. Unemployment rate by ages and year quarters.

I spent hours searching the web looking for a solution but I didn't find any. I finally opted for create a function to clean this dataset.

My goal was to merge both rows and ended up with a column names as Menores de 25 años_2019T2, Menores de 25 años_2019T1 or Menores de 25 años_2018T4, i.e. combine the two rows with an underscore. I will show step by step a case in which this function could be used. If you prefer, you can go directly to the full script.

First of all, I read a xlsx file with the amazing read_xlsx from Tidyverse and skip the first 6 lines.

The function itself:

I am completely sure that there must be a cleaner way or an R package that solves this problem but I have been unable to find it.

Basically, what combineHeaders does is to store in the variable types those column names of the first row in a vector (discarding those column names that are numbers, as R / RStudio puts by default).

Then iterates through all the column names in the first row (the original one from which we have extracted types), creates a default index (variable z) initialized to 1 and creates a variable called type that points to the first element of types through z value.

In this iteration, it assigns the first element of the types vector (the value of the type) to the first elements. If a column name is equal to the next types element, it increases the value of z by one and updates type.

Then I assign the vector result of the funcion as the new csv header and apply some dplyr's magic pipes for cleaning the data, renaming the first column and create a couple of variables from the first column.


Capture of the data frame cleaned.

Then I convert all the data frame to long format, divide the previous headers into two new variables using separate and parse temporal variables to a date format:


Capture of the data frame tidied_data.

And finally a plot using ggplot. The code

And the result:


Capture of the final plot.