After coming across this Guardian article on Illicit Financial Flows by Jason Hickel, I thought of doing a visualization on the topic. His 2014 article on “Flipping the Corruption Myth” was part of my education in leaning towards social justice. I had written to him back then to see if we could collaborate on a visualization project, but I never heard back.
This is the end result: Africa: Human Development Index (HDI) versus Illicit Financial Flows (IFF)
Scraping table data from PDF to spreadsheet
Over the years, I have used a couple of ways of extracting table data from PDFs. Using Adobe Acrobat is one. I have also used online tools for quick conversions. This time I found a Java-based interactive program called Tabula which, after downloading it and trying it out, seemed to work fine. I won’t go into the details of using the program as it is pretty straightforward (it did take a bit of trial and error to get the tables extracted in CSV format, but nothing that requires elaboration).
Once I had the tables in CSV format, I had to convert the country names to codes, since the Africa map in SVG format that I use in the visual identifies countries by their three-letter ISO 3166-1 alpha-3 codes (there was a fair amount of preprocessing for the SVG file as well, after downloading a suitable Africa map in SVG from the web; that is deferred to another post). The alpha-3 codes can be collected from any number of places, and the Wikipedia page is as good as any.
To get the country codes into the spreadsheet, matching has to be done on country names, which is never an exact process, so a little manual matching is needed. The final result is here (note that we need only Table 2 of this for our visual). I also wrote an R script to take the data from Tables 4 and 5 into a couple of CSV files ready to be imported into a MySQL table, but that was not needed for the visual I decided to do. It was a fun data conversion exercise, though.
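For anyone who would rather script the name matching than do it by hand in a spreadsheet, here is a minimal Python sketch of the idea. It is not what I actually did; the lookup table is a tiny illustrative subset, and the `MANUAL_OVERRIDES` entries and the `name_to_code` helper are names I made up for this example. The approach is: try an exact match, fall back to a fuzzy match from the standard library, and leave anything unresolved for manual review.

```python
import difflib

# Tiny illustrative subset of country name -> ISO 3166-1 alpha-3 code.
# A real lookup would cover every country in the tables.
ISO_ALPHA3 = {
    "Nigeria": "NGA",
    "South Africa": "ZAF",
    "Côte d'Ivoire": "CIV",
    "Democratic Republic of the Congo": "COD",
    "Tanzania": "TZA",
}

# Hypothetical hand-made overrides for spellings that fuzzy matching
# cannot be trusted to resolve.
MANUAL_OVERRIDES = {
    "Congo, Dem. Rep.": "COD",
    "Ivory Coast": "CIV",
}


def name_to_code(name, cutoff=0.8):
    """Return the alpha-3 code for a country name, or None if unresolved."""
    name = name.strip()
    if name in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[name]
    if name in ISO_ALPHA3:
        return ISO_ALPHA3[name]
    # Fall back to the closest known name, if it is similar enough.
    close = difflib.get_close_matches(name, list(ISO_ALPHA3), n=1, cutoff=cutoff)
    return ISO_ALPHA3[close[0]] if close else None


if __name__ == "__main__":
    for raw in ["Nigeria", "South Africa ", "Nigera", "Ivory Coast", "Xyz"]:
        print(repr(raw), "->", name_to_code(raw))
```

Names that come back as `None` are the ones that still need a manual look, which mirrors the "little manual matching" step above. The `cutoff` of 0.8 is an arbitrary choice; a lower value matches more names but risks wrong codes.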
I then used HDI data downloaded from the UNDP site (containing 2015 estimates based on 2014 data). Again, we only need the first worksheet from this dataset for our visual. After a little cleaning and country coding of the HDI data, and combining it with Table 2 from GFI (Global Financial Integrity), we get the final result needed for our visual.
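Once both tables carry the same alpha-3 code, combining them is a simple join on that code. The sketch below shows one way to do it in Python with only the standard library. The column names (`code`, `country`, `iff`, `hdi`) are assumptions for illustration, not the headers of the actual files, and the numbers are made-up placeholders rather than real IFF or HDI figures.

```python
import csv
import io

# Placeholder data only: these values are invented for the example.
IFF_CSV = """code,country,iff
NGA,Nigeria,1.0
ZAF,South Africa,2.0
TZA,Tanzania,3.0
"""

HDI_CSV = """code,hdi
NGA,0.1
ZAF,0.2
"""


def read_by_code(csv_text, code_col="code"):
    """Index the rows of a CSV string by their country code."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[code_col]: row for row in reader}


def merge_on_code(iff_rows, hdi_rows):
    """Join IFF and HDI rows on country code.

    Returns (merged, unmatched), where unmatched lists the codes
    present in the IFF table but missing from the HDI table.
    """
    merged, unmatched = [], []
    for code, iff in iff_rows.items():
        hdi = hdi_rows.get(code)
        if hdi is None:
            unmatched.append(code)
            continue
        merged.append({
            "code": code,
            "country": iff["country"],
            "iff": float(iff["iff"]),
            "hdi": float(hdi["hdi"]),
        })
    return merged, unmatched


if __name__ == "__main__":
    merged, unmatched = merge_on_code(read_by_code(IFF_CSV), read_by_code(HDI_CSV))
    for row in merged:
        print(row)
    print("No HDI value for:", unmatched)
```

Keeping the unmatched codes in a separate list makes it easy to spot countries that appear in one dataset but not the other, which is where most of the cleaning effort tends to go.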