Towards a portal for the verse of the unheard

  • Who to listen to?
    • Rural women poets
    • Poets who spend the bulk of their lives trying to eke out daily wages, yet find time to write
    • University students whose primary focus is a career but who still have an inexplicable urge to write poetry
    • Unlettered poets
  • Guiding rails
    • Aim for the poetic lull behind the noise – the statistical asymptote – of unheard voices; here is where databases, categorization and visualization come in
    • Chime with global and historical well-lit poetry of oppression/dissent:  Latin American;  black (esp. Africa and US); feminist; Bhakti etc.
    • Work with mainstream poets, esp. regional ones, without highlighting their work
    • Along the lines of A. K. Ramanujan’s collection of Indian folktales and P. Sainath’s People’s Archive of Rural India, but for verse
  • Where to look for?
    • Rural spaces where marginalized poets speak
    • Universities
    • Katchi abadis (informal urban settlements)
    • The unlit mushaira/gathering of poets
  • Who to work with?
    • Regional poets with social justice sensibilities
    • Social justice activists with poetic sensibilities
    • Academics with social justice and poetic leanings
  • Challenges
    • Creating effective quality filters which are open to the myriad voices without damping them down
    • Identifying the spaces in ‘Where to look for?’
  • Nuts & bolts
    • Blog/wiki for poetry collection
    • A platform for data collection via browsers and phones using the tools I have built (if blog or wiki is not up to the task)
    • An incremental portal that learns its categories from the data collected

New Visualization/Portal Ideas

  • Bhakti, Race & Gender: black poets, esp. US (Audre Lorde, Gwendolyn Brooks); feminist poetry; early South Indian movements/saints/poets all the way to the fifteenth-century North Indian Kabir; try and put their work in Indian/regional/global historical context, esp. that of movements and dissent
  • Towards a portal for the verse of the unheard
  • Visualize Hamza Alavi’s work
  • Making complex math and physics concepts accessible to 12 year olds
  • Medical/public health portals in regional languages

Data preparation for HDI vs. IFF visualization

After coming across this Guardian article on Illicit Financial Flows by Jason Hickel, I thought of doing a visualization on the topic. His 2014 article “Flipping the Corruption Myth” was part of my education in leaning towards social justice. I had written to him then to see if we could collaborate on a visualization project, but received only silence.

Since GFI had this PDF on their website, a visualization seemed doable, as there are a number of tables in the appendix. I played around with a few ideas for the visual and finally settled on the same template I have used for several other visuals (HDI vs. SPI, Sindh Health and Literacy Comparison, and Pakistan HDI vs. Election Results). This type of visual uses the general-purpose switch.js JavaScript program, which I will describe in another post.

This is the end result: Africa: Human Development Index (HDI) versus Illicit Financial Flows (IFF)

Scraping table data from PDF to spreadsheet

Over the years, I have used a couple of ways of extracting table data from PDFs. Using Adobe Acrobat is one; I have also used online converters for quick jobs. This time I found a Java-based interactive program called Tabula which, after downloading and trying out, seemed to work fine. I won’t go into the details of using the program as it is pretty straightforward (it did require a bit of trial and error to get the tables extracted in CSV format, but nothing that requires elaboration).

Once I had the tables in CSV format, I had to convert the country names to codes, since the Africa map in SVG format I use in the visual is keyed by three-letter ISO 3166-1 alpha-3 codes (there was a fair amount of preprocessing for the SVG file as well after downloading a suitable Africa map from the web; that is deferred for another post). The three-letter codes can be collected from any number of places, and the Wikipedia page is as good as any.

To get the country codes into the spreadsheet, matching has to be done by country name, which is never an exact process, so a little manual matching is needed. The final result is here (note that we need only Table 2 for our visual. I also wrote an R script to take the data from Tables 4 and 5 into a couple of CSV files ready to be imported into a MySQL table, but that was not needed for the visual I decided to do. It was a fun data conversion exercise though).
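Most of the name-to-code matching can be automated with fuzzy string matching, leaving only the stubborn names for manual fixing. A minimal sketch using Python’s standard difflib; the handful of name-to-code entries below is a hypothetical sample of what one would collect from the Wikipedia list:

```python
import difflib

# Hypothetical sample of name -> ISO 3166-1 alpha-3 mappings,
# as collected from the Wikipedia list.
iso_codes = {
    'Nigeria': 'NGA',
    'Egypt': 'EGY',
    "Cote d'Ivoire": 'CIV',
    'Congo, Democratic Republic of the': 'COD',
}

def match_code(name, codes=iso_codes, cutoff=0.6):
    """Return the alpha-3 code for the closest-matching country name,
    or None so the row can be flagged for manual fixing."""
    hit = difflib.get_close_matches(name, codes, n=1, cutoff=cutoff)
    return codes[hit[0]] if hit else None

print(match_code('Nigeria'))      # exact match: NGA
print(match_code('Ivory Coast'))  # too dissimilar: None, fix by hand
```

Lowering the cutoff catches more spelling variants at the cost of more false matches, so the `None` fallback for manual review is the safety net.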

I then used HDI data downloaded from the UNDP site (containing 2015 estimates based on 2014 data). Again, we only need the first worksheet from this dataset for our visual. A little cleaning and country coding of the HDI data, and combining it with Table 2 from GFI, gives the final dataset needed for our visual.
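The combining step is a straightforward join on the country code. A sketch with the standard csv module, under the assumption of hypothetical column names (‘code’, ‘hdi’, ‘iff’); the real columns come from the UNDP worksheet and the GFI table:

```python
import csv

# Hypothetical file and column names; the real inputs are the cleaned
# UNDP HDI worksheet and Table 2 from the GFI report, both as CSV.
def combine(hdi_path, iff_path, out_path):
    # Index HDI values by country code for the join.
    with open(hdi_path) as f:
        hdi = {row['code']: row['hdi'] for row in csv.DictReader(f)}
    with open(iff_path) as f, open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['code', 'hdi', 'iff'])
        for row in csv.DictReader(f):
            if row['code'] in hdi:  # keep only countries present in both
                writer.writerow([row['code'], hdi[row['code']], row['iff']])
```

An inner join drops countries missing from either source, which is what the visual needs since both axes must be populated.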

Improving existing visualizations

Aman gave some feedback on improving the existing visualizations on….

Aman: I’ve just copy-pasted the content of my e-mail here and we can take it from here, to figure out which tasks are more urgent / juicy / “low-hanging fruit”…
Broadly, I divided my comments into those concerning the phenomena being considered (which, how, time periods, space – where, who and to whom), and, on the other hand, comments on visualization, suggesting how to make certain interactions more intuitive or informative.


  • Political timeline:
    • New event categories in political timeline: tax policy, international trade agreements, aid and national debt (foreign and internal), industrial policy, environmental policy, military procurement and recruitment, labour policy, energy policy, “agitational” events (by which I mean things like labour strikes and factory occupations, “payya-jaam” attempts by parties and movements, marches, rallies and dharnas (again, by parties and movements) )
    • Adding event data to the political timeline data file
    • Augmenting the information about the event: links to multimedia, recommended reads (likely to require some tweaking to the way the event info is presented)
    • Event scoring: adding components to the scoring (could be: economic significance, change in power relationship between rulers and ruled, change in relative strength of military and civilian power, change in balance between major national institutions, constitutional rights (freedom of speech, association, religion, checks on abuse of power by law enforcement institutions), socio-economic rights); some components may apply only to certain event categories; the interface allows the user to modify the weights of the components, and the final score is the normalised sum of products.
  • Sindh Health and Literacy Comparison by District
    • Add health workers per capita to the information on each district
    • Is data on numbers on midwives and trained “traditional birth attendants” (daiees) available somewhere? Shirkatgah or FPAP might have compiled such figures.
    • Related to the health rubric: mortality rates, prevalence of indicative diseases, some measure of transport availability; the statistics bureau report doesn’t have figures on nutrition/food consumption (only production), which is a pity.
  • Slums
    • Migration by reason: somehow it seems more intuitive to me to have the reasons on the left and the destinations on the right (unless they’re places of origin and not destination?)
    • Migration by place of birth: Punjab, Sindh and NWFP all have flows where the place of birth is the same as the province. Could this be a data entry error? Or maybe there’s something I’m missing.
    • Evictions and guttings: it would be so revelatory to have data on the characteristics of the projects and populations that replaced the evicted people. Kind of project (residential, commercial, military and/or governmental), income levels of new residents, new population density, new resource consumption levels, political tendencies of new population. But that would require some serious survey/report-trawling work.
  • South Asian vulnerabilities
    • Ecological indicators (forest cover, endangered animal and plant species, frequency, intensity and extent of flooding, water-carrying capacity of major rivers, encroachment of the sea into river deltas, extremes of temperature, precipitation and humidity, frequency of storms and cyclones)
  • Transport indicators (by country and province, by mode of transport): some measure of network density, passengers carried, km per capita, road vehicles per capita, cost / km / person on most heavily used routes, average travel time on most heavily used routes, percentage of population in areas with least service
  • Indexing Naked Punch articles (!)
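The event-scoring scheme proposed above (user-adjustable weights, normalised sum of products, components applying only to some categories) can be sketched in a few lines. Component names and numbers here are illustrative, not a fixed schema:

```python
# A minimal sketch of the proposed event scoring, assuming each event
# carries scores only for the components that apply to it and the user
# supplies the weights.
def event_score(components, weights):
    """components: {name: score} for this event; weights: {name: weight}
    set by the user. Returns the weighted sum of scores normalised by
    the total weight of the components that apply."""
    total = sum(weights[name] * score for name, score in components.items())
    norm = sum(weights[name] for name in components)
    return total / norm if norm else 0.0

# An event scored on two of three possible components:
event = {'economic_significance': 8, 'constitutional_rights': 4}
weights = {'economic_significance': 2, 'constitutional_rights': 1,
           'civil_military_balance': 3}
print(event_score(event, weights))  # (2*8 + 1*4) / (2 + 1) = 20/3
```

Normalising by the weights of only the applicable components keeps events with few scored components comparable to fully scored ones.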


  • Political timeline: possibility to select multiple categories, possibility to use a slider to specify the time window to filter events.
  • South Asian vulnerabilities: ability to flip the variables and the countries, so that one may choose multiple variables (as long as the scales are compatible) for one country
  • More work like this awesome visualization, i.e., linking phenomena.
  • Really advanced: animate spatial visualizations to depict changes over time, happening across regions and continents!


NakedPunch visual navigation

This process was started some time ago, hit some snags, and is now back on track. The idea was to categorize the articles and create a visual navigation page.

This Python script scrapes the data off the NakedPunch site and writes to standard output a ‘%’-separated text file with URL, Title, Author and Blurb columns. Most of the work is done by the BeautifulSoup library.


import requests
from bs4 import BeautifulSoup

prefix = ''  # base URL of the article listing page

print('url%title%author%blurb')

# Walk the paginated article listing.
for page in range(1, 20):
    url = prefix + "?page=" + str(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    alldivs = soup.find_all('div', 'article-summary')

    for div in alldivs:
        url = div.h4.a['href'].strip()
        title = div.h4.text.strip()
        author = div.div.text[3:].strip()  # remove leading 'by '
        blurb = '"' + div.p.text.strip() + '"'
        print("%".join([url, title, author, blurb]))

Each entry could then be tagged…
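One possible shape for that tagging step: read the ‘%’-separated listing back in and append a tag column keyed on the URL. The tag mapping and the sample row below are hypothetical:

```python
import csv

# Hypothetical tagging step: reads the scraper's '%'-separated output
# (header line first) and appends a 'tag' column keyed on the URL.
def tag_articles(lines, tags):
    reader = csv.reader(lines, delimiter='%')
    header = next(reader)
    yield header + ['tag']
    for row in reader:
        # row[0] is the URL; fall back to 'untagged' for unknown entries.
        yield row + [tags.get(row[0], 'untagged')]

sample = ['url%title%author%blurb',
          '/articles/1%On Dissent%A. Writer%"A sample blurb"']
for row in tag_articles(sample, {'/articles/1': 'politics'}):
    print('%'.join(row))
```

Reusing the csv module with `delimiter='%'` also handles the quoted blurb field for free.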