Automated text categorization using Python: Links

Towards a portal for the verse of the unheard

  • Who to listen to?
    • Rural women poets
    • Poets who spend the bulk of their lives trying to eke out daily wages, yet find time to write
    • University students whose primary focus is a career but who still have an inexplicable urge to write poetry
    • Unlettered poets
  • Guiding rails
    • Aim for the poetic lull behind the noise – the statistical asymptote – of unheard voices; here is where databases, categorization and visualization come in
    • Chime with global and historical well-lit poetry of oppression/dissent: Latin American; black (esp. Africa and US); feminist; Bhakti etc.
    • Work with mainstream poets, esp. regional ones, without highlighting their work
    • Along the lines of A.K. Ramanujan’s collection of Indian folktales and P. Sainath’s People’s Archive of Rural India, but for verse
  • Where to look for?
    • Rural spaces where marginalized poets speak
    • Universities
    • Katchi abadis
    • The unlit mushaira/gathering of poets
  • Who to work with?
    • Regional poets with social justice sensibilities
    • Social justice activists with poetic sensibilities
    • Academics with social justice and poetic leanings
  • Challenges
    • Creating effective quality filters which are open to the myriad voices without damping them down
    • Identifying the spaces in ‘Where to look for?’
  • Nuts & bolts
    • Blog/wiki for poetry collection
    • A platform for data collection via browsers and phones using the tools I have built (if blog or wiki is not up to the task)
    • An incremental portal that learns its categories from the data collected

hm.Survey: data capture tool – how & why


I have been doing survey automation systematically since 2000, on both regular computers and hand-helds (which translates into cell phones and tablets today). More recently, I have generalized that bit of automation into a tool called hm.Survey, a template-based data entry tool that allows relatively quick generation of data entry programs and subsequent tabulation of the data.

The content under hm.Survey is in the process of being tidied up: moving from Google Docs to a Wiki, updating documents, clarifying text and removing obsolete pages. This introduction will give a background to that work.

The lead up to hm.Survey

Before hm.Survey, there was another program, written in Delphi in Riyadh in 2005 (there was so little to do there that, rather than twiddle my thumbs, I thought it best to write a program to automate the table production process instead of doing it by hand in SQL). I dubbed it hm.SurveyReport. It took an XML document as input and produced tabulated reports as output. The XML document described the fields to be used in each report, which allowed the Delphi program to construct the SQL statements and execute them. It was all fine and dandy, except of course that Delphi – the flagship product of the once popular Borland company, synonymous with top-notch compilers for C, C++ and Pascal – was dying off. At one point I was using Delphi in all my survey programs.
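The mechanism described above, a report specification that drives SQL generation, can be sketched in a few lines. This is only an illustration in Python rather than Delphi, and the XML layout (the `report`, `row` and `measure` elements and their attributes) is a hypothetical one made up for the example; the actual hm.SurveyReport format is not shown in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical report specification: every tag and attribute name here
# is an assumption, not the real hm.SurveyReport layout.
SPEC = """
<report name="firms_by_region" table="survey">
  <row field="region"/>
  <measure field="employees" func="SUM"/>
  <measure field="id" func="COUNT"/>
</report>
"""

ALLOWED_FUNCS = {"SUM", "COUNT", "AVG", "MIN", "MAX"}


def build_sql(spec_xml):
    """Turn one <report> element into a GROUP BY query string."""
    report = ET.fromstring(spec_xml.strip())
    rows = [r.get("field") for r in report.findall("row")]
    select_parts = list(rows)
    for m in report.findall("measure"):
        func = m.get("func").upper()
        if func not in ALLOWED_FUNCS:
            raise ValueError("unsupported aggregate: " + func)
        select_parts.append("{}({})".format(func, m.get("field")))
    sql = "SELECT {} FROM {}".format(", ".join(select_parts), report.get("table"))
    if rows:
        sql += " GROUP BY " + ", ".join(rows)
    return sql


print(build_sql(SPEC))
```

The point of the design is that a new report means a new specification, not new code: the program that reads the specification stays the same.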

Delphi’s slow death prompted me to start using DotNet (C# Express and SQL Server) for my survey data entry and reporting work in 2006 in Lebanon, because the Ministry of Industry there already had a Microsoft license for SQL Server. I pretty much rewrote the earlier Delphi data entry programs for Lebanon and later reused them for Tanzania and Iraq (there was a bit of generalization here as well, but not of the same order as the one I carried out for reports in hm.SurveyReport).

In parallel, I had been working on a data entry/indexing system for inflation, first in Delphi (Paradox DB, later MySQL), then in Java (desktop, with a MySQL database; it even has a servlet-based analysis engine which does all sorts of sophisticated regressions but has never been used). The Java version also had a handheld counterpart in the form of a Java/MIDP program (MIDP is now obsolete in favor of Android and iOS) that could run on cheap Nokias (and ran equally well on expensive ones).

To keep the work worthwhile, I wanted to generalize the survey program and its accompanying tabulations along the lines of the specification-based reporting work done earlier (in Delphi), but on a platform that was less ephemeral. Using Java was an option (Microsoft never was), but considering how the micro version (Java/MIDP) fell out of favor with the market, I could not find any compelling reason why other versions of Java would not follow suit (plus Oracle acquired Sun and hence Java, thus killing any remnant of an ideological bias for Java).

Hence the idea of specifications for data entry and reports that could reside in text files. These specifications would ‘trigger’ instances of programs rooted in platforms that weathered the vagaries of the marketplace. But if and when the platforms become obsolete, the specifications would not, and new instances could be spawned. That at least was the idea.

The Realization

I implemented this idea when the Iraq project was finishing up. By the time I wrote the programs (online data entry & tabulation with an option for using Androids for data entry) for a seasonal survey that was never implemented, the official UNIDO as well as Iraqi commitment to the project was falling off. So I salvaged what I could and built upon that. This was the first quarter of 2012.

It was in Oman in the third quarter of 2013 that I could put these ideas to use. Today, after three+ years of using the program there and another installation in Laos recently, two instances have been realized and tested, a number of significant changes have been made, and new things are being proposed.

So that was the how & why. In terms of the technologies used, the data entry clients are browser based, the ‘canonical’ server uses Python scripts running on top of an Apache web server on Ubuntu (I have implemented a Windows port using Microsoft SQL Server, but that has not been tested; also a Virtual Machine implementation on Windows running Ubuntu is operational in Laos). Two databases have been tested thoroughly: MySQL and Oracle. Underlying it all, the constants are specification files in text (JSON) format that describe the questionnaire, the validation checks and the output reports including the formulas used in them.
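To make the idea of a specification file a little more concrete, here is a minimal sketch of a JSON specification with two validation checks and a function that applies them to one record. The keys used (`questions`, `checks`, `rule` and so on) are assumptions invented for this example; the real hm.Survey schema is not reproduced here.

```python
import json

# Hypothetical specification: the key names and rule types below are
# assumptions for illustration, not the actual hm.Survey schema.
SPEC = json.loads("""
{
  "questions": [
    {"id": "q1", "label": "Number of employees", "type": "int"},
    {"id": "q2", "label": "Of which women", "type": "int"}
  ],
  "checks": [
    {"rule": "range", "field": "q1", "min": 0, "max": 100000},
    {"rule": "not_greater", "field": "q2", "than": "q1"}
  ]
}
""")


def validate(record, spec):
    """Return a list of human-readable errors for one data entry record."""
    errors = []
    for q in spec["questions"]:
        if q["type"] == "int" and not isinstance(record.get(q["id"]), int):
            errors.append(q["id"] + ": expected a whole number")
    if errors:
        return errors  # skip cross-field checks on malformed input
    for check in spec["checks"]:
        value = record[check["field"]]
        if check["rule"] == "range":
            if not check["min"] <= value <= check["max"]:
                errors.append(check["field"] + ": out of range")
        elif check["rule"] == "not_greater":
            if value > record[check["than"]]:
                errors.append(check["field"] + ": exceeds " + check["than"])
    return errors


print(validate({"q1": 40, "q2": 55}, SPEC))
```

Because the checks live in the text file, the same specification could in principle be read by a browser client, a server script or a future platform.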

Data preparation for HDI vs. IFF visualization

After coming across this Guardian article on Illicit Financial Flows by Jason Hickel, I thought of doing a visualization on the topic. His 2014 article on “Flipping the Corruption Myth” was part of my education in leaning towards social justice. I had written to him then as well to see if we could collaborate on some visualization project, and there was only silence.

Since GFI had this PDF on their website, a visualization seemed doable, as there are a number of tables in the appendix. I played around with a few ideas for the visual and finally settled on the same template that I have used for a number of other visuals (HDI vs. SPI, Sindh Health and Literacy Comparison, & Pakistan HDI vs Election Results). This type of visual uses the general switch.js JavaScript program, which I will describe in another post.

This is the end result: Africa: Human Development Index (HDI) versus Illicit Financial Flows (IFF)

Scraping table data from PDF to spreadsheet

Over the years, I have used a couple of ways of extracting table data from PDFs. Using Adobe Acrobat is one. I have also used online programs for quick conversions. This time I found a Java-based interactive program called Tabula which, after downloading and trying it out, seemed to work fine. I won’t go into the details of using the program as it is pretty straightforward (it did require a bit of trial and error to get the tables extracted in CSV format, but nothing that requires elaboration).

Once I had the tables in CSV format, I had to convert the country names to codes, since the Africa map in SVG format I use in the visual uses three-letter ISO codes (ISO 3166-1 alpha-3). (There was a fair amount of preprocessing for the SVG file as well, after downloading a suitable Africa map in SVG from the web; that is deferred to another post.) The alpha-3 codes can be collected from any number of places and the Wikipedia page is as good as any.

To get the country codes into the spreadsheet, matching has to be done by country name, which is never an exact process, so a little bit of manual matching has to be done. The final result is here (note that we need only Table 2 of this for our visual). I also wrote an R script to take data from Tables 4 and 5 into a couple of CSV files ready to be imported into a MySQL table, but that was not needed for the visual that I decided to do. It was a fun data conversion exercise though.
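I did the name matching by hand in the spreadsheet, but the inexact part can be partly automated. The following is a minimal sketch, assuming Python's standard `difflib` module for approximate matching; the `match_country` helper and the small code table are made up for the example, and anything that does not match is flagged for manual review rather than guessed.

```python
import difflib

# A tiny excerpt of ISO 3166-1 alpha-3 codes; a full table would be
# loaded from a file in practice.
ISO_CODES = {
    "Côte d'Ivoire": "CIV",
    "Democratic Republic of the Congo": "COD",
    "Republic of the Congo": "COG",
    "Tanzania": "TZA",
    "Nigeria": "NGA",
}


def match_country(name, codes=ISO_CODES, cutoff=0.6):
    """Return (code, matched_name), or (None, None) to flag manual review."""
    if name in codes:
        return codes[name], name
    # Fall back to the single closest name above the similarity cutoff.
    close = difflib.get_close_matches(name, list(codes), n=1, cutoff=cutoff)
    if close:
        return codes[close[0]], close[0]
    return None, None


print(match_country("Cote d'Ivoire"))
```

Returning the matched name alongside the code makes it easy to eyeball the approximate matches before trusting them.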

I then used HDI data downloaded from the UNDP site (containing 2015 estimates based on 2014 data). Again we only need the first worksheet from this dataset for our visual. A little cleaning and country coding of HDI data and combining it with Table 2 from GFI, and we get the final result needed for our visual.
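The combining step amounts to a join on the country code. Here is a minimal sketch using only the standard library; the column names (`code`, `iff`, `hdi`) and the sample rows are placeholders made up for the example, not the real figures.

```python
import csv
import io

# Made-up sample rows standing in for the two cleaned spreadsheets;
# column names and values are illustrative, not the real data.
IFF_CSV = "code,iff\nAAA,10.5\nBBB,3.2\nCCC,7.0\n"
HDI_CSV = "code,hdi\nAAA,0.51\nBBB,0.47\nDDD,0.60\n"


def read_by_code(text, value_column):
    """Map country code -> float value from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["code"]: float(row[value_column]) for row in reader}


def combine(iff, hdi):
    """Keep only the countries present in both tables, sorted by code."""
    return [
        {"code": code, "hdi": hdi[code], "iff": iff[code]}
        for code in sorted(iff.keys() & hdi.keys())
    ]


merged = combine(read_by_code(IFF_CSV, "iff"), read_by_code(HDI_CSV, "hdi"))
for row in merged:
    print(row)
```

Countries that appear in only one table (CCC and DDD in the sample) are dropped silently here; in real use it is worth listing them, since they are usually the ones with mismatched names.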

To Dos

  • Automatic notification on posts: Basic WordPress reserves its automated notification for comments only. For posts, I found some plugins and tried the most popular one, called Notification, but it gave an error on activation, so I removed it (I had to remove the folder using FTP). So we still have to figure out a way to get notified of postings, at least for the admins.
  • We also need to experiment with privacy settings, making either the entire blog private (for which a plugin is required) or just individual pages and posts (which core WordPress allows). Here is the official word on Content Visibility.

ramping up to programming

    • Scratch: Scratch helps young people learn to think creatively, reason systematically, and work collaboratively – essential skills for life in the 21st century. Scratch is a project of the Lifelong Kindergarten Group at the MIT Media Lab. It is provided free of charge.
    • Pre-processors for writing code in one’s native language: should be possible, I think, quite easily. In principle, each keyword and syntax construct would map to its corresponding version in the “actual” language.

For example:

عمر = ۱؛
جب تک (عمر < ۱۲) {
لکھو "میری عمر " + عمر + " سال ہے مگر میں بڑا ہو کر بڑا آدمی بنوں گا! "
}

Unfortunately, to get all the characters I needed, I had to use both the “standard” Urdu keyboard and the phonetic one.

Or, in French:
âge = 1;
jusqu_a(âge < 12){
écris("quand je serai grand, je serai un grand homme!");
}
This, oddly enough, turned out to be non-trivial, as keys like < and > as well as the plus sign are not easy to find on the French keyboard (too many keys taken over by accented characters, in my opinion!).
It’s much worse in Urdu, as WordPress’s display mechanism, at some point, totally screws up text direction.
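The keyword mapping idea can be sketched as a very small pre-processor. This is only an illustration: the keyword table is hypothetical (it reuses `jusqu_a` and `écris` from the French example and adds `si` and `sinon` as assumed extras), the target is a generic C-like language, and a real tool would need a proper tokenizer to handle comments, other quote styles and right-to-left text.

```python
import re

# Hypothetical keyword table: native-language keyword -> keyword in the
# "actual" language. The entries are assumptions for illustration.
KEYWORDS = {
    "jusqu_a": "while",
    "si": "if",
    "sinon": "else",
    "écris": "print",
}

# Match either a whole double-quoted string literal (left alone) or a word.
TOKEN = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"|\w+')


def translate(source, keywords=KEYWORDS):
    """Replace native keywords, leaving strings and other names untouched."""
    def swap(match):
        token = match.group(0)
        if token.startswith('"'):
            return token  # never rewrite text inside string literals
        return keywords.get(token, token)
    return TOKEN.sub(swap, source)


program = 'âge = 1;\njusqu_a (âge < 12) {\n  écris("si jeune!");\n  âge = âge + 1;\n}'
print(translate(program))
```

Variable names such as `âge` pass through unchanged, and the word `si` inside the string is not mistaken for a keyword because the whole string literal is matched first.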

  • Junaid teaches the “intro to computing” at Namal using a set of resources called Computer Science Unplugged. The idea is to introduce learners to “thinking like a machine” – a process which is non-intuitive (especially serial processing) before even getting into programming (though they do go there). To quote at some length from their site:

The primary goal of the Unplugged project is to promote Computer Science (and computing in general) to young people as an interesting, engaging, and intellectually stimulating discipline. We want to capture people’s imagination and address common misconceptions about what it means to be a computer scientist. We want to convey fundamentals that do not depend on particular software or systems, ideas that will still be fresh in 10 years. We want to reach kids in elementary schools and provide supplementary material for university courses. We want to tread where high-tech educational solutions are infeasible; to cross the divide between the information-rich and information-poor, between industrialized countries and the developing world.
There are many worthy projects for promoting computer science. The main principles that distinguish the Unplugged activities are:

1 No Computers Required
2 Real Computer Science
3 Learning by doing
4 Fun
5 No specialised equipment
6 Variations encouraged
7 For everyone
8 Co-operative
9 Stand-alone Activities
10 Resilient

But how is that relevant to our work? I’m not entirely sure, but as we’re building a repository of ideas, I thought that these go together in terms of making information processing skills accessible to a larger number of people.

hm.StructureMap: Categorization tool

I created this Java-based categorization tool somewhere around 2000. It is possible to use hm.StructureMap to tag content (files, URLs) – hierarchically if necessary.

I last updated the user guide way back in 2004, and this document should give a good idea of what the tool is about. I have used and updated the program since, but the changes are not significant.
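For readers who have not seen the tool, hierarchical tagging can be pictured with a toy sketch like the one below. It is written in Python rather than Java and is not how hm.StructureMap is implemented; the `TagIndex` class and the slash-separated tag paths are assumptions made up for the illustration.

```python
from collections import defaultdict


class TagIndex:
    """Toy hierarchical tag store: tags are paths such as 'poetry/urdu'."""

    def __init__(self):
        self._items = defaultdict(set)  # tag path -> set of files/URLs

    def tag(self, item, path):
        """Attach a file name or URL to one tag path."""
        self._items[path.strip("/")].add(item)

    def find(self, path):
        """Items tagged with `path` or with any tag nested under it."""
        path = path.strip("/")
        found = set()
        for tag_path, items in self._items.items():
            if tag_path == path or tag_path.startswith(path + "/"):
                found |= items
        return sorted(found)


index = TagIndex()
index.tag("notes.txt", "poetry/urdu")
index.tag("http://example.org/ghazal", "poetry/urdu/ghazal")
index.tag("report.pdf", "surveys")
print(index.find("poetry"))
```

Looking up a parent tag returns everything filed under its children as well, which is what makes the hierarchy useful for browsing.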

NakedPunch visual navigation

This process was started some time ago, hit some snags and is now back on track. The idea was to categorize the articles and create a visual navigation page.

This Python script scrapes the data off the NakedPunch site and prints to the terminal ‘%’-separated text with URL, Title, Author and Blurb columns. Most of the work is done by the BeautifulSoup library.


import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4 (the bs4 package)

# Base URL of the article listing pages
prefix = ''

# Header row for the '%'-separated output
print('url%title%author%blurb')

for page in range(1, 20):  # listing pages 1 to 19
    page_url = prefix + "?page=" + str(page)
    response = requests.get(page_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Each article sits in a <div class="article-summary">
    alldivs = soup.find_all('div', class_='article-summary')

    for div in alldivs:
        url = div.h4.a['href'].strip()
        title = div.h4.text.strip()
        author = div.div.text[3:].strip()  # remove leading 'by '
        blurb = '"' + div.p.text.strip() + '"'
        print("%".join([url, title, author, blurb]))

Each entry could then be tagged…