UTS Library

A Day Trip to Google Refine Country

Recently I attended a conference about using technology for the humanities (THATCamp) in Canberra. Among the many distinguished and switched-on librarians, educators, curators and coders there was Intersect’s Luc Small, who together with Richard Lehane from the State Archives NSW presented a session on Google Refine. I’d seen a short movie from Google about their new Refine product, and it seemed to be a nice enough web facsimile of Excel, in much the same way that Google Docs mimics Microsoft Word. In the video I saw how Google Refine could search a database for similar-sounding words and standardize their spelling for better results. It was obviously a useful tool, but I wondered what it would bring to the table to really help researchers, especially in the humanities. Enter Richard and Luc. They opened my eyes to many things in the hour-or-so session, perhaps more than my mind could easily grasp in one sitting. Despite the technical complexity of the techniques they were using, the results were undeniably exciting and promising, and I thought I’d share them here.
We began by importing a text file containing the addresses of various NSW police stations into Google Refine, then went about searching for misspelled area names and duplicate addresses. Refine has a couple of ways of doing this: it can group words that are spelled similarly (key collision) and words that sound similar (metaphone). Then they started playing with APIs.
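To get my own head around the key collision idea, here is a minimal Python sketch of it. It is my approximation rather than Refine's actual code: each value is boiled down to a simplified key (lowercased, punctuation removed, words sorted), and values that end up with the same key are grouped together. The station names and the function names `fingerprint` and `cluster_by_key` are invented for illustration.

```python
import string
from collections import defaultdict


def fingerprint(text):
    """Build a simplified key: lowercase, no punctuation, unique words sorted."""
    no_punct = str.maketrans("", "", string.punctuation)
    cleaned = text.strip().lower().translate(no_punct)
    return " ".join(sorted(set(cleaned.split())))


def cluster_by_key(values):
    """Group values whose keys collide; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical area names, invented for this example.
stations = ["Surry Hills", "surry hills.", "Hills, Surry", "Broken Hill", "Newtown"]
clusters = cluster_by_key(stations)
print(clusters)
```

A key like this only catches differences in case, punctuation and word order. Grouping words that merely sound alike needs a phonetic key instead, which is where a method such as metaphone comes in.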
If you don’t know what an API is, don’t worry, neither did I. API stands for ‘application programming interface’: an interface between different software programs, used in this case to assist with the retrieval of information. Luc said that Refine allows you to use APIs without much knowledge of coding language (JavaScript in this case), and where coding is involved Refine will at least tell you whether the syntax of your commands is correct or not.
First we visited Google for their geolocation API. The results that were imported into the database were at first indecipherable, but using some coding wizardry they were broken into two separate bits of information, latitude and longitude, for all of the police stations in the list. Then we used a ‘scatterplot’ facet within Google Refine to render these co-ordinates as a graph. What appeared was a small map of NSW with dots representing various police stations. Impressively, you could actually draw a square with your mouse around various dots on the graph and Refine would then display the police stations within the square in the database.
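The two steps above can be sketched in Python, with some assumptions. I am assuming the raw result is JSON shaped like the responses from Google's geocoding service (results, then geometry, then location); the sample response, the station names and the co-ordinates below are all invented. The `in_box` helper is my own stand-in for drawing a square on the scatterplot, not how Refine does it internally.

```python
import json

# Hypothetical, trimmed-down response for one address (values invented).
raw = '{"results": [{"geometry": {"location": {"lat": -33.88, "lng": 151.21}}}]}'


def extract_lat_lng(response_text):
    """Pull latitude and longitude out of the first result in the response."""
    location = json.loads(response_text)["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def in_box(point, south, west, north, east):
    """Return True if a (lat, lng) point lies inside the rectangle."""
    lat, lng = point
    return south <= lat <= north and west <= lng <= east


# Hypothetical stations: one parsed from the response, one typed in by hand.
stations = {
    "Station A": extract_lat_lng(raw),
    "Station B": (-31.95, 141.46),
}

# Select only the stations that fall inside the drawn square.
selected = [name for name, point in stations.items()
            if in_box(point, -34.0, 151.0, -33.8, 151.3)]
print(selected)
```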

Scatterplot graph drawn by Google Refine, showing NSW police station locations:

We then called another API, this time from the State Records of NSW, and added its results to the Refine database. This added extra information to our spreadsheet: State Records titles and URLs for the various police stations in our list (where available). We even managed to dig up historical notes and date ranges for the police stations from the State Archives.
Finally, the database was exported to CSV (an Excel-compatible format).

Google Refine in all its (blurry) glory, chock full of API-sourced goodies.

This was a wonderfully impressive demonstration; however, it did require some coding ability in JavaScript. If you are doing research using public information and would like to experiment with the potential of Refine, but are as perplexed as I am by the expression parseJson(value).agency.End_date.substring(0,4), then you might want to talk to someone with the necessary know-how. One such man is of course Dr Luc Small, who works as a Research Analyst for Intersect and is based at UTS. His contact details are luc.small@intersect.org.au
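For anyone equally puzzled by that expression, here is my best reading of it as a rough Python equivalent: treat the cell as JSON, look up `agency`, then `End_date`, and keep the first four characters, which is presumably the year. The sample record below is invented, and `end_year` is just a name I made up; the real State Records response will carry many more fields.

```python
import json


def end_year(cell_value):
    """Mirror parseJson(value).agency.End_date.substring(0,4)."""
    record = json.loads(cell_value)
    return record["agency"]["End_date"][:4]


# Hypothetical cell content, invented for illustration.
sample = '{"agency": {"End_date": "1975-06-30"}}'
print(end_year(sample))
```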
I have a lot to think about after THATCamp, and many new techniques and tools to investigate. I’d never really thought about coding as a means to the end of research before, and whilst it’s daunting, I can’t help but think it will play an increasing role in high-level research. So I guess writing this blog is only the start of my homework.