Text mining the Old Bailey online

The Old Bailey Online

Part two here

I have encountered the Old Bailey Online a number of times before, so I know my way around the site quite well. This week our focus was on exploring how the site's API could allow us to mine data from the site to use with other tools, in particular Voyant. I will briefly cover how this worked, with some screenshots, in this post, and will spend more time in a longer second post discussing the 'Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic' project at Utrecht University.

I was curious to indulge my interest in ‘witchcraft’, so initially searched for this as a keyword in the main search section:

[Screenshot: keyword search for 'witches']

Following this I turned to the Old Bailey API Demonstrator, which makes it possible to export results for use by other tools. I used this to export results to Voyant, discussed in last week's post. The data from a search for 'witches' as a keyword was exported to Voyant, and we can now use the tools available there to interrogate the data of the Old Bailey in new ways. This is one of the fantastic things about providing an API: it wasn't necessary for the Old Bailey Online to provide all of the tools of Voyant themselves. They could instead provide the underlying data in a way that allows other websites to make use of it. This avoids duplication of effort, but also opens up the possibility of your data being used in new ways which hadn't initially been envisaged.
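To give a rough picture of what 'providing the underlying data' means in practice, here is an invented example (the URL-free data shape below is hypothetical, not the actual Old Bailey API format): an API typically returns structured data such as JSON, which any other tool can parse without knowing anything about how the providing site is built.

```javascript
// Hypothetical example: a JSON response of trial records.
// The field names here are invented for illustration; the real
// Old Bailey API defines its own formats.
const responseBody = JSON.stringify({
  hits: 2,
  trials: [
    { id: "t17430114-1", keyword: "witches", year: 1743 },
    { id: "t16820906-3", keyword: "witches", year: 1682 }
  ]
});

// Any consumer (Voyant, a script, another website) can work with
// the data directly, independently of the original site's design.
function countHits(json) {
  const data = JSON.parse(json);
  return data.trials.length;
}

console.log(countHits(responseBody)); // 2
```

This separation is exactly what lets a site like Voyant build tools on top of someone else's data.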

[Screenshot: the output of the Old Bailey viewed in Voyant.]

The next post will look in more detail at one of the Utrecht Digital Humanities projects. The process of exporting the Old Bailey results to Voyant was very straightforward. The Old Bailey Online provides a link to Voyant, so it is clear that some thought has gone into combining these two tools. This is not always the case: sometimes data will be provided in ways which make it less easy to use. I will discuss this in more depth in part two, but there are some excellent discussions of ways of dealing with this on the Programming Historian, which is an all-round great resource that I'm aiming to discuss in a separate blog about 'geeky' stuff!

Part 2: The nuts and bolts (I think)

This is part two of three blogs on the use of Twitter for research. Part one is here.

How TAGS works (probably)

What I hope to do in this post is identify some of the moving parts that make TAGS work. I will not try to delve into the actual workings of the code used, as that is beyond my (hopefully only current) abilities. What I will try to do instead is identify the different components which interact with each other to make TAGS work. At the end of the post I will highlight some of the resources I intend to use to gain a better understanding of the inner workings of TAGS. With a bit of luck I will be in a position to attempt to build something simple myself using what I've learnt.

The Nuts and Bolts

An API

The Twitter API is what allows access to Twitter's data. Access to the API must be set up and authorised before we can think of developing methods of collecting tweets. Once that access is authorised, we can start using other software to make use of the data in Twitter.
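As a very rough sketch of what 'authorised access' means in practice, every request to the API carries credentials the service can check, and requests without them are refused. Everything below is a stand-in: `fakeTwitterApi` plays the role of Twitter's servers, and the header format is simplified.

```javascript
// Simplified sketch: a request must carry an authorisation token,
// or the service refuses it. fakeTwitterApi stands in for Twitter.
function fakeTwitterApi(request) {
  if (!request.headers || !request.headers.Authorization) {
    return { status: 401, body: "Twitter API Configuration Required" };
  }
  return { status: 200, body: ["a tweet about #openaccess", "another tweet"] };
}

// An unauthorised request is rejected...
const anonymous = fakeTwitterApi({ url: "/search?q=%23openaccess" });

// ...while one carrying a token succeeds.
const authorised = fakeTwitterApi({
  url: "/search?q=%23openaccess",
  headers: { Authorization: "Bearer <your-token-here>" }
});

console.log(anonymous.status, authorised.status); // 401 200
```

The 401 ('unauthorised') case is essentially what the real TAGS code is guarding against when it checks its configuration.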

Google Apps Scripts

Google Apps Script is used to get data from Twitter, through the API, into a Google spreadsheet. This script is what is going on behind the scenes when you 'Run TAGS'. It is these scripts which 'talk' to Twitter in order to take our search, for example my search for #openaccess, from my Google spreadsheet to Twitter and back again with the appropriate tweets in tow.
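A stripped-down sketch of that round trip might look like this. Both pieces are stand-ins I have invented for illustration: `fetchTweets` plays the role of the Twitter search call, and a plain array plays the role of the spreadsheet (the real TAGS code uses Google's UrlFetchApp and SpreadsheetApp services for these).

```javascript
// Stand-in for the Twitter search API: takes a query, returns tweets.
// (A real implementation would make an authorised HTTP request.)
function fetchTweets(query) {
  return [
    { user: "alice", text: "New paper out! #openaccess" },
    { user: "bob", text: "Reading about #openaccess mandates" }
  ];
}

// Stand-in for a Google spreadsheet: just an array of rows.
const sheet = [["user", "text"]]; // header row

// The script's whole job: run the search, append one row per tweet.
function runTags(query) {
  for (const tweet of fetchTweets(query)) {
    sheet.push([tweet.user, tweet.text]);
  }
}

runTags("#openaccess");
console.log(sheet.length); // 3 (header row plus two tweets)
```

Search term in, rows of tweets out: that is the shape of the whole thing, even though the real script handles authorisation, paging and rate limits on top.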

Some computer languages

Google Apps Script makes use of JavaScript, a common language used for web development. It is cloud based, and many scripts can already be found which can be adapted for different purposes.

HTML is used to make the TAGS website itself. HTML is a fairly accessible language, and since its primary purpose is formatting text for websites, the logic we need to approach it with isn't so unfamiliar to us.

Alongside these are a whole host of other languages running in the background. The purpose of this post isn't to be exhaustive, but to give myself a better understanding of some aspects of coding I may want to pursue.

Scary gobbledegook!

Before highlighting some of the resources available for learning more about this coding business, I decided to have a look at a little bit of code myself and see if I could make any sense of it. This is a bit of code which updates TAGS so it works with the new version of Twitter's API. You can have a look at the code on GitHub.

There is a lot I don’t understand about the inner workings and syntax of the code but hopefully I can point out some of the general principles. This is largely an exercise in trying to convince myself (and with a bit of luck you!) that this coding business is not beyond comprehension.

I have cut some snippets of the code and will summarise what I think each bit is roughly about. It is quite possible I am completely off the mark (feel free to tell me if this is the case).

function getTweets(searchTerm, maxResults, sinceid, languageCode) {
    //Based on Mikael Thuneberg getTweets - mod by mhawksey to convert to json
    // if you include setRowsData this can be used to output chosen entries

This is a description of what the purpose of the code is.

  var data = [];
  var idx = 0;
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sumSheet = ss.getSheetByName("Readme/Settings");
  if (isConfigured()) {
    var oauthConfig = UrlFetchApp.addOAuthService("twitter");
    oauthConfig.setAccessTokenUrl("https://api.twitter.com/oauth/access_token");
    oauthConfig.setRequestTokenUrl("https://api.twitter.com/oauth/request_token");
    oauthConfig.setAuthorizationUrl("https://api.twitter.com/oauth/authorize");
    oauthConfig.setConsumerKey(getConsumerKey());
    oauthConfig.setConsumerSecret(getConsumerSecret());
    var requestData = {
          "oAuthServiceName": "twitter",
          "oAuthUseToken": "always"
        };
  } else {
    Browser.msgBox("Twitter API Configuration Required")
  }

Code is often written with a similar basic logic (it does get more complicated, of course). This logic often goes along the lines of: try doing something; if that doesn't work, do this other thing. This section of code is, I think, trying to establish a link between the spreadsheet and Twitter's API. The 'if' is trying to set up the OAuth access token.[1] The 'else' tells the programme what to do if the API hasn't been authorised: it will display a message saying 'Twitter API Configuration Required'.
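That 'try something, otherwise do this other thing' shape turns up everywhere. Here it is in miniature, as a toy example I have made up (not the TAGS code itself):

```javascript
// The same if/else shape as the snippet above: check whether
// something is configured, and fall back to a message if not.
function getGreeting(config) {
  if (config && config.name) {
    return "Hello, " + config.name;
  } else {
    return "Configuration required";
  }
}

console.log(getGreeting({ name: "DITA" })); // "Hello, DITA"
console.log(getGreeting(null));             // "Configuration required"
```

Once you can spot this pattern, the OAuth block above reads as: if we are configured, set up the connection; else, show the configuration message.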

The rest of the code starts to make less intuitive sense to me. I will try to understand what this means at some point!

Resources

Here is a small list of resources which I intend to pursue. They are in a somewhat rough order, based on what I think will be both most accessible and most immediately useful to me (and maybe more generally for non-computer-expert librarians).

HTML and CSS

HTML is probably one of the most accessible languages to get a basic grip on. I already have some experience with HTML and LaTeX[2] which is fairly similar. I don’t have any experience with CSS but it is included with many of the HTML tutorials and since it is a big part of styling websites it makes sense to try to learn this alongside HTML.

  1. Codecademy
  2. W3schools
  3. http://learn.shayhowe.com/html-css/

There are plenty of other sites available and many free resources so it is probably best to just try them out and see what you like.

Google Apps Scripts/Javascript

Since TAGS works using Google Apps Script, I am intrigued to explore how this works. Google provides some introductions and tutorials on their site. Google Apps Script is based on JavaScript, another language heavily used on the web, so I think it makes sense to explore this a little bit.

  1. Google Apps Scripts
  2. Codecademy JavaScript

Ruby

Another language commonly used on the web. I have often heard it described as a very 'elegant' coding language. Might as well learn a bit of an 'elegant' language alongside a more 'ugly' language like JavaScript! There also seem to be lots of good resources for it, so it is definitely on my to-do list.

http://tryruby.org/
This is a fun website that gives you a chance to try some basic coding using Ruby.

  2. Try Ruby
  3. Introduction to Ruby comic!

There is quite a lot to get on with here. Whether I find time to really 'learn' any of these languages fully during the rest of the DITA module is doubtful, but I do hope to make a start and see how I get on. If anyone has any tips, feel free to drop a comment below!


[1] This was discussed briefly in the DITA lecture. It is essentially an open standard for ensuring secure authorisation of different applications. Their site is here: http://oauth.net/

[2] LaTeX is a 'document preparation system'. It can be used instead of a 'what you see is what you get' (WYSIWYG) programme like Word or LibreOffice to prepare text documents. Instead of formatting everything by hand, as you do with Word or LibreOffice, you indicate the structure with some syntax and LaTeX 'typesets' the document for you. It tends to produce much more attractive documents than WYSIWYG programmes, and is pretty easy to use once you have got the hang of the logic behind it.
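For instance, a minimal LaTeX document marks up the structure rather than the appearance; when compiled, LaTeX decides the fonts, spacing and numbering for you:

```latex
\documentclass{article}
\begin{document}
\section{Introduction}
Here only the \emph{structure} is marked up; LaTeX handles the layout.
\end{document}
```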

APIs part 1.

This week in DITA the topic was APIs. I had encountered the term before and had a pretty good idea of what they were.1 I hadn’t spent much time previously trying to use APIs, or at least not deliberately. It turns out that many of the pages I use on a regular basis make use of APIs. Before getting to that though it might be worth giving a brief overview of what an API is.

What is an API?

“In computer programming, an application programming interface (API) specifies a software component in terms of its operations, their inputs and outputs and underlying types. Its main purpose is to define a set of functionalities that are independent of their respective implementation, allowing both definition and implementation to vary without compromising each other.
In addition to accessing databases or computer hardware, such as hard disk drives or video cards, an API can be used to ease the work of programming graphical user interface components, to allow integration of new features into existing applications (a so-called “plug-in API”), or to share data between otherwise distinct applications. In practice, many times an API comes in the form of a library that includes specifications for routines, data structures, object classes, and variables. In some other cases, notably for SOAP and REST services, an API comes as just a specification of remote calls exposed to the API consumers.” Wikipedia

When I read that explanation before the DITA lecture and lab, it didn't do much to clarify what an API was. However, in the context of the lecture and the lab exercises it starts to make a lot more sense, in particular the purpose of 'defin[ing] a set of functionalities that are independent of their respective implementation'. An API allows some functionalities, i.e. features, to be used independently of their respective implementations. Different 'implementations', otherwise known as applications, can use the basic features of a website like Twitter in a new way.
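One way to picture 'functionality independent of implementation' is with an invented example like the one below: two completely different 'applications' rely only on a function's name, input and output, so its internals can be rewritten without breaking either of them.

```javascript
// The "API": a stable contract. Callers rely only on its name,
// input and output, never on how it works inside.
function searchTweets(query) {
  // This implementation could change completely (different storage,
  // different search algorithm) without affecting any caller below.
  const all = ["#dita lab notes", "cats", "#dita reading list"];
  return all.filter((t) => t.includes(query));
}

// Two different "applications" built on the same API:
const webClient = searchTweets("#dita").join(" | "); // displays results
const countBot = searchTweets("#dita").length;       // just counts them

console.log(webClient); // "#dita lab notes | #dita reading list"
console.log(countBot);  // 2
```

This is the sense in which Tweetdeck, Fenix and the Twitter website can all sit on top of the same underlying service.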

What are the benefits of APIs?

Rather than talk about the programming logic that makes APIs work, I thought I would try to come up with some of the obvious ways in which an API is useful from the user's perspective.

  • I can use a service like Twitter through different applications that may have an interface which, for me, is more intuitive than the web version of Twitter, or has additional features which make it more efficient to use for particular activities.

  • By providing APIs, projects which have limited resources or time can allow other developers to open up access to a service through a different application. An example of this is Scholarley, an unofficial Android client for Mendeley.

  • A service like Mendeley can be used in many different ways, by many different people. Mendeley provides extensive documentation for its API here. In this way there is some overlap between what APIs and open source software allow, although there are major differences between the two, something I will try to explore in a later post.

I will continue exploring APIs in part 2 of this post.


  1. My understanding was something along the lines of: an API is a way in which a website like Twitter can be used by an application like Tweetdeck or Fenix, allowing you to interact with the Twitter service through an app. I don't think this understanding was too far off, but the lecture and lab definitely clarified how this works.