

openliterature - March 15, 2010 in Musings, Technical, Texts

There’s a famous line in Hamlet: “O that this too too solid flesh would melt” (1.ii.129). Not only is it the start of an agonised soliloquy in which Hamlet tortures himself over his mother’s apparent desire for her dead husband’s brother, but it is also a line over which many generations of scholars have wrangled. You see, there are several different editions of Hamlet: a first quarto printed in 1603, and then another in 1604, before the folio edition appeared in 1623. The quartos (so named for being the size of a quarter of a sheet of paper) would normally be used for any critical text because they are the earliest. Unfortunately, the quartos for Hamlet are so corrupt that they can’t really be trusted. Nevertheless, they may still contain passages that are more correct than the folio, composed after Shakespeare’s death, could ever be.

To return to that line of Hamlet: the folio has ‘solid flesh’, but the first quarto has ‘sallied flesh’, and the second quarto has either ‘sallied’ or ‘sullied’. Each variant changes the way we see Hamlet.

But what does this have to do with Open Shakespeare? Well, this little example shows how important it is to have a reliable text for each play, especially now that we will be annotating them and one day producing critical editions from them. Currently, we have the Gutenberg text of the first folio although, like many other ‘first folio’ texts, it is actually a hodgepodge of several folios recomposed sometime in the 18th century. We also have the Moby Shakespeare, so called for the man who produced the most widely circulated digital version of Shakespeare’s plays – but without saying what edition he used…

Having consulted a few professors here in Cambridge (credit where it’s due: the info about composite folios comes from Prof. Kerrigan), it appears that there is a first folio actually in Cambridge. If we could find a way of digitising it, this would be a great benefit to Open Shakespeare, establishing, if not a ‘perfect’ text (which, once the Globe burnt down during a performance of Henry VIII and took Shakespeare’s own playtexts with it, could never now be possible), at least one with some historical authority.

I have no idea how we will digitise the Cambridge folio, so any suggestions would be welcome. I heard once that a young Arthur Miller, in order to hone his play-writing skills, copied out almost all of Shakespeare’s plays by hand. So, if you’re an aspiring playwright with lots of time on your hands, do get in touch.

XML and the Natural Language Toolkit

adalovelace - February 26, 2010 in Technical, Texts

I’ve been playing with the NLTK (Natural Language Toolkit) and the really useful Jon Bosak XML-annotated corpus these days, and these are some of the graphs I’ve been able to produce after analysing the speech of the main characters of the play (characters that speak more than 100 lines):

exclamations and interrogations

Here we can see that Macduff is exclaiming a lot, and that when most of the characters talk it is rarely to question, but to assert… Poor Macbeth and Lady Macduff question everything, while Lady Macbeth questions almost as much as she asserts.
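As a sketch of how counts like these can be pulled out of the Bosak XML (speeches as SPEECH elements with SPEAKER and LINE children), here is a minimal Python example. The inline sample and the function name are illustrative only — in practice you would parse the full macbeth.xml from the corpus:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A tiny inline sample in the Bosak XML format (SPEECH/SPEAKER/LINE);
# in practice you would load and parse the whole macbeth.xml file.
SAMPLE = """<PLAY>
<SPEECH><SPEAKER>MACDUFF</SPEAKER>
<LINE>O horror, horror, horror!</LINE></SPEECH>
<SPEECH><SPEAKER>MACBETH</SPEAKER>
<LINE>Is this a dagger which I see before me?</LINE></SPEECH>
</PLAY>"""

def punctuation_counts(xml_text):
    """Count exclamation and question marks per speaker."""
    root = ET.fromstring(xml_text)
    exclaims, questions = Counter(), Counter()
    for speech in root.iter('SPEECH'):
        speaker = speech.findtext('SPEAKER')
        for line in speech.iter('LINE'):
            text = line.text or ''
            exclaims[speaker] += text.count('!')
            questions[speaker] += text.count('?')
    return exclaims, questions

exclaims, questions = punctuation_counts(SAMPLE)
```

The resulting Counters can then be fed straight into a plotting library to produce graphs like the one above.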

As for the amount of words in the play, Macbeth is by far the one who talks the most:

amount of words spoken by main characters

But what about lexical variety? In this next graph, we can see the variety of the words:

Macbeth - lexical variety

Here we can see the variety of each character’s speech.

The brownish words are said just once per character. The light greens are words that are repeated within their speech, and the dark greens are repetitions of the light green words. I still need to take more measurements to see if this is actually the way everybody speaks: by repeating a lot of small words, with just some new words once in a while. (There are more words that appear just once than words you will repeat through most of your speech! Think about it!)
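The measurements above can be sketched in a few lines of Python. This is an illustrative helper (the function name and tokenisation regex are my own, not from the analysis above) that counts total tokens, distinct words, and hapaxes — words said just once:

```python
import re
from collections import Counter

def lexical_variety(text):
    """Measure how much of a speech consists of words said just once
    (hapaxes) versus repeated words."""
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    hapaxes = sum(1 for count in freq.values() if count == 1)
    return {
        'total': len(words),          # all tokens spoken
        'distinct': len(freq),        # vocabulary size
        'hapaxes': hapaxes,           # words occurring exactly once
        'type_token_ratio': len(freq) / len(words),
    }

stats = lexical_variety("to be or not to be that is the question")
```

Run per character over the LINE elements of the corpus, numbers like these would give the brown/green proportions in the graph.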

OCRing Shakespeare Entry from Encyclopaedia Britannica 11th Edition

Open Knowledge Foundation - August 14, 2007 in Technical, Texts

One of the next things we want to do for Open Shakespeare is provide an open introduction to his works. The obvious idea for this was to use the Shakespeare entry in the 11th edition of the Encyclopaedia Britannica, as detailed in this ticket:

We’ve now written code to grab the relevant tiffs off wikimedia:

You can also find them online (28 pages) starting at:

The next step is to OCR this stuff (after that we can move on to proofing, whether by ourselves or otherwise). When we first had a stab at this back in April we tried using gocr. Unfortunately the results were so bad that they were unusable. Recently an old OCR engine of HP’s has been released as open source under the name of tesseract:

We’re going to have a go using this — though if there is anyone out there with access to an alternative system we’d love to hear about it.
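A hypothetical sketch of driving tesseract over the scanned pages from Python — the filenames are made up, and the actual invocation is left behind a flag since it requires tesseract to be installed (its basic CLI takes an image and an output base name, writing the text to `<base>.txt`):

```python
import pathlib
import subprocess

def ocr_command(tiff_path):
    """Build the basic tesseract invocation: tesseract <image> <output-base>.
    Tesseract writes its recognised text to <output-base>.txt."""
    out_base = pathlib.Path(tiff_path).with_suffix('')
    return ['tesseract', str(tiff_path), str(out_base)]

def ocr_pages(pages, run=False):
    """Prepare (and optionally execute) an OCR command per page image."""
    cmds = [ocr_command(p) for p in pages]
    if run:  # only execute when tesseract is actually available
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds

cmds = ocr_pages(['page01.tif', 'page02.tif'])
```

Batch-processing all 28 pages this way would leave one text file per page, ready for proofing.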

Annotation is Working!

Open Knowledge Foundation - April 10, 2007 in News, Technical

After another push over the last few days I’ve got the web annotation system for Open Shakespeare operational (we’ve been hacking on this on and off since back in December).

To see the system in action visit:

Quite a bit of effort has been made to decouple the annotation system from Open Shakespeare so that it can be easily reused elsewhere. You can find the code for the annotation system (nicknamed annotater) here:

There are still some substantial issues with the Open Shakespeare implementation, the most obvious of which are:

a) large texts bring the javascript to its knees (The Phoenix and the Turtle is the shortest of Shakespeare’s works, which is why I’m using it).

b) security/user authentication for annotation adding/editing/deleting

But the basic system is working.

Improvements to the Concordance

Open Knowledge Foundation - January 3, 2007 in Technical

One of the main items scheduled for v0.4 of Open Shakespeare is improving the responsiveness of the concordance. Using the v0.3 codebase, with just the sonnets as test material, loading up the list of words for the concordance alone took around 24s on my laptop. This is because even with a single text there are already over 18,000 items in the concordance, and we were having to read through all of these to generate the list of words. Some recent commits (e.g. r:72) have gone some way to improving this responsiveness (loading the word list now takes 3s compared to 24s) but the result is not entirely satisfactory (printing full statistics takes 13s compared to 40s previously). One obvious way to go further is to use caching — either of individual web pages or of particular key parts such as all the distinct words occurring in the concordance (caching works because the concordance only changes when new texts are added, which will usually happen only once — when the system is first initialised).
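The caching idea can be sketched as follows — this is an illustrative class of my own, not the actual Open Shakespeare code, but it shows why the invalidation is cheap: the cache is only dropped in `add_text`, which runs rarely, while `word_list` is hit on every request:

```python
class Concordance:
    """Toy concordance demonstrating cached distinct-word lists."""

    def __init__(self):
        self.entries = []        # (word, text_id, position) tuples
        self._word_cache = None  # cached sorted list of distinct words

    def add_text(self, words, text_id):
        """Add a text's words; invalidate the cache (texts change rarely)."""
        for pos, word in enumerate(words):
            self.entries.append((word, text_id, pos))
        self._word_cache = None

    def word_list(self):
        """Scanning 18,000+ entries per request was the slow path;
        after the first call this is a cheap cache hit."""
        if self._word_cache is None:
            self._word_cache = sorted({w for w, _, _ in self.entries})
        return self._word_cache
```

The same pattern extends to caching whole rendered pages, with the page cache flushed at the same point.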

Relatedly, r:74 is a first step on filtering the concordance — in this case to exclude roman numerals and various non-words. Doing this made me think about whether the concordance should be storing actual words or just stems — for example, it does not seem to make much sense to have different entries for kill, kills, killed etc. Using a stemming algorithm such as the Porter stemmer (which I notice has a nice Python implementation directly available) we can easily stem each word as we go along. This would have several benefits, one of the most prominent being a dramatic reduction in the basic dictionary size (i.e. the number of distinct words in the concordance).
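As a sketch of the stemming idea, here is how inflected forms collapse onto one entry using NLTK’s Porter stemmer implementation (one available option; the grouping function is illustrative, not the actual concordance code):

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stemmed_concordance(words):
    """Group word occurrences under their Porter stem, so that
    'kill', 'kills' and 'killed' share a single entry."""
    entries = defaultdict(list)
    for pos, word in enumerate(words):
        entries[stemmer.stem(word.lower())].append((word, pos))
    return entries

entries = stemmed_concordance(['kill', 'kills', 'killed'])
```

Keeping the original surface form alongside each position means the display can still show the word as it appears in the text, while lookups and the word list use the stem.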

Adding Web-Based Annotation Support

Open Knowledge Foundation - December 18, 2006 in Technical

We intend to add annotation/commentary support to the Open Shakespeare web demo either in this release or the next. As a first step we’ve been looking to see what (open-source) web-based annotation systems are already out there. Below is our list of what we’ve been able to find so far (if you know of more please post a comment). After examining several of these in some detail, the one we’re going to try out properly is marginalia (if you’re interested, our current efforts to do this, including writing a Python WSGI annotation service backend, can be found here in the subversion repository).

  1. stet: javascript annotation system used for gpl v3 comments system

  2. commentary: javascript based wsgi middleware developed by ian bicking

    • Rather hacked together (apparently he coded it in a week). Had problems getting it working locally and no documentation to help in adaptation. Seems to be unmaintained (demo site is currently down) which is perhaps not surprising given how many other projects Ian has on the go.
    • One nice feature is that you don’t seem to have to mess with the underlying web pages you want to add comments to (this only works if you are sitting on top of another wsgi application)
  3. marginalia: javascript library and spec for adding web annotation to pages

  4. annotea: W3C project based on RDF

    • Been around a long time and now seems to be inactive
    • Server and client support rather lacking. No simple interface based on, e.g., javascript — you have to write a special client yourself — which is a major drawback
    • That said the protocol is well-documented and so writing a client (or a server) shouldn’t be that hard (other than having to mess around with rdf in javascript …)
    • The Schema seems reasonable
    • XPointer based, which according to the marginalia site is a problem
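To give a feel for what the Python WSGI annotation service backend mentioned above might look like, here is a hypothetical minimal sketch (in-memory store, names my own — a real backend would persist annotations to a database and add authentication):

```python
import json

# In-memory store; a real backend would persist to a database.
ANNOTATIONS = []

def annotation_app(environ, start_response):
    """Minimal WSGI app: POST stores a JSON annotation, GET lists them all."""
    if environ['REQUEST_METHOD'] == 'POST':
        length = int(environ.get('CONTENT_LENGTH') or 0)
        ANNOTATIONS.append(json.loads(environ['wsgi.input'].read(length)))
        start_response('201 Created', [('Content-Type', 'application/json')])
        return [b'{"status": "created"}']
    start_response('200 OK', [('Content-Type', 'application/json')])
    return [json.dumps(ANNOTATIONS).encode('utf-8')]
```

Because it is plain WSGI, the app can be served for testing with the standard library’s `wsgiref.simple_server`, or mounted behind any WSGI-aware server, which is what makes the middleware-style systems above (like commentary) attractive.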