Bringing open tools to public-domain literature

Open Shakespeare at OKCon 2011

July 3, 2011 in Musings, News, Shakespeare, Technical

OKCon 2011, at the Kalkscheune buildings in Berlin, was fantastic, and I thought it would be a good idea to publish a few reflections on some of what went on there, both for the benefit of those who neither made it nor watched the live feeds, and for the chance it offers of mapping Open Shakespeare’s position in the wider Open Knowledge community.

Rufus Pollock provided the opening address, pointing out how the convergence of greater data availability and advanced computing power had created the perfect conditions for openness to flourish. He announced one such flourishing in the form of datacatalogs.org, which came online at the start of the conference. He went on to argue that the focus of activities in the community was moving from making data accessible to providing tools for, and building communities around, that data. Of course, the quantity problem is only half solved (a later speaker pointed out the small quantities of open government data in Asia, for example), but the community was still at a point where data cycles (ecosystems of community, tools and data) could be founded. This last point fits neatly with Open Shakespeare, since the project is slowly forming just such a cycle: early editions of Shakespeare’s plays are open data, and a small community is either building tools (like the annotator) or using them to create more content about Shakespeare’s works, which in turn offers new programming challenges and so completes the circle.

Glyn Moody’s keynote talk, immediately following Rufus’, approached the topic of Open Knowledge from a different angle, by analysing the current situation in terms of a new abundance which placed pressure on systems, such as the UK’s copyright law, designed for eighteenth-century conditions of scarcity. Although Moody did not mention it, Shakespeare himself was something of a forerunner in this domain: the “fourteen years plus fourteen more” model of copyright established in 1710 was the result of bookseller lobbying, not least that of Jacob Tonson, eager to protect his monopoly on the works of Shakespeare and others (notably Milton, and Dryden’s translations of Virgil). Having sketched out his model of abundance and scarcity, Moody concluded with the provocative question of how open projects would function without copyright, pointing out that many in fact depend upon restrictive legislation as their raison d’être. The only answer that I can give is that open projects would perhaps continue as the first models of communities where exchange and collaboration are well established (as in Open Shakespeare); that is to say, they would continue as those “data cycles” and “ecosystems” that Pollock had described as the successors to the victories of open data availability.

Later on in the conference, in the second track of talks, a panel on ‘Data Journalism: What Next?’ provided considerable food for thought on the topic of communities, much of it served up by the Guardian’s Simon Rogers. It was he, for example, who questioned the merits of crowd-sourcing, arguing that it did not provide objective data, since its contributors could be extremely biased: an MP participating, for instance, in the crowd-sourced analysis of his own expenses. This point was backed up by Stefan Candea, with both men emphasising the important labour that remained for the journalist when it came to looking over crowd-sourced responses and shaping them into a story. A neat example of this was the Guardian’s exploration of Sarah Palin’s emails, where users were directed to a random email and then asked to signal anything of interest. Although not flawless (one imagines a Palin aide slaving away to hide significant correspondence), its randomness nevertheless provided an even coverage of the files. This randomness might be an important tool for Open Shakespeare’s own crowd-sourcing of annotations, as a way of directing users to annotate less-appreciated works. As regards the verifiability of these annotations, Open Shakespeare has the problematic luxury of considering subjective opinion on the Bard’s art as valid as objective facts about it, since these opinions map the contours of contemporary attitudes to Shakespeare. Further, the intense subjectivity of responses to art means that such subjective annotations do not suffer from the problem of verifiability, because no such critical response has ever been verifiable (for those interested, this line of argument is behind Kant’s description of “universal subjective validity” in his Critique of the Power of Judgment).
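For the technically minded, here is a rough Python sketch of how randomness could steer annotators towards neglected works. It is only an illustration and not anything the site currently does: the function name `pick_work`, the annotation counts, and the weighting rule (one over one plus the number of existing annotations) are all assumptions of mine. The Guardian's approach, as described at the panel, was a plain random pick; the weighting is an extra step added here to favour less-annotated plays.

```python
import random

def pick_work(annotation_counts, rng=random):
    """Pick a work at random, favouring works with fewer annotations.

    annotation_counts maps a title to its current number of annotations
    (hypothetical data, not the site's real schema).
    """
    titles = list(annotation_counts)
    # Weight each work by 1 / (1 + count): an unannotated work gets
    # weight 1, a heavily annotated one gets a weight close to 0.
    weights = [1.0 / (1 + annotation_counts[title]) for title in titles]
    return rng.choices(titles, weights=weights, k=1)[0]

# Made-up counts, purely for illustration.
counts = {"Hamlet": 250, "Cymbeline": 3, "Pericles": 0}
rng = random.Random(2011)  # seeded so the run is repeatable
picks = [pick_work(counts, rng) for _ in range(1000)]
```

Every work keeps a non-zero chance of being chosen, so popular plays are not shut out; they simply come up less often.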

It is on this idea of subjective annotation, the generation of subjective data, that I would like to bring this summary to a close. The conference was on Open Knowledge, but it is significant that I found the adjective to have been discussed far more often than the noun. Open Shakespeare’s annotation system, the tool that generates its data cycle, provides both verifiable information (“mirth in funeral” is an example of “synoeciosis” in Hamlet) and subjective opinion (“Words, words, words” is, for one user, “one of the most human lines in the play”). Is the second still data? I would argue that it is, but it is of a kind rarely discussed in Berlin. After all, what are we to do with it in order to integrate it back into the system of open data? Such opinion does not atomise easily, just as Shakespeare’s own words resist, with their context and their double meanings, computerised analysis. We can count the instances of the word “prune”, but it takes an article on the subject to bring out the humour from the information generated by the open-source tool. That article itself is data and can be itself the launch pad for new responses, but it moves the axis of the cycle away from developers’ tools and their data and towards the perspective of the user and, more broadly, that of the community. Rufus Pollock was right to argue for the existence of ecosystems of open data, but the case of Open Shakespeare shows that they can only be fully functional if all three elements are given their full weight: tools, data, and users together.

How to Participate in the Annotation Sprint

February 5, 2011 in Community, Publicity, Technical

The votes are in! We are annotating Hamlet

Until 11:30am you can: Vote for the play to be annotated

Any feedback, or thoughts? Use the etherpad to leave your thoughts about the event.

How to Participate

Step 0: Check your browser

To participate in the annotation sprint, you will need a recent version of Firefox or Chrome or Safari.

Step One: Log in to Open Shakespeare [optional]

[optional]: you don’t need to log in, but if you don’t, your contributions will be anonymous.

To log in you’ll need to obtain an OpenID if you don’t have one. Here’s how:

  1. Visit https://www.myopenid.com/

  2. Click on the button ‘Sign up for an OpenID’

  3. Follow their instructions to create an OpenID by which you will be known when annotating

Now you’ve got an OpenID you can log in:

  1. Go to our login page

  2. Click on the ‘OpenID’ button

  3. Copy and paste, or type out your OpenID, which looks like a web address

Step Two: Start Annotating!

  1. Go to our works page and click on ‘annotate’ beneath the chosen play

  2. All the instructions are written on the side of the page in the ‘Annotation: Howto’ column

Online Editions of Shakespeare

January 15, 2011 in Community, Musings, Technical, Texts

The story of Shakespeare on the internet is a tangled tale, and this post is an attempt to unravel it. In expounding the advantages and shortcomings of online editions, I hope also to explain a few of the problems Open Shakespeare faces.

Editions Used by Open Shakespeare

Every work on the Open Shakespeare website has three possible texts, and it is worth explaining their provenance here in detail:

GUTENBERG FOLIO – These are drawn from Project Gutenberg, with the editorial prefaces removed. Nothing else has been changed. The Gutenberg scanner claims that the text “is as close as I can come in ASCII to the printed text”; however, it is important to record here several features of his methodology.
- Some spelling “mistakes” have been corrected according to a dictionary created from the spellings of the Geneva Bible and Shakespeare’s First Folio.
- Typos and abbreviations have also been “corrected”
- “Elongated S’s have been changed to small s’s and the conjoined ae have been changed to ae.”
- The actual text itself is composite, made from “30 different First Folio editions’ best pages”

GUTENBERG – Again taken from Project Gutenberg, this time from a more fully edited edition, with a cleaner layout, and the inclusion of 18th century stage directions. Open Shakespeare, as is usual for us, has removed all the prefatory material but kept the edited text as is. Unfortunately, nothing is disclosed about the process of editing or the source texts used except for the single phrase “This etext was prepared by the PG Shakespeare Team, a team of about twenty Project Gutenberg volunteers.”

MOBY – This text comes from the most widely available online edition of Shakespeare, of whose advantages and shortcomings there is a useful summary on the Open Source Shakespeare website.

Other Online Editions: ISE and Wordhoard

ISE

The principal website for online editions of Shakespeare is ISE (Internet Shakespeare Editions), where the following are offered, taking their entry for Hamlet as an example:

TEXT EDITIONS – These cover modern spelling and unmodified spelling versions based on the first folio and quarto 1 and 2, all of which have been edited. In the case of Hamlet this editing has been done by David Bevington, a scholar of some note. For other editions, the editors are less well known, and in many cases there has not yet been a peer review.

FACSIMILES – This is perhaps the real strength of ISE: several different First Folios have been scanned, and the results are very impressive. They also have facsimiles of the 1603 and 1604 quartos of Hamlet.

ANNOTATED EDITIONS – One of these does not yet exist for Hamlet, but David Bevington has again produced a useful peer-reviewed edition of As You Like It, on which one can toggle his annotations and record of collations.

COPYRIGHT – Everything on the ISE is under a variety of copyrights. The copyright for the edited texts is owned by the editor, and the images that make up the facsimiles have a rather ambiguous copyright situation, depending on their source. Although ISE state that “All items published on the site of the Internet Shakespeare Editions…may in all cases…be used for educational, non-profit purposes”, quite where an Open License website like our own fits in is deeply ambiguous, since material published on our website could feasibly be used for commercial purposes.

Wordhoard

Provided by Northwestern University, this website provides a set of texts worthy to serve as definitive online editions of Shakespeare. Along with other authors’ works, one can download two versions of Shakespeare’s writings: one encoded in TEI, the other linguistically annotated – which is to say every word in the text is associated with a lemma and part of speech.

For me, the most exciting part of this project is the way in which these lemmatized texts can be manipulated. Northwestern University gives one example: a short program written to answer the question ‘Does Shakespeare use mostly the same vocabulary in each of his works, or does he use different vocabulary?’. I recommend visiting the website for the answer, and for a wealth of other little bits of information about Shakespeare’s vocabulary.

The copyright position of the Wordhoard project is complicated. However, the website’s stance is far more ‘open’ than that of the ISE, so collaboration between Wordhoard and Open Shakespeare may be a possibility in the future.

Shakespeare Quarterly part II

April 6, 2010 in Community, Musings, Publicity, Technical, Texts

Here, for those interested, is my response to Professor Andrew Murphy’s article in the Shakespeare Quarterly:

“I am a member of the Open Shakespeare Project (www.openshakespeare.org – not to be confused with Open Source Shakespeare) and found this article extremely interesting. I feel that your conclusion points towards many of the approaches to Shakespeare that our project incorporates, and that are part of a more ‘social’ approach to Shakespeare.

It occurs to me that as well as spreading Shakespeare to a far larger audience, cheap editions of Shakespeare are also a godsend for students, who may write their thoughts all over their pages without fear of ruining something expensive. If all these scribbles were collected, a formidable body of knowledge of Shakespeare would be available, as would an evolving record of responses to this writer.

Our site has recently acquired the ability for anyone to annotate Shakespeare’s works, and soon will add the capacity to attribute, tag, sort, and hide the annotations made. With this we hope to create an ‘open’ edition of Shakespeare’s plays that would grow along similar lines to Wikipedia, harnessing the power of the internet to bring many minds to bear upon a single subject.

Such problems as found with the OSS still pose difficulties for us: we have to use Moby as a source text since all others, including (lamentably) the wordhoard text, are under copyrights that conflict with our Open license. Nevertheless, just as textual problems are flagged up in a critical edition with a footnote, so too could such problems be drawn to the reader’s attention through annotation. As Whitney Trettien’s article points out, the web comes into its own when it is an ‘expressive medium’ itself, and not one which, like the OSS, unthinkingly delivers content.

Essentially, ISE already has this kind of thinking process, displaying an editor’s annotation on each text right down to the textual variants. It even has the ability to sort such annotations. However, the problems you identify – different kinds of editing, slow progress, uneven quality – all inevitably result, I feel, from the fact that each text only has a single editor. More editors would speed progress but it is not, of course, a given that more editors would improve quality. Wikipedia is still notorious for its occasional inaccuracies.

Nevertheless, such inaccuracies can be resolved by the same process that generates them. If anyone can annotate, so anyone can also review annotation and improve it. I realise that this is a rather utopian position and that people can as easily vandalise as beautify, but I feel it to be a more tenable one than that held by the websites here. The internet allows for unprecedented levels of input as well as appreciation, and such potential is not exploited by the sites reviewed in this article.

Talking of input and appreciation brings me to one further aspect of these sites that interests me, namely how easily one can print from them. The OSS shines in this respect, but attempting to print an ISE facsimile is rather more difficult. I must also admit that printing from an annotated text at The Open Shakespeare Project is currently impossible: the tool only went live fairly recently, and the site is still very much under construction. One day we hope to harness the accumulated and peer-reviewed annotations of many to produce a printed text, and thus complete a cycle between internet and ‘real world’ Shakespeare.

Such a cycle is ignored at the peril of digital scholarship, for it is the mix of real events and online responses to them that makes Facebook so addictive. Other addictive qualities, such as the relatively small time commitment and the chance to interact with other users, could be profitably replicated by internet Shakespeare projects. After all, anything capable of sustaining those involved in the long task of making productive use of Shakespeare is always welcome and need not be to the detriment of academic rigour.”

Here is the author’s reply:

James: thanks very much for this thoughtful and very interesting response to the review. I’ve had a quick look at your site and think it’s very interesting. It seems to me that you really are pushing forward with a Web 2.0 approach to things, making your site a good deal more interactive than the three I review here. I like the idea of building up a ‘database’ of annotations, and you’re right, of course: textual annotation might be a way round the problems of having to use an outdated source text. I still tend to worry about Wikipedia as a model, however. I always like to tell my students stories of humorous examples of deliberate tampering with Wikipedia, as a way of warning them off using it in their research (perhaps you may know what happened to Thierry Henry’s page, after France put Ireland out of the World Cup?). Will OSP be entirely ‘user governed’, or will you have some sort of ‘top down’ quality control mechanisms? Andy

The discussion raises some interesting issues. How bitesize and user friendly is our website? To what extent should ‘Open Shakespeare’ be user-governed? Any comments and suggestions you may have will be very welcome.

Annotation is here!

March 16, 2010 in Community, News, Releases, Technical, Texts

The fabled ability to annotate any text of Shakespeare is now part of the Open Shakespeare website! Massive thanks to Nick for all his work on something far too complex for me to even describe its complexity (apparently there were difficulties with there being ‘no TextRange in the DOM’).

Here’s how to get annotating:

  1. Click ‘read texts’ on the homepage.
  2. Scroll down to find your play of choice in the list and click on ‘annotate’.
  3. Find the line you wish to annotate, then highlight it, then click on the little notepad that appears.
  4. In the newly-present dialogue box, type your words of wisdom.
  5. Press enter to save your annotation and close the dialogue box.

Work has already begun on Hamlet, but feel free to annotate wherever you wish.

As to what you should write in an annotation, we currently have no guidelines: shorter is usually better, and, obviously, offensive comments will be removed – but apart from that, all insights and explications are very welcome.

Improvements to come include: restricting editing and deletion to the owner of each annotation, showing user information on annotations, the ability to filter annotations, and the capacity to use markdown in each comment.

Editions

March 15, 2010 in Musings, Technical, Texts

There’s a famous line in Hamlet: “O that this too too solid flesh would melt” (1.ii.129). Not only is it the start of an agonised soliloquy in which Hamlet tortures himself over his mother’s apparent desire for her dead husband’s brother, but it is also a line over which many generations of scholars have wrangled. You see, there are several different editions of Hamlet: a first quarto printed in 1603, and then another in 1604, before the folio edition appeared in 1623. The quartos (so named for being the size of a quarter of a sheet of paper) would normally be used for any critical text because they are the earliest. Unfortunately, the quartos for Hamlet are so corrupt that they can’t really be trusted. Nevertheless…they still might contain passages that are more correct than the folio, composed after Shakespeare’s death, ever could be.

To return to that line of Hamlet: the folio has ‘solid flesh’, but the first quarto has ‘sallied flesh’, and the second quarto has either ‘sallied’ or ‘sullied’. Each variant changes the way we see Hamlet.

But what does this have to do with Open Shakespeare? Well, this little example shows how important it is to have a reliable text for each play, especially now that we will be annotating and one day producing critical editions from them. Currently, we have the Gutenberg text of the first folio, although, like many other first folios, this text is actually a hodgepodge of other first folios recomposed sometime in the 18th Century. We also have the Moby Shakespeare, so called for the man who produced the most widely circulated digital version of Shakespeare’s plays – but without saying what edition he used…

Having consulted with a few professors here in Cambridge (credit where it’s due: the info about composite folios comes from Prof. Kerrigan), it appears that there is a first folio actually in Cambridge. If we could find a way of digitising it, this would be a great benefit to Open Shakespeare, establishing, if not a ‘perfect’ text (which, once the Globe and Shakespeare’s own playtexts burnt down during a performance of Henry VIII could never now be possible), at least one with some historical authority.

I have no idea how we will digitise the Cambridge folio, so any suggestions would be welcome. I heard once that a young Arthur Miller, in order to hone his play-writing skills, copied out almost all of Shakespeare’s plays by hand. So, if you’re an aspiring playwright with lots of time on your hands, do get in touch.

XML and the Natural Language Toolkit

February 26, 2010 in Technical, Texts

I’ve been playing with the nltk (natural language toolkit) and the really useful Jon Bosak XML-annotated corpus these days, and these are some of the graphs I’ve been able to produce after analyzing the speech of the main characters of Macbeth (characters that say more than 100 lines):

[Figure: exclamations and interrogations]

Here we can see that Macduff is screaming a lot, and that when most characters talk it is not to question but to assert… Poor Macbeth and Lady Macduff question everything, while Lady Macbeth questions just as much as she asserts.
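For anyone who wants to try a count like this themselves, here is a minimal Python sketch. It is not the code behind the graph above, which used the nltk. It assumes the Bosak markup, in which each SPEECH element holds one or more SPEAKER elements followed by LINE elements. The function name `punctuation_by_speaker` and the tiny inline sample are my own; a real run would read the full play file instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A tiny stand-in for the full play file, using the same element names.
SAMPLE = """<PLAY>
<SPEECH><SPEAKER>MACBETH</SPEAKER>
<LINE>Is this a dagger which I see before me,</LINE>
<LINE>The handle toward my hand? Come, let me clutch thee.</LINE>
</SPEECH>
<SPEECH><SPEAKER>MACDUFF</SPEAKER>
<LINE>O horror, horror, horror! Tongue nor heart</LINE>
<LINE>Cannot conceive nor name thee!</LINE>
</SPEECH>
</PLAY>"""

def punctuation_by_speaker(xml_text):
    """Count '!' and '?' in the lines spoken by each character."""
    exclamations, questions = Counter(), Counter()
    root = ET.fromstring(xml_text)
    for speech in root.iter("SPEECH"):
        # Join the text of every LINE in the speech (including any
        # text nested inside child elements such as stage directions).
        text = " ".join("".join(line.itertext()) for line in speech.iter("LINE"))
        # A speech can be shared by several speakers; credit each one.
        for speaker in speech.findall("SPEAKER"):
            exclamations[speaker.text] += text.count("!")
            questions[speaker.text] += text.count("?")
    return exclamations, questions

exclamations, questions = punctuation_by_speaker(SAMPLE)
```

Counting punctuation marks is a crude proxy for "screaming" and "questioning", and it depends entirely on the punctuation choices of whoever edited the text.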

Regarding the number of words in the play, Macbeth talks by far the most:

[Figure: amount of words spoken by main characters]

But what about lexical variety? In this next graph, we can see the variety of the words:

[Figure: Macbeth - lexical variety]

Here we can see the variety of the characters’ speech.

The brown-ish words are said just once per character. The light greens are words that will be repeated in their speech, and the dark greens are repetitions of the light green words. I still need to take more measurements to see if this is actually the way everybody speaks: by repeating a lot of small words with just some new words once in a while. (There are more words that appear just once than words you will repeat through most of your speech! Think about it!)
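The three colour bands can be computed with nothing more than a word counter. The sketch below is my own reading of the graph, not the original nltk code: the function name `word_profile`, the dictionary keys, and the simple tokenising rule (runs of letters and apostrophes, so hyphenated words split in two) are all assumptions.

```python
import re
from collections import Counter

def word_profile(speech_text):
    """Split one character's speech into the three bands of the graph."""
    words = re.findall(r"[a-z']+", speech_text.lower())
    counts = Counter(words)
    # Words the character says exactly once (the brown-ish band).
    said_once = sum(1 for n in counts.values() if n == 1)
    # Distinct words said more than once (the light green band).
    repeated_words = len(counts) - said_once
    # Every occurrence after the first (the dark green band).
    repetitions = len(words) - len(counts)
    return {
        "total": len(words),
        "said_once": said_once,
        "repeated_words": repeated_words,
        "repetitions": repetitions,
    }

profile = word_profile(
    "To-morrow, and to-morrow, and to-morrow, "
    "creeps in this petty pace from day to day"
)
```

The three bands always add up to the total word count, which makes for a quick sanity check on any character's figures.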

by admin

OCRing Shakespeare Entry from Encyclopaedia Britannica 11th Edition

August 14, 2007 in Technical, Texts

One of the next things we want to do for Open Shakespeare is provide an open introduction to his works. The obvious idea for this was to use the Shakespeare entry in the 11th edition of the Encyclopaedia Britannica, as detailed in this ticket:

http://p.knowledgeforge.net/shakespeare/trac/ticket/24

We’ve now written code to grab the relevant tiffs off wikimedia:

http://p.knowledgeforge.net/shakespeare/svn/trunk/src/shakespeare/src/eb.py

You can also find them online (28 pages) starting at:

http://upload.wikimedia.org/wikipedia/commons/scans/EB1911_tiff/VOL24%20SAINTE-CLAIRE%20DEVILLE-SHUTTLE/ED4A800.TIF

Next step is to then OCR this stuff (after that we can move on to proofing whether by ourselves or via http://pgdp.net). When we first had a stab at this back in April we tried using gocr. Unfortunately the results were so bad that they were unusable. Recently an old ocr engine of HP’s has been released as open source under the name of tesseract:

http://code.google.com/p/tesseract-ocr/

We’re going to have a go using this — though if there is anyone out there with access to an alternative system we’d love to hear about it.
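As a rough idea of how the OCR step might be scripted, here is a minimal Python sketch that shells out to the tesseract command-line tool (basic usage: `tesseract <image> <output base>`, which writes `<output base>.txt`). It assumes tesseract is installed, on the PATH, and built with TIFF support. The helper names `tesseract_command` and `ocr_page` and the local `scans/` and `ocr/` directories are hypothetical, and none of this is taken from eb.py.

```python
import subprocess
from pathlib import Path

def tesseract_command(tiff_path, out_dir):
    """Build the command line for OCRing one scanned page."""
    tiff_path = Path(tiff_path)
    # tesseract appends ".txt" to the output base name itself.
    output_base = Path(out_dir) / tiff_path.stem
    return ["tesseract", str(tiff_path), str(output_base)]

def ocr_page(tiff_path, out_dir):
    """Run tesseract on one page and return the path of the text file."""
    cmd = tesseract_command(tiff_path, out_dir)
    subprocess.run(cmd, check=True)  # raises if tesseract exits with an error
    return Path(cmd[2] + ".txt")

# Hypothetical local copy of the first scanned page.
command = tesseract_command("scans/ED4A800.TIF", "ocr")
```

Keeping the command construction separate from the call makes it easy to print the commands for all 28 pages and check them before running anything.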

by admin

Annotation is Working!

April 10, 2007 in News, Technical

After another push over the last few days I’ve got the web annotation system for Open Shakespeare operational (we’ve been hacking on this on and off since back in December).

To see the system in action visit:

http://demo.openshakespeare.org/view?name=phoenix_and_the_turtle_gut&format=annotate

Quite a bit of effort has been made to decouple the annotation system from Open Shakespeare so that it can be easily reused elsewhere. You can find the code for the annotation system (nicknamed annotater) here:

http://p.knowledgeforge.net/shakespeare/svn/annotater/trunk/

There are still some substantial issues with the Open Shakespeare implementation, the most obvious of which are:

a) large texts bring the javascript to its knees (The Phoenix and the Turtle is the shortest of Shakespeare’s works, which is why I’m using it).

b) security/user authentication for annotation adding/editing/deleting

But the basic system is working.

by admin

Improvements to the Concordance

January 3, 2007 in Technical

One of the main items scheduled for v0.4 of Open Shakespeare is improvements to the responsiveness of the concordance. With the v0.3 codebase and just the sonnets as test material, loading up the list of words for the concordance alone took around 24s on my laptop. This is because even with a single text there are already over 18,000 items in the concordance and we were having to read through all of these to generate the list of words. Some recent commits (e.g. r:72) have gone some way to improving this responsiveness (loading the word list now takes 3s compared to 24s) but the result is not entirely satisfactory (printing full statistics takes 13s compared to 40s previously). One obvious way to go further is to use caching, either of individual web pages or of particular key parts such as all the distinct words occurring in the concordance. Caching works because the concordance only changes when new texts are added, which will usually only happen once, when the system is first initialised.
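To make the caching idea concrete, here is a minimal Python sketch. It is not the Open Shakespeare code: the `Concordance` class, its in-memory list of entries, and the method names are all invented for illustration. The point is only that the distinct-word list is computed once and thrown away whenever a text is added.

```python
class Concordance:
    """Toy in-memory concordance with a cached list of distinct words."""

    def __init__(self):
        self._entries = []            # (word, text_name, line_number) tuples
        self._distinct_cache = None   # built lazily, cleared on change

    def add_text(self, name, lines):
        for number, line in enumerate(lines, start=1):
            for word in line.lower().split():
                self._entries.append((word.strip(".,;:!?"), name, number))
        # The concordance has changed, so the cached word list is stale.
        self._distinct_cache = None

    def distinct_words(self):
        # Only scan every entry when there is no cached result.
        if self._distinct_cache is None:
            self._distinct_cache = sorted({w for w, _, _ in self._entries if w})
        return self._distinct_cache

conc = Concordance()
conc.add_text("sonnet_18", [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
])
words = conc.distinct_words()
```

Because texts are added so rarely, almost every request would be served from the cached list; the full scan happens once after each load.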

Relatedly, r:74 is a first step on filtering the concordance, in this case to exclude roman numerals and various non-words. Doing this made me think about whether the concordance should be storing actual words or just stems: for example, it does not seem to make much sense to have different entries for kill, kills, killed etc. Using a stemming algorithm such as the Porter stemmer (which I notice has a nice Python implementation directly available) we can easily stem each word as we go along. This would have several benefits, one of the most prominent being a dramatic reduction in the basic dictionary size (i.e. the number of distinct words in the concordance).
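Here is a small Python sketch of what storing stems might look like. To keep it self-contained it uses a deliberately crude suffix stripper, `toy_stem`, which is not the Porter algorithm and will get plenty of real words wrong. Both function names are made up for this example. If the nltk is installed, `nltk.stem.PorterStemmer().stem` could be passed in as the `stem` argument instead.

```python
from collections import defaultdict

def toy_stem(word):
    """Very crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def group_by_stem(words, stem=toy_stem):
    """Map each stem to the set of surface forms that reduce to it."""
    groups = defaultdict(set)
    for word in words:
        lowered = word.lower()
        groups[stem(lowered)].add(lowered)
    return dict(groups)

groups = group_by_stem(["kill", "kills", "killed", "killing", "king"])
```

Keeping the surface forms alongside each stem means the concordance could still show readers the words as they appear in the text while indexing on the much smaller set of stems.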