XML and the Natural Language Toolkit

February 26, 2010 in Technical, Texts

I’ve been playing with NLTK (the Natural Language Toolkit) and Jon Bosak’s really useful XML-annotated corpus these days, and these are some of the graphs I’ve been able to produce after analyzing the speech of the main characters of Macbeth (characters that say more than 100 lines):

exclamations and interrogations

Here we can see that Macduff is screaming a lot, and that when most characters talk, it is not to question but to assert. Poor Macbeth and Lady Macduff question everything, while Lady Macbeth questions about as much as she asserts.
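For anyone who wants to reproduce this kind of count, here is a minimal sketch using only Python's standard library. It assumes the Bosak markup, where each SPEECH element holds one or more SPEAKER elements followed by LINE elements. The function names, the inline sample, and the `min_lines` parameter are made up for illustration; this is not the script from the project repository.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a Bosak-style play file. For the real thing you would
# load the file instead, e.g. ET.parse("macbeth.xml").getroot()
# (the filename is an assumption).
SAMPLE = """<PLAY><ACT><SCENE>
<SPEECH><SPEAKER>MACDUFF</SPEAKER>
<LINE>O horror, horror, horror! Tongue nor heart</LINE>
<LINE>Cannot conceive nor name thee!</LINE></SPEECH>
<SPEECH><SPEAKER>BANQUO</SPEAKER>
<LINE>How far is't call'd to Forres? What are these</LINE></SPEECH>
</SCENE></ACT></PLAY>"""


def lines_by_speaker(root):
    """Map each speaker name to the list of lines they speak."""
    spoken = {}
    for speech in root.iter("SPEECH"):
        # itertext() also picks up text nested inside a LINE (stage directions).
        lines = ["".join(line.itertext()).strip() for line in speech.findall("LINE")]
        for speaker in speech.findall("SPEAKER"):
            name = (speaker.text or "").strip()
            spoken.setdefault(name, []).extend(lines)
    return spoken


def punctuation_counts(spoken, min_lines=0):
    """Return {speaker: (exclamation marks, question marks)} as raw counts.

    Speakers with min_lines lines or fewer are skipped.
    """
    counts = {}
    for speaker, lines in spoken.items():
        if len(lines) <= min_lines:
            continue
        text = " ".join(lines)
        counts[speaker] = (text.count("!"), text.count("?"))
    return counts


spoken = lines_by_speaker(ET.fromstring(SAMPLE))
counts = punctuation_counts(spoken)
for speaker, (exclamations, questions) in counts.items():
    print(speaker, exclamations, questions)
```

On the full play, passing `min_lines=100` would keep only the main characters as defined above.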

Regarding the number of words in the play, Macbeth is by far the one who talks the most:

amount of words spoken by main characters
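A word total per character can be sketched along these lines. The regular expression is a deliberately crude tokenizer chosen so the example runs without downloading anything; NLTK's own tokenizers would be a natural replacement. The `word_totals` name and the hand-typed sample lines are assumptions for illustration only.

```python
import re

# Crude tokenizer: runs of letters and apostrophes, after lowercasing.
WORD = re.compile(r"[a-z']+")


def word_totals(spoken):
    """Return (speaker, word count) pairs, biggest talker first."""
    totals = {
        speaker: sum(len(WORD.findall(line.lower())) for line in lines)
        for speaker, lines in spoken.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hand-typed sample: speaker name mapped to the lines they speak.
sample = {
    "BANQUO": ["How far is't call'd to Forres?"],
    "MACBETH": ["So foul and fair a day I have not seen."],
}
totals = word_totals(sample)
print(totals)
```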

But what about lexical variety? In this next graph, we can see the variety of the words:

Macbeth - lexical variety

Here we can see the variety of each character’s speech.

The brown-ish words are said just once per character. The light greens are words that repeat in their speech, and the dark greens are repetitions of the light green words. I still need to take more measurements to see if this is actually the way everybody speaks: by repeating a lot of small words, with just a few new words once in a while. (There are more words that appear just once than words you repeat through most of your speech! Think about it!)
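One way to read the three colours is as a split of a character's total words into words said once, distinct words that come back, and the extra repetitions of those. The sketch below computes that split under this reading, which is an interpretation of the graph rather than the code that drew it. The `lexical_breakdown` name and the regex tokenizer are assumptions, and the sample uses a modernised spelling of the line.

```python
import re
from collections import Counter


def lexical_breakdown(text):
    """Split a speaker's words into the three groups described above."""
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    said_once = sum(1 for n in freq.values() if n == 1)       # brown-ish
    repeated_words = sum(1 for n in freq.values() if n > 1)   # light green
    repetitions = len(words) - said_once - repeated_words     # dark green
    return {
        "said_once": said_once,
        "repeated_words": repeated_words,
        "repetitions": repetitions,
        "total": len(words),
    }


breakdown = lexical_breakdown(
    "Tomorrow, and tomorrow, and tomorrow, "
    "creeps in this petty pace from day to day"
)
print(breakdown)
```

The three groups always add up to the total, so they can be stacked in a single bar per character.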

2 responses to XML and the Natural Language Toolkit

  1. Hi there,

    Thanks for the graphs – they are very interesting!

    I’m curious, in the first graph on exclamations and interrogations, MacBeth has shorter bars than, say MacDuff despite having more lines, so I’m guessing you normalised in some way.

    Did you look at the ratio of the number of ‘!’/‘?’ characters in a character’s speech to the number of words, lines or complete sentences spoken by the character – or did you do something else entirely?

    Many thanks again!

  2. Ingrid: the answer is very simple: I just counted the ‘!’s and ‘?’s per character. MacDuff may not talk much, but when he does, it is with exclamation marks, whereas Macbeth is less prone to use them. I haven’t computed the ratio of them per word; that is a good idea for the next graphs! The NLTK scripts I’ve used are in the contrib folder of the project source code: http://knowledgeforge.net/shakespeare/hg/log?rev=nltk

    Thanks for the idea Ingrid, I will see which other numbers I can infer…

