You are browsing the archive for adalovelace.

XML and the Natural Language Toolkit

February 26, 2010 in Technical, Texts

I’ve been playing with the nltk (natural language toolkit) and the really useful Jon Bosak xml annotated corpus these days,  and  this are some of the graphs I’ve been able to parse after analyzing the speech of the main characters of the play (characters that say more than 100 lines of code:

exclamations and interrogations

exclamations and interrogations

Here we can see that Macduff is screaming a lot, and that when everybody talks is never to question, but to assert… Poor Macbeth and Lady Macduff question everything, while Lady Macbeth just as much as asserting.

Regarding amount of words in the play, by far Macbeth is the one that talks more:

amount of words spoken by main characters

amount of words spoken by main characters

But what about lexical variety? In this next graph, we can see the variety of the words:

Macbeth - lexical variety

Macbeth - lexical variety

Here we can see the variety of characters speech.

The brown-ish words are said just once per character. The light greens are word that will repeat on their speech, and the dark greens are repetitions of the light green words. I still need to take more measures to see if this is actually the way everybody speaks: by repeating a lot of small words with just some new words once in a while. (There are more words that appear just once, than the words you will repeat through most of your speech! Think about it!)