XML and the Natural Language Toolkit
I’ve been playing with NLTK (the Natural Language Toolkit) and Jon Bosak’s really useful XML-annotated corpus these days, and these are some of the graphs I’ve been able to produce after analyzing the speech of the main characters of the play (characters who say more than 100 lines):
Here we can see that Macduff is screaming a lot, and that when most characters talk it is not to question but to assert. Poor Macbeth and Lady Macduff question everything, while Lady Macbeth questions just about as much as she asserts.
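For anyone who wants to try this, here is a rough sketch of how the counts behind that graph could be computed. It is not the exact code I used: it reads a tiny made-up sample in the SPEECH/SPEAKER/LINE layout of the Bosak markup with the standard library, and it assumes a sentence's type can be guessed from its closing punctuation mark. The function names and the sample lines are invented for illustration. With NLTK and its Shakespeare corpus data installed, `nltk.corpus.shakespeare.xml('macbeth.xml')` should give you an element tree you can pass in the same way.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Made-up sample lines, only the element layout follows the Bosak markup.
SAMPLE = """
<PLAY>
  <ACT>
    <SCENE>
      <SPEECH>
        <SPEAKER>MACBETH</SPEAKER>
        <LINE>Who goes there? Is the night so dark?</LINE>
        <LINE>I will not sleep.</LINE>
      </SPEECH>
      <SPEECH>
        <SPEAKER>MACDUFF</SPEAKER>
        <LINE>Awake! Awake!</LINE>
        <LINE>Ring the bell.</LINE>
      </SPEECH>
      <SPEECH>
        <SPEAKER>MACBETH</SPEAKER>
        <LINE>What is the matter?</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
"""


def lines_by_speaker(root):
    """Map each speaker name to the list of LINE texts they speak."""
    spoken = defaultdict(list)
    for speech in root.iter("SPEECH"):
        # itertext() also picks up any text nested inside a LINE element.
        lines = ["".join(line.itertext()).strip() for line in speech.findall("LINE")]
        for speaker in speech.findall("SPEAKER"):
            spoken[(speaker.text or "").strip()].extend(lines)
    return dict(spoken)


def main_characters(spoken, min_lines):
    """Keep only the speakers with more than min_lines lines."""
    return {name: lines for name, lines in spoken.items() if len(lines) > min_lines}


def sentence_types(lines):
    """Count sentences by closing punctuation (assumption: '!' means exclaim,
    '?' means question, '.' means assert)."""
    labels = {"!": "exclaim", "?": "question", ".": "assert"}
    marks = re.findall(r"[.!?]+", " ".join(lines))
    return Counter(labels[mark[-1]] for mark in marks)


root = ET.fromstring(SAMPLE)
spoken = lines_by_speaker(root)
# The post uses a threshold of 100 lines; 1 is enough for this tiny sample.
for name, lines in main_characters(spoken, min_lines=1).items():
    print(name, dict(sentence_types(lines)))
```

Punctuation is a crude signal (a rhetorical question still counts as a question), so take the categories as a first approximation.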
As for the number of words in the play, Macbeth is by far the one who talks the most:
But what about lexical variety? In this next graph, we can see the variety of the words:
Here we can see the variety of the characters’ speech.
The brown-ish words are said just once per character. The light greens are words that repeat in their speech, and the dark greens are repetitions of the light green words. I still need to take more measurements to see if this is actually the way everybody speaks: by repeating a lot of small words, with just a few new words once in a while. (There are more words that appear just once than words you repeat through most of your speech! Think about it!)
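The three colours can be sketched as a small breakdown of one character's words. This is only one reading of the graph, not the exact code behind it: the function name, the simple regex tokenizer and the sample sentence are all made up for illustration. `nltk.word_tokenize` and `nltk.FreqDist` (which has a `hapaxes()` method for words that occur once) could do the same job.

```python
import re
from collections import Counter


def variety(text):
    """Split a character's words into the three groups described above."""
    # Lowercase and keep alphabetic words, allowing inner apostrophes.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),
        # Brown-ish: words this character says exactly once.
        "said_once": sum(1 for c in counts.values() if c == 1),
        # Light green: distinct words that come back at least once.
        "repeated_words": sum(1 for c in counts.values() if c > 1),
        # Dark green: every occurrence after the first of a repeated word.
        "repetitions": sum(c - 1 for c in counts.values() if c > 1),
    }


# Made-up sample sentence, not a line from the play.
stats = variety("The night is dark and the night is long")
print(stats)
```

The three groups always add up to the total number of tokens, which makes it easy to draw them as one stacked bar per character.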