Open Shakespeare Blog

XML and the Natural Language Toolkit

I’ve been playing with NLTK (the Natural Language Toolkit) and the really useful XML-annotated corpus by Jon Bosak these days, and these are some of the graphs I’ve been able to produce after analyzing the speech of the main characters of Macbeth (characters that say more than 100 lines):
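
As a rough idea of the first step, here is a minimal sketch of how the speeches could be grouped by character. It is not the script used for the graphs. It assumes the usual element names of the Bosak markup (SPEECH, SPEAKER, LINE), uses Python's standard xml.etree.ElementTree instead of NLTK, and runs on a tiny inline sample. The file name macbeth.xml in the comment and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny stand-in for the corpus file; element names are assumed to follow
# the Bosak markup (SPEECH / SPEAKER / LINE).
SAMPLE = """<PLAY>
<ACT><SCENE>
<SPEECH><SPEAKER>MACBETH</SPEAKER>
<LINE>So foul and fair a day I have not seen.</LINE></SPEECH>
<SPEECH><SPEAKER>BANQUO</SPEAKER>
<LINE>How far is't call'd to Forres? What are these</LINE>
<LINE>So wither'd and so wild in their attire,</LINE></SPEECH>
<SPEECH><SPEAKER>MACBETH</SPEAKER>
<LINE>Speak, if you can: what are you?</LINE></SPEECH>
</SCENE></ACT>
</PLAY>"""


def lines_by_speaker(root):
    """Map each speaker name to the list of lines they speak."""
    speeches = defaultdict(list)
    for speech in root.iter("SPEECH"):
        # itertext() also picks up text nested inside a LINE (e.g. stage directions).
        lines = ["".join(line.itertext()).strip() for line in speech.iter("LINE")]
        # A speech may list several speakers; credit the lines to each of them.
        for speaker in speech.findall("SPEAKER"):
            speeches[(speaker.text or "").strip()].extend(lines)
    return dict(speeches)


def main_characters(speeches, min_lines=100):
    """Keep only the speakers with more than min_lines lines."""
    return {name: lines for name, lines in speeches.items() if len(lines) > min_lines}


# For a real file (hypothetical name): root = ET.parse("macbeth.xml").getroot()
root = ET.fromstring(SAMPLE)
speeches = lines_by_speaker(root)
print({name: len(lines) for name, lines in speeches.items()})
# The sample is tiny, so lower the threshold to see the filter at work.
print(sorted(main_characters(speeches, min_lines=1)))
```

With the full play, leaving min_lines at 100 would give the "main characters" described above.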

exclamations and interrogations

Here we can see that Macduff is screaming a lot, and that most characters talk to assert rather than to question… Poor Macbeth and Lady Macduff question everything, while Lady Macbeth questions just about as much as she asserts.
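
One simple way to get numbers like these is to count the exclamation and question marks in each character's lines. The sketch below does just that. It assumes the speeches are already available as a mapping from character name to list of lines, and that counting punctuation marks is a fair stand-in for counting exclamations and questions. The function name and the toy input are illustrative only.

```python
def punctuation_profile(speeches):
    """Count '!' and '?' for each speaker in a {name: [lines]} mapping."""
    profile = {}
    for name, lines in speeches.items():
        text = " ".join(lines)
        profile[name] = {
            "exclamations": text.count("!"),
            "questions": text.count("?"),
        }
    return profile


# Toy input for illustration; real input would come from the parsed play.
sample = {
    "MACDUFF": ["O horror, horror, horror! Tongue nor heart", "Awake, awake!"],
    "LADY MACDUFF": ["What had he done, to make him fly the land?"],
}
profile = punctuation_profile(sample)
for name, counts in profile.items():
    print(name, counts)
```

Dividing each count by the character's number of lines would make characters with very different amounts of speech easier to compare.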

As for the number of words spoken in the play, Macbeth is by far the one who talks the most:

amount of words spoken by main characters
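
A count like this can be sketched in a few lines. The version below splits each line into words with a simple regular expression, which is an assumption on my side: an NLTK tokenizer could be swapped in and would give slightly different totals. The names and the toy input are illustrative.

```python
import re

# Letters and apostrophes only, so "is't" and "call'd" count as one word each.
WORD = re.compile(r"[A-Za-z']+")


def word_counts(speeches):
    """Total number of words per speaker in a {name: [lines]} mapping."""
    return {
        name: sum(len(WORD.findall(line)) for line in lines)
        for name, lines in speeches.items()
    }


# Toy input for illustration; real input would come from the parsed play.
sample = {
    "MACBETH": ["So foul and fair a day I have not seen."],
    "BANQUO": ["How far is't call'd to Forres?"],
}
totals = word_counts(sample)
# Rank the speakers from most to fewest words.
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals)
print(ranking)
```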

But what about lexical variety? The next graph shows how varied each character's vocabulary is:

Macbeth - lexical variety

Here we can see the variety of each character's speech.

The brownish words are those a character says just once. The light greens are words that recur in that character's speech, and the dark greens are the repetitions of those light green words. I still need to take more measurements to see whether this is how everybody actually speaks: repeating a lot of small words, with just a few new words once in a while. (There are more words that appear just once than words you repeat through most of your speech! Think about it!)
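
The three colour groups can be reproduced with a plain frequency count. This is only a sketch under my reading of the graph: "once" is the number of distinct words used a single time, "repeated" is the number of distinct words used more than once, and "repetitions" is every occurrence of those words after the first. It uses collections.Counter from the standard library (NLTK's FreqDist would work in much the same way), and the tokenizing pattern is an assumption.

```python
import re
from collections import Counter


def variety(text):
    """Split a character's speech into the three groups described above."""
    # Lowercase first so "To-morrow" and "to-morrow" count as the same word.
    words = re.findall(r"[a-z'-]+", text.lower())
    counts = Counter(words)
    once = sum(1 for c in counts.values() if c == 1)
    repeated = sum(1 for c in counts.values() if c > 1)
    repetitions = sum(c - 1 for c in counts.values() if c > 1)
    return {"once": once, "repeated": repeated, "repetitions": repetitions}


text = "To-morrow, and to-morrow, and to-morrow, creeps in this petty pace from day to day"
result = variety(text)
print(result)
# The three groups together account for every word in the text.
print(sum(result.values()))
```

Running this over each character's full speech, rather than a single sentence, would be the way to check whether single-use words really outnumber the repeated ones.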


Posted: February 26th, 2010 | Author: adalovelace | Filed under: Technical, Texts

An Open Knowledge Foundation Project | Contact Us | (c) Open Knowledge Foundation
All material available under CC 'by' license v3.0 (all jurisdictions) | This Content and Data is Open
