Open Shakespeare Blog

XML and the Natural Language Toolkit

I’ve been playing with NLTK (the Natural Language Toolkit) and Jon Bosak’s really useful XML-annotated corpus these days, and these are some of the graphs I’ve been able to produce after analyzing the speech of the main characters of Macbeth (characters who speak more than 100 lines):

exclamations and interrogations

Here we can see that Macduff is screaming a lot, and that when everybody else talks, it is never to question, but to assert… Poor Macbeth and Lady Macduff question everything, while Lady Macbeth questions just as much as she asserts.
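
For readers who want to try this, here is a minimal sketch of how such counts could be produced. It is not the script behind the graph: it uses only the Python standard library rather than NLTK, it assumes the corpus follows the Bosak markup (SPEECH elements containing SPEAKER and LINE children), and the `SAMPLE` fragment plus the names `lines_by_speaker`, `punctuation_counts` and `more_than` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Illustrative fragment in the style of the Bosak markup (not the real file).
SAMPLE = """<PLAY>
<ACT><SCENE>
<SPEECH><SPEAKER>MACDUFF</SPEAKER>
<LINE>O horror, horror, horror! Tongue nor heart</LINE>
<LINE>Cannot conceive nor name thee!</LINE>
</SPEECH>
<SPEECH><SPEAKER>MACBETH</SPEAKER>
<LINE>Is this a dagger which I see before me,</LINE>
<LINE>The handle toward my hand? Come, let me clutch thee.</LINE>
</SPEECH>
</SCENE></ACT>
</PLAY>"""


def lines_by_speaker(root):
    """Map each speaker name to the list of lines they speak."""
    speeches = defaultdict(list)
    for speech in root.iter("SPEECH"):
        lines = []
        for line in speech.findall("LINE"):
            # Keep the spoken text only: skip the text of nested elements
            # (such as stage directions) but keep what follows them.
            text = (line.text or "") + "".join(child.tail or "" for child in line)
            lines.append(text.strip())
        # A speech may list several speakers; credit the lines to each one.
        for speaker in speech.findall("SPEAKER"):
            speeches[(speaker.text or "").strip()].extend(lines)
    return speeches


def punctuation_counts(speeches, more_than=100):
    """Count '!' and '?' for every speaker with more than `more_than` lines."""
    counts = {}
    for speaker, lines in speeches.items():
        if len(lines) <= more_than:
            continue
        text = " ".join(lines)
        counts[speaker] = {"!": text.count("!"), "?": text.count("?")}
    return counts


root = ET.fromstring(SAMPLE)
speeches = lines_by_speaker(root)
# The sample is tiny, so the 100-line threshold is lowered to zero here.
counts = punctuation_counts(speeches, more_than=0)
for speaker, marks in counts.items():
    print(speaker, marks)
```

These are raw counts, so a character with few lines can still show a tall bar. Dividing each count by the number of lines or words spoken would give a rate instead.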

Regarding the number of words in the play, Macbeth is by far the one who talks the most:

amount of words spoken by main characters
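
A word total per character can be sketched along these lines. This is an assumption-laden illustration, not the original script: the `speeches` dictionary, the `tokenize` helper and its regular expression are hypothetical, and an NLTK tokenizer could be swapped in for the regex.

```python
import re
from collections import Counter

# A word is a run of letters, optionally with inner apostrophes (e.g. "o'er").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return WORD_RE.findall(text.lower())


def words_per_speaker(speeches):
    """Return a Counter mapping each speaker to the number of words spoken."""
    totals = Counter()
    for speaker, lines in speeches.items():
        for line in lines:
            totals[speaker] += len(tokenize(line))
    return totals


# Hypothetical input: speaker name -> list of spoken lines.
speeches = {
    "MACBETH": [
        "Is this a dagger which I see before me,",
        "The handle toward my hand? Come, let me clutch thee.",
    ],
    "MACDUFF": ["O horror, horror, horror! Tongue nor heart"],
}

totals = words_per_speaker(speeches)
# most_common() lists the speakers from most to fewest words.
for speaker, n_words in totals.most_common():
    print(speaker, n_words)
```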

But what about lexical variety? In this next graph, we can see the variety of the words:

Macbeth - lexical variety

Here we can see the variety of each character’s speech.

The brownish words are said just once by a character. The light greens are words that the character repeats, and the dark greens are the repetitions of those light green words. I still need to take more measurements to see if this is actually the way everybody speaks: by repeating a lot of small words, with just a few new words once in a while. (There are more words that appear just once than words you repeat through most of your speech! Think about it!)
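
The three colour bands can be read as a split of one character’s words by frequency. The sketch below is one possible reading of the graph, not the code that drew it: `variety_breakdown` and its keys are hypothetical names, the tokenizing regex is an assumption, and the sample sentence is only a stand-in for a character’s full speech.

```python
import re
from collections import Counter


def variety_breakdown(text):
    """Split a speaker's words into the three bands described above.

    once        -> distinct words used exactly one time (brownish)
    repeated    -> distinct words used more than once (light green)
    repetitions -> the extra occurrences of those words (dark green)
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    freq = Counter(words)
    once = sum(1 for n in freq.values() if n == 1)
    repeated = sum(1 for n in freq.values() if n > 1)
    repetitions = sum(n - 1 for n in freq.values() if n > 1)
    return {
        "once": once,
        "repeated": repeated,
        "repetitions": repetitions,
        "total": len(words),
    }


sample = ("Tomorrow, and tomorrow, and tomorrow, "
          "creeps in this petty pace from day to day")
bands = variety_breakdown(sample)
print(bands)
```

The three bands always add up to the total number of words, so they can be stacked into a single bar per character.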


Posted: February 26th, 2010 | Author: adalovelace | Filed under: Technical, Texts | 2 Comments »

2 Comments on “XML and the Natural Language Toolkit”

  1. Ingrid said at 6:10 pm on March 7th, 2010:

    Hi there,

    Thanks for the graphs – they are very interesting!

    I’m curious: in the first graph on exclamations and interrogations, Macbeth has shorter bars than, say, Macduff despite having more lines, so I’m guessing you normalised in some way.

    Did you look at the ratio of the number of ‘!’/‘?’ characters in a character’s speech to the number of words, lines or complete sentences spoken by the character – or did you do something else entirely?

    Many thanks again!

  2. adalovelace said at 12:00 pm on March 10th, 2010:

    Ingrid: the answer is very simple:
    I just counted the ‘!’s and ‘?’s per character. Macduff may not talk much, but when he does, it is with exclamation marks… whereas Macbeth is less prone to use them.
    I haven’t computed the ratio of them per word… that is a good idea for the next graphs!
    The nltk scripts I’ve used are in the contrib folder of the project source code: http://knowledgeforge.net/shakespeare/hg/log?rev=nltk

    Thanks for the idea Ingrid, I will see which other numbers I can infer…


An Open Knowledge Foundation Project | Contact Us | (c) Open Knowledge Foundation
All material available under CC 'by' license v3.0 (all jurisdictions) | This Content and Data is Open
