Open Shakespeare Blog

OCRing Shakespeare Entry from Encyclopaedia Britannica 11th Edition

One of the next things we want to do for Open Shakespeare is provide an open introduction to his works. The obvious candidate is the Shakespeare entry in the 11th edition of the Encyclopaedia Britannica, as detailed in this ticket:

http://p.knowledgeforge.net/shakespeare/trac/ticket/24

We’ve now written code to grab the relevant TIFFs from Wikimedia:

http://p.knowledgeforge.net/shakespeare/svn/trunk/src/shakespeare/src/eb.py

You can also find them online (28 pages) starting at:

http://upload.wikimedia.org/wikipedia/commons/scans/EB1911_tiff/VOL24%20SAINTE-CLAIRE%20DEVILLE-SHUTTLE/ED4A800.TIF
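For readers who just want the idea, here is a minimal sketch of how such a download could look. It is not the code in eb.py. The helper names `page_url` and `download_pages` and the destination directory are made up for illustration, and the caller has to supply the list of page file names, since only the first one (ED4A800.TIF) is given above.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve

# Base location and volume directory taken from the URL quoted above.
BASE_URL = "http://upload.wikimedia.org/wikipedia/commons/scans/EB1911_tiff"
VOLUME_DIR = "VOL24 SAINTE-CLAIRE DEVILLE-SHUTTLE"


def page_url(filename, volume_dir=VOLUME_DIR):
    """Build the URL of one scanned page, percent-encoding spaces."""
    return f"{BASE_URL}/{quote(volume_dir)}/{quote(filename)}"


def download_pages(filenames, dest_dir="eb11_shakespeare"):
    """Download each named TIFF into dest_dir (hypothetical directory name).

    Files that already exist locally are skipped, so the function can be
    re-run after an interrupted download.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in filenames:
        target = dest / name
        if not target.exists():
            urlretrieve(page_url(name), str(target))
        saved.append(target)
    return saved
```

Calling `page_url("ED4A800.TIF")` reproduces the URL of the first page given above; `download_pages` then just loops over whatever file names you pass it.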

The next step is to OCR these pages; after that we can move on to proofreading, whether by ourselves or via http://pgdp.net. When we first had a stab at this back in April we tried using gocr. Unfortunately the results were too poor to use. Recently an old OCR engine of HP’s has been released as open source under the name Tesseract:

http://code.google.com/p/tesseract-ocr/

We’re going to have a go using this, though if anyone out there has access to an alternative system we’d love to hear about it.
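As a rough idea of what the OCR step might look like, here is a small sketch that shells out to Tesseract for one page. It assumes the `tesseract` binary is installed and on the PATH, and that it is invoked as `tesseract <image> <output base> -l <language>`, writing its text to `<output base>.txt`. The function names are ours, for illustration only, and we have not run this over the Britannica scans.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Return the command line for OCRing one image.

    The output base is the image path without its extension; Tesseract
    appends ".txt" to it when writing the recognised text.
    """
    image = Path(image_path)
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_page(image_path, lang="eng"):
    """Run Tesseract on one page and return the path of the text file.

    Assumes the tesseract binary is on the PATH; raises
    subprocess.CalledProcessError if it exits with an error.
    """
    cmd = tesseract_command(image_path, lang)
    subprocess.run(cmd, check=True)
    return Path(cmd[2] + ".txt")
```

Keeping the command construction separate from the call makes it easy to check what would be run before pointing it at all 28 pages.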


Posted: August 14th, 2007 | Author: admin | Filed under: Technical, Texts

An Open Knowledge Foundation Project | Contact Us | (c) Open Knowledge Foundation
All material available under CC 'by' license v3.0 (all jurisdictions) | This Content and Data is Open
