Since the previous post we’ve succeeded in using tesseract, and
we now have a nice plain text version of the EB entry on Shakespeare:
http://knowledgeforge.net/shakespeare/svn/trunk/shksprdata/ancillary/britannica-11th.txt
What we now need to do is ‘proof’ this to correct the OCR errors. This
kind of thing is perfect for distributed volunteers, so if you’d like to
help out just step up and start correcting one of the sections. To make it especially easy for people to make edits, the text has been put in a temporary location on the Open Knowledge Foundation wiki (only the first five pages for the time being):
http://okfn.org/wiki/tmp/BritannicaShakespeare
September 19th, 2007
One of the next things we want to do for Open Shakespeare is provide an open
introduction to his works. The obvious idea for this was to use the
Shakespeare entry in the 11th edition of the Encyclopaedia Britannica, as
detailed in this ticket:
http://p.knowledgeforge.net/shakespeare/trac/ticket/24
We’ve now written code to grab the relevant tiffs off wikimedia:
http://p.knowledgeforge.net/shakespeare/svn/trunk/src/shakespeare/src/eb.py
You can also find them online (28 pages) starting at:
http://upload.wikimedia.org/wikipedia/commons/scans/EB1911_tiff/VOL24%20SAINTE-CLAIRE%20DEVILLE-SHUTTLE/ED4A800.TIF
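For anyone who wants to fetch the scans themselves, here is a minimal sketch of the idea. It is not the contents of eb.py: the function names `local_name` and `fetch_scans` are made up for illustration, and since only the first page's URL is given above, the list of all 28 URLs is assumed to be supplied by the caller.

```python
import os
import posixpath
import urllib.parse
import urllib.request

# The first page of the entry, as linked above.
FIRST_PAGE_URL = (
    "http://upload.wikimedia.org/wikipedia/commons/scans/EB1911_tiff/"
    "VOL24%20SAINTE-CLAIRE%20DEVILLE-SHUTTLE/ED4A800.TIF"
)


def local_name(url):
    """Return the decoded file name at the end of a scan URL."""
    path = urllib.parse.urlsplit(url).path
    return posixpath.basename(urllib.parse.unquote(path))


def fetch_scans(urls, dest_dir):
    """Download each URL into dest_dir, skipping files already present.

    Hypothetical helper: the caller must supply the full list of page URLs.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        target = os.path.join(dest_dir, local_name(url))
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved
```

Calling `fetch_scans([FIRST_PAGE_URL], "scans")` would save the first page as `scans/ED4A800.TIF`. Skipping files that already exist means an interrupted run can simply be restarted.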
The next step is to OCR this stuff (after that we can move on to
proofing, whether by ourselves or via http://pgdp.net). When we first had
a stab at this back in April we tried using gocr. Unfortunately the
results were so bad that they were unusable. Recently an old OCR engine
of HP’s has been released as open source under the name of tesseract:
http://code.google.com/p/tesseract-ocr/
We’re going to have a go using this, though if anyone out there has access to an alternative system we’d love to hear about it.
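As a rough sketch of what that run might look like, the snippet below calls the tesseract command line once per scanned page. It assumes the `tesseract` binary is installed and on the PATH, and relies on its basic `tesseract <image> <outputbase>` form, which writes `<outputbase>.txt`. The helper names and directory layout are hypothetical, not taken from our repository.

```python
import os
import subprocess


def tesseract_command(image_path, output_base):
    """Build the argument list for `tesseract <image> <outputbase>`."""
    return ["tesseract", image_path, output_base]


def ocr_directory(scan_dir, text_dir):
    """Run tesseract over every TIFF in scan_dir, one text file per page.

    Assumes the tesseract binary is available on the PATH.
    """
    os.makedirs(text_dir, exist_ok=True)
    outputs = []
    for name in sorted(os.listdir(scan_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in (".tif", ".tiff"):
            continue  # ignore anything that is not a scan
        output_base = os.path.join(text_dir, stem)
        command = tesseract_command(os.path.join(scan_dir, name), output_base)
        subprocess.run(command, check=True)  # raises if tesseract fails
        outputs.append(output_base + ".txt")  # tesseract adds the .txt suffix
    return outputs
```

Keeping one text file per page, rather than concatenating straight away, makes it easier to compare each page of OCR output against its scan when proofing.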
August 14th, 2007