Re: OCR challenge with prizes for open source project - starting in two weeks

Jimmy O'Regan Mon, 05 Jul 2010 14:24:55 -0700

On 5 July 2010 17:06, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
> I'd like to announce an OCR challenge that will start soon, it is for
> an open source project and will include prizes.
>
> I'm a part of the http://OpenPlaques.org/ project, the site collects
> flickr images of commemorative plaques which are manually transcribed
> and added to the site. Plaques generally have 20-100 words which
> describe a historic situation - the words are clear, often in white on
> a blue background. These plaques exist all over the UK (and in many
> other countries). The goal of the project is to make these historic
> locations easily searchable.
>


We have them in Ireland too. One aspect of them that you can use is
that they tend to adhere to a set of 'template' phrases - Person was
born here, site of the battle of X, etc.
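
As a rough illustration of using those template phrases, here is a
minimal sketch that snaps a noisy OCR line to the closest known phrase
using Python's standard difflib. The TEMPLATES list, the function name
and the 0.8 cutoff are assumptions for illustration only; a real list
would be mined from the existing transcriptions.

```python
import difflib
import re

# Hypothetical template phrases; a real list would be mined from the
# existing OpenPlaques transcriptions.
TEMPLATES = [
    "was born here",
    "lived here",
    "died here",
    "site of the battle of",
]


def snap_to_template(ocr_line, templates=TEMPLATES, cutoff=0.8):
    """Return the template most similar to any word window of ocr_line.

    Returns None when no window reaches the similarity cutoff.
    """
    words = re.findall(r"[A-Za-z0-9']+", ocr_line.lower())
    best, best_score = None, cutoff
    for template in templates:
        n = len(template.split())
        # Slide a window of the template's word count over the OCR words.
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            score = difflib.SequenceMatcher(None, window, template).ratio()
            if score >= best_score:
                best, best_score = template, score
    return best
```

Something like snap_to_template("Oscar Wilde wos born here") would then
flag the line as an instance of "was born here", which tells you which
part of the line is the name and which part can simply be corrected to
the template.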

> Here's an example entry for Sir Winston Churchill near me:
> http://www.openplaques.org/plaques/990
>
> The project founders *manually* transcribe the plaque photos at
> present - this is a crazy situation as they have several thousand
> plaques outstanding and more are added every day. The project is now
> international (it started in the UK less than a year ago) and an
> automatic transcription system is sorely needed.
>

Not so crazy: if you already have a corpus of existing transcriptions,
that puts you in a position to use statistical post-editing
techniques.

There is a tool for statistical post-editing here:
http://www.cs.toronto.edu/~mreimer/tesseract.html
If you have the option, I'd recommend changing it to output a word
lattice and feeding that into an n-gram language model: IRSTLM is a good
open source toolset for n-gram language modelling.
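
To make the language-model idea concrete, here is a toy sketch in pure
Python. It is not IRSTLM and it does not handle a real word lattice: it
just trains an add-one smoothed bigram model on existing transcriptions
and picks the most probable of a few candidate readings. All function
names here are my own, for illustration.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab = len(unigrams) + 1  # +1 leaves room for unseen words
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(tokens, tokens[1:])
    )


def best_candidate(candidates, unigrams, bigrams):
    """Pick the candidate reading the model finds most probable."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))
```

With a model trained on the existing transcriptions, a candidate such
as "oscar wilde was born here" should outscore "oscar wilde was bom
here", because "was born" and "born here" have been seen before and
"bom" has not.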

> As a part of my play-time projects I've set up an Artificial
> Intelligence Cookbook site where I'm building a community of
> like-minded folk who like solving interesting challenges. I've already
> documented a work-in-progress report on a manual solution to this
> problem using tesseract 2:
> http://blog.aicookbook.com/2010/06/optical-character-recognition-webservice-work-in-progress/
> and I've just posted a software outline in Python for (bad!) automatic
> recognition:
> http://blog.aicookbook.com/2010/07/automatic-plaque-transcription-using-python-work-in-progress/
>
> The OpenPlaques project are building a corpus of images with
> transcriptions for me, once we have a good set of images I'll begin
> the challenge. This should be in the next two weeks.
>
> You can see my demo code and a suggested solution here:
> http://aicookbook.com/wiki/Automatic_plaque_transcription
> and I'm *very* open to feedback in our Google Group:
> http://groups.google.com/group/aicookbook
>

It very much looks like you're still brainstorming: if absolute
accuracy is your goal (and processing time, etc., are not so much of
an issue) one thing that would work for plaques is this: many newer
cameras add geotags to the EXIF tags. You can use the geotags to query
DBpedia for a list of Wikipedia articles that pertain to a particular
place, and extract a custom dictionary (and/or language model) for
that place -- if there's a plaque commemorating something, then it's
quite likely that Wikipedia mentions it. Names in particular tend to
be quite problematic for OCR; this way you can generate custom
wordlists that have a higher likelihood of containing those names. You
should get good enough results from Tesseract (or, indeed, any OCR
system) by passing such a list as the user dictionary, but if you
decide to use statistical language models you would also need those
words to be part of the model, to avoid out-of-vocabulary errors. (IRSTLM
supports interpolating from numerous individual models, so that's not
a problem.)
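
A rough sketch of that pipeline, with several assumptions: the
coordinates have already been read out of the EXIF data as
degrees/minutes/seconds plus an N/S/E/W reference, the DBpedia
resources carry geo:lat / geo:long (W3C WGS84 vocabulary) and
dbo:abstract properties, and a simple bounding box is good enough. The
function names, the 0.01 degree box and the result limit are
illustrative choices. The code only builds the query text; sending it to
a SPARQL endpoint is left out.

```python
import re


def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF-style degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    return -value if ref in ("S", "W") else value


def dbpedia_query(lat, lon, radius_deg=0.01, limit=50):
    """Build a SPARQL query for English abstracts of resources near a point."""
    return f"""
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?place ?abstract WHERE {{
  ?place geo:lat ?lat ; geo:long ?long ; dbo:abstract ?abstract .
  FILTER (?lat > {lat - radius_deg:.5f} && ?lat < {lat + radius_deg:.5f} &&
          ?long > {lon - radius_deg:.5f} && ?long < {lon + radius_deg:.5f} &&
          lang(?abstract) = "en")
}} LIMIT {limit}
""".strip()


def wordlist(abstracts, min_len=3):
    """Collect the distinct alphabetic words from a list of abstracts."""
    words = set()
    for text in abstracts:
        words.update(
            w for w in re.findall(r"[^\W\d_]+", text) if len(w) >= min_len
        )
    return sorted(words)
```

The output of wordlist() can be written one word per line and given to
the OCR engine as a user dictionary (the exact file name and location
depend on the Tesseract version), and the same abstracts can serve as
extra training text for a place-specific language model.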

> I'll run the competition for several months with a prize for the best
> solution each month. Solutions get open sourced and sooner or later a
> good automatic solution will be created which can start automatically
> transcribing the OpenPlaques corpus of images. Winners will also get
> their name listed on the OpenPlaques site.
>
> If you'd like to test your skills with OCR then you'll find a good
> range of images to work on - from simple clean shots to angled, dark,
> smudged images of weather-beaten plaques taken at a distance.
>
> Cheers,
> Ian.
>
> --
> Ian Ozsvald (A.I. researcher, screencaster)
> [email protected]
>
> http://IanOzsvald.com
> http://MorConsulting.com/
> http://blog.AICookbook.com/
> http://TheScreencastingHandbook.com
> http://FivePoundApp.com/
> http://twitter.com/IanOzsvald
>
> --
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>



-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
