Re: OCR challenge with prizes for open source project - starting in two weeks

Ian Ozsvald (A.I. Cookbook) Wed, 07 Jul 2010 08:13:27 -0700

Hi Jimmy. Thanks for the ideas.

The plaques do crop up everywhere - I got some (square stone ones) in
Barcelona last year. I want to focus on the blue English Heritage ones
first as they represent a large part of the corpus and are most common
in the UK.


Re. the template phrases - absolutely. A lot of them appear to have
sensible phrase types along with good named entities. I like the idea
of adding geotags into the mix; I had been mulling the idea of
searching for the named entities in Freebase and perhaps voting on the
results.
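
As a rough illustration of the template idea, here is a minimal Python
sketch that matches a transcription against a few phrase templates and
pulls out the named entity. The template names and regular expressions
are assumptions made up for this example; they are not taken from the
OpenPlaques corpus.

```python
import re

# Hypothetical templates: the patterns below are illustrative assumptions,
# not phrases extracted from real plaque transcriptions.
TEMPLATES = {
    "born_here": re.compile(r"^(?P<person>.+?) was born here\b", re.IGNORECASE),
    "lived_here": re.compile(r"^(?P<person>.+?) lived here\b", re.IGNORECASE),
    "battle_site": re.compile(
        r"^site of the battle of (?P<event>.+?)\.?$", re.IGNORECASE
    ),
}


def match_template(text):
    """Return (template_name, captured_fields) for the first match, else None."""
    # Plaque text is usually split over several lines, so collapse whitespace.
    cleaned = " ".join(text.split())
    for name, pattern in TEMPLATES.items():
        found = pattern.search(cleaned)
        if found:
            return name, found.groupdict()
    return None


if __name__ == "__main__":
    print(match_template("Sir Winston Churchill\nlived here"))
```

The captured fields are the natural candidates to look up in Freebase
or Wikipedia.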

The Freebase/Wikipedia links are useful too as they'd provide extra
annotations for OpenPlaques in the final result.

Re. 'crazy' - only crazy as in "the sysadmins' time is spent mostly
typing in plaques rather than working on ways of getting other
people/tools to enter the plaques in a more scalable fashion" :-)

Re. statistical techniques - good idea but that's out of my world of
experience. OCR is also recent for me; I only started playing with it
this year. A part of the reason for putting up prizes behind this
challenge is to see how people crack this particular nut. I'll see if
I can do some reading via the link, cheers.
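
To make the statistical idea concrete, here is a minimal sketch in
plain Python of a bigram language model trained on existing
transcriptions and used to pick the most plausible of several candidate
OCR readings. This is a simplification under stated assumptions: it
rescores whole candidate strings rather than a word lattice, it uses
add-one smoothing, and it does not use IRSTLM or the Toronto
post-editing tool. All function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(transcriptions):
    """Count unigrams and bigrams over a list of transcription strings."""
    unigrams, bigrams = Counter(), Counter()
    for line in transcriptions:
        tokens = ["<s>"] + line.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_score(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of one candidate reading."""
    vocab = len(unigrams) + 1  # +1 leaves room for unseen words
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
    return total


def best_reading(candidates, unigrams, bigrams):
    """Return the candidate the model considers most likely."""
    return max(candidates, key=lambda c: log_score(c, unigrams, bigrams))


if __name__ == "__main__":
    corpus = ["charles dickens lived here", "john keats lived here"]
    uni, bi = train_bigram_model(corpus)
    print(best_reading(["john keats livcd herc", "john keats lived here"], uni, bi))
```

The candidates here would come from whatever alternatives the OCR
engine can be made to emit; with only a single reading per plaque there
is nothing to rescore.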

Re. dictionary - I'd overlooked that and was thinking of ways of using
the badly recognised text to vote on 'words that are more likely to be
good replacements'. I'll experiment with the dictionary using some of
the marked-up plaques.
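
A minimal sketch of the dictionary experiment, using only the Python
standard library: each badly recognised token is replaced by its
closest wordlist entry when one is similar enough. The function name,
the 0.7 similarity cutoff and the sample wordlist are assumptions for
illustration; this post-corrects OCR output and is not the same thing
as passing a user dictionary to Tesseract.

```python
import difflib


def correct_words(ocr_text, wordlist, cutoff=0.7):
    """Replace each token with its closest wordlist entry, if close enough."""
    # Map lower-cased forms back to the wordlist's own spelling and casing.
    lookup = {word.lower(): word for word in wordlist}
    corrected = []
    for token in ocr_text.split():
        key = token.lower()
        if key in lookup:
            corrected.append(token)  # already a known word, keep as read
            continue
        matches = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
        # Fall back to the raw token when nothing in the wordlist is similar.
        corrected.append(lookup[matches[0]] if matches else token)
    return " ".join(corrected)


if __name__ == "__main__":
    words = ["Winston", "Churchill", "lived", "here"]
    print(correct_words("Winstcn Churchi11 lived here", words))
```

The wordlist could be built from the already transcribed plaques, and
the cutoff tuned against the marked-up set.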

Much obliged,
Ian.

On 5 July 2010 22:24, Jimmy O'Regan <[email protected]> wrote:
> On 5 July 2010 17:06, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
>> I'd like to announce an OCR challenge that will start soon. It is for
>> an open source project and will include prizes.
>>
>> I'm a part of the http://OpenPlaques.org/ project. The site collects
>> Flickr images of commemorative plaques which are manually transcribed
>> and added to the site. Plaques generally have 20-100 words which
>> describe a historic situation - the words are clear, often in white on
>> a blue background. These plaques exist all over the UK (and in many
>> other countries). The goal of the project is to make these historic
>> locations easily searchable.
>>
>
> We have them in Ireland too. One aspect of them that you can use is
> that they tend to adhere to a set of 'template' phrases - Person was
> born here, site of the battle of X, etc.
>
>> Here's an example entry for Sir Winston Churchill near me:
>> http://www.openplaques.org/plaques/990
>>
>> The project founders *manually* transcribe the plaque photos at
>> present - this is a crazy situation as they have several thousand
>> plaques outstanding and more are added every day. The project is now
>> international (it started in the UK less than a year ago) and an
>> automatic transcription system is sorely needed.
>>
>
> Not so crazy: if you already have a corpus of existing transcriptions,
> that puts you in a position to use statistical post-editing
> techniques.
>
> There is a tool for statistical post-editing here:
> http://www.cs.toronto.edu/~mreimer/tesseract.html
> If you have the option, I'd recommend changing it to output a word
> lattice and feeding that into an n-gram language model: IRSTLM is a good
> open source toolset for n-gram language modelling.
>
>> As a part of my play-time projects I've set up an Artificial
>> Intelligence Cookbook site where I'm building a community of
>> like-minded folk who like solving interesting challenges. I've already
>> documented a work-in-progress report on a manual solution to this
>> problem using tesseract 2:
>> http://blog.aicookbook.com/2010/06/optical-character-recognition-webservice-work-in-progress/
>> and I've just posted a software outline in Python for (bad!) automatic
>> recognition:
>> http://blog.aicookbook.com/2010/07/automatic-plaque-transcription-using-python-work-in-progress/
>>
>> The OpenPlaques project are building a corpus of images with
>> transcriptions for me. Once we have a good set of images I'll begin
>> the challenge. This should be in the next two weeks.
>>
>> You can see my demo code and a suggested solution here:
>> http://aicookbook.com/wiki/Automatic_plaque_transcription
>> and I'm *very* open to feedback in our Google Group:
>> http://groups.google.com/group/aicookbook
>>
>
> It very much looks like you're still brainstorming: if absolute
> accuracy is your goal (and processing time, etc., are not so much of
> an issue) one thing that would work for plaques is this: many newer
> cameras add geotags to the EXIF tags. You can use the geotags to query
> DBpedia for a list of Wikipedia articles that pertain to a particular
> place, and extract a custom dictionary (and/or language model) for
> that place -- if there's a plaque commemorating something, then it's
> quite likely that Wikipedia mentions it. Names in particular tend to
> be quite problematic for OCR; this way you can generate custom
> wordlists that have a higher likelihood of containing those names. You
> should get good enough results from Tesseract (or, indeed, any OCR
> system) by passing such a list as the user dictionary, but if you
> decide to use statistical language models you would also need those
> words to be part of it, to avoid out-of-vocabulary errors. (IRST
> supports interpolating from numerous individual models, so that's not
> a problem).
>
>> I'll run the competition for several months with a prize for the best
>> solution each month. Solutions get open sourced and sooner or later a
>> good automatic solution will be created which can start automatically
>> transcribing the OpenPlaques corpus of images. Winners will also get
>> their name listed on the OpenPlaques site.
>>
>> If you'd like to test your skills with OCR then you'll find a good
>> range of images to work on - from simple clean shots to angled, dark,
>> smudged images of weather-beaten plaques taken at a distance.
>>
>> Cheers,
>> Ian.
>>
>> --
>> Ian Ozsvald (A.I. researcher, screencaster)
>> [email protected]
>>
>> http://IanOzsvald.com
>> http://MorConsulting.com/
>> http://blog.AICookbook.com/
>> http://TheScreencastingHandbook.com
>> http://FivePoundApp.com/
>> http://twitter.com/IanOzsvald
>>
>> --
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to 
>> [email protected].
>> For more options, visit this group at 
>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>
>>
>
>
>
> --
> <Leftmost> jimregan, that's because deep inside you, you are evil.
> <Leftmost> Also not-so-deep inside you.
>



-- 
Ian Ozsvald (A.I. researcher, screencaster)
[email protected]

http://IanOzsvald.com
http://MorConsulting.com/
http://blog.AICookbook.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald
