Re: OCR challenge with prizes for open source project - starting in two weeks

Jimmy O'Regan Wed, 07 Jul 2010 12:27:05 -0700

On 6 July 2010 10:09, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
> Hi Jimmy. Thanks for the ideas.
>
> The plaques do crop up everywhere - I got some (square stone ones) in
> Barcelona last year. I want to focus on the blue English Heritage ones
> first as they represent a large part of the corpus and are most common
> in the UK.
>
> Re. the template phrases - absolutely. A lot of them appear to have
> sensible phrase types along with good named entities. I like the idea
> of adding geotags into the mix, I had been mulling the idea of
> searching for the named entities in Freebase and perhaps voting on the
> results.
>
> The Freebase/Wikipedia links are useful too as they'd provide extra
> annotations for OpenPlaques in the final result.
>
> Re. 'crazy' - only crazy as in "the sysadmin's time is spent mostly
> typing in plaques rather than working on ways of getting other
> people/tools to enter the plaques in a more scalable fashion" :-)
>
> Re. statistical techniques - good idea but that's out of my world of
> experience. OCR is also recent for me, I only started playing with it
> this year. A part of the reason for putting up prizes behind this
> challenge is to see how people crack this particular nut. I'll see if
> I can do some reading via the link, cheers.
>


The statistical stuff is actually quite easy in this case: it is very
similar to the task of statistical spell checking. This article is a
good introduction: http://norvig.com/spell-correct.html

Statistical post-editing for OCR usually works only with character
n-grams. You can correct 'ehurch' to 'church' (as long as the
corrector is set to look for 'e' where there should be 'c') simply
because <START_OF_WORD> 'e' 'h' is an extremely unlikely combination,
whereas <START_OF_WORD> 'c' 'h' is a very common one. This only helps
with ambiguous characters in unambiguous words, though.

An n-gram language model does the same at the word level: 'are' and
'arc' are both valid words, so character statistics are no real help,
but the statistics from the preceding words are: 'the are' is
extremely unlikely, while 'the arc' is quite likely.
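
To make the two levels concrete, here is a rough Python sketch. It is
only an illustration, not what any existing post-editing tool does:
the toy corpus, the 'e'/'c' confusion table, the add-one smoothing and
all function names are my own assumptions. In practice the counts
would come from the existing plaque transcriptions.

```python
from collections import Counter

# Hypothetical toy corpus standing in for existing plaque transcriptions.
CORPUS = (
    "the church was built here . "
    "the arc of the bridge was designed here . "
    "the church and the chapel are near the arc ."
)
TOKENS = CORPUS.split()
WORDS = [t for t in TOKENS if t.isalpha()]

START = "^"                          # start-of-word marker
CONFUSIONS = {"e": "c", "c": "e"}    # assumed OCR confusion pairs
ALPHABET_SIZE = 27                   # 26 letters + start marker (assumption)


def char_bigrams(word):
    """Character bigrams of a word, including the start-of-word marker."""
    padded = START + word
    return list(zip(padded, padded[1:]))


# Character-level counts: how often each character follows another.
char_pairs = Counter()
char_contexts = Counter()
for w in WORDS:
    for a, b in char_bigrams(w):
        char_pairs[(a, b)] += 1
        char_contexts[a] += 1


def char_score(word):
    """Product of add-one smoothed character bigram probabilities."""
    score = 1.0
    for a, b in char_bigrams(word):
        score *= (char_pairs[(a, b)] + 1) / (char_contexts[a] + ALPHABET_SIZE)
    return score


def candidates(word):
    """The word itself plus every single-character confusion swap."""
    out = {word}
    for i, ch in enumerate(word):
        if ch in CONFUSIONS:
            out.add(word[:i] + CONFUSIONS[ch] + word[i + 1:])
    return out


def correct_chars(word):
    """Pick the candidate spelling with the best character score."""
    return max(candidates(word), key=char_score)


# Word-level counts: how often each word follows another.
word_pairs = Counter(zip(TOKENS, TOKENS[1:]))
word_contexts = Counter(TOKENS[:-1])
VOCAB_SIZE = len(set(TOKENS))


def pick_word(previous, options):
    """Pick the option most likely to follow `previous` (add-one smoothed)."""
    def score(word):
        return (word_pairs[(previous, word)] + 1) / (
            word_contexts[previous] + VOCAB_SIZE
        )
    return max(options, key=score)


if __name__ == "__main__":
    print(correct_chars("ehurch"))
    print(pick_word("the", ["are", "arc"]))
```

The character model can only repair 'ehurch' because 'church' is the
only plausible spelling; choosing between 'are' and 'arc' needs the
word-level counts, which is the job a real toolkit such as IRSTLM
would do with a properly trained model.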

> Re. dictionary - I'd overlooked that and was thinking of ways of using
> the badly recognised text to vote on 'words that are more likely to be
> good replacements'. I'll experiment with the dictionary using some of
> the marked-up plaques.
>
> Much obliged,
> Ian.
>
> On 5 July 2010 22:24, Jimmy O'Regan <[email protected]> wrote:
>> On 5 July 2010 17:06, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
>>> I'd like to announce an OCR challenge that will start soon, it is for
>>> an open source project and will include prizes.
>>>
>>> I'm a part of the http://OpenPlaques.org/ project, the site collects
>>> flickr images of commemorative plaques which are manually transcribed
>>> and added to the site. Plaques generally have 20-100 words which
>>> describe a historic situation - the words are clear, often in white on
>>> a blue background. These plaques exist all over the UK (and in many
>>> other countries). The goal of the project is to make these historic
>>> locations easily searchable.
>>>
>>
>> We have them in Ireland too. One aspect of them that you can use is
>> that they tend to adhere to a set of 'template' phrases - Person was
>> born here, site of the battle of X, etc.
>>
>>> Here's an example entry for Sir Winston Churchill near me:
>>> http://www.openplaques.org/plaques/990
>>>
>>> The project founders *manually* transcribe the plaque photos at
>>> present - this is a crazy situation as they have several thousand
>>> plaques outstanding and more are added every day. The project is now
>>> international (it started in the UK less than a year ago) and an
>>> automatic transcription system is sorely needed.
>>>
>>
>> Not so crazy: if you already have a corpus of existing transcriptions,
>> that puts you in a position to use statistical post-editing
>> techniques.
>>
>> There is a tool for statistical post-editing here:
>> http://www.cs.toronto.edu/~mreimer/tesseract.html
>> If you have the option, I'd recommend changing it to output a word
>> lattice, and feed that into an n-gram language model: IRSTLM is a good
>> open source toolset for n-gram language modelling.
>>
>>> As a part of my play-time projects I've set up an Artificial
>>> Intelligence Cookbook site where I'm building a community of
>>> like-minded folk who like solving interesting challenges. I've already
>>> documented a work-in-progress report on a manual solution to this
>>> problem using tesseract 2:
>>> http://blog.aicookbook.com/2010/06/optical-character-recognition-webservice-work-in-progress/
>>> and I've just posted a software outline in Python for (bad!) automatic
>>> recognition:
>>> http://blog.aicookbook.com/2010/07/automatic-plaque-transcription-using-python-work-in-progress/
>>>
>>> The OpenPlaques project are building a corpus of images with
>>> transcriptions for me, once we have a good set of images I'll begin
>>> the challenge. This should be in the next two weeks.
>>>
>>> You can see my demo code and a suggested solution here:
>>> http://aicookbook.com/wiki/Automatic_plaque_transcription
>>> and I'm *very* open to feedback in our Google Group:
>>> http://groups.google.com/group/aicookbook
>>>
>>
>> It very much looks like you're still brainstorming: if absolute
>> accuracy is your goal (and processing time, etc., are not so much of
>> an issue) one thing that would work for plaques is this: many newer
>> cameras add geotags to the EXIF tags. You can use the geotags to query
>> DBpedia for a list of Wikipedia articles that pertain to a particular
>> place, and extract a custom dictionary (and/or language model) for
>> that place -- if there's a plaque commemorating something, then it's
>> quite likely that Wikipedia mentions it. Names in particular tend to
>> be quite problematic for OCR, this way you can generate custom
>> wordlists that have a higher likelihood of containing those names. You
>> should get good enough results from Tesseract (or, indeed, any OCR
>> system) by passing such a list as the user dictionary, but if you
>> decide to use statistical language models you would also need those
>> words to be part of it, to avoid out-of-vocabulary errors. (IRST
>> supports interpolating from numerous individual models, so that's not
>> a problem).
>>
>>> I'll run the competition for several months with a prize for the best
>>> solution each month. Solutions get open sourced and sooner or later a
>>> good automatic solution will be created which can start automatically
>>> transcribing the OpenPlaques corpus of images. Winners will also get
>>> their name listed on the OpenPlaques site.
>>>
>>> If you'd like to test your skills with OCR then you'll find a good
>>> range of images to work on - from simple clean shots to angled, dark,
>>> smudged images of weather-beaten plaques taken at a distance.
>>>
>>> Cheers,
>>> Ian.
>>>
>>> --
>>> Ian Ozsvald (A.I. researcher, screencaster)
>>> [email protected]
>>>
>>> http://IanOzsvald.com
>>> http://MorConsulting.com/
>>> http://blog.AICookbook.com/
>>> http://TheScreencastingHandbook.com
>>> http://FivePoundApp.com/
>>> http://twitter.com/IanOzsvald
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups 
>>> "tesseract-ocr" group.
>>> To post to this group, send email to [email protected].
>>> To unsubscribe from this group, send email to 
>>> [email protected].
>>> For more options, visit this group at 
>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>
>>>
>>
>>
>>
>> --
>> <Leftmost> jimregan, that's because deep inside you, you are evil.
>> <Leftmost> Also not-so-deep inside you.



-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
