Re: OCR challenge with prizes for open source project - starting in two weeks

Ian Ozsvald (A.I. Cookbook) Thu, 26 Aug 2010 23:19:17 -0700

This is a quick update on the challenge - one of my collaborators has
posted an update which brings our average error down to 33 characters
per plaque (and in so doing he wins this month's prize). The error is
still too high but he's brought in a nice blue-region detector which
lets us isolate the right region of the image:
http://blog.aicookbook.com/2010/08/automatic-plaque-transcription-pytesseract-average-error-down-to-33-4/
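
For anyone who wants a feel for the blue-region idea before reading the
post, here is a minimal sketch in Python. It is not the collaborator's
detector (that is described at the link above); the function name
blue_bounding_box and both thresholds are assumptions made purely for
illustration. It marks a pixel as blue when the blue channel is bright and
clearly dominates red and green, then returns the box around those pixels.

```python
def blue_bounding_box(pixels, min_blue=120, margin=40):
    """Return (left, top, right, bottom) around the blue pixels, or None.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    min_blue and margin are illustrative guesses, not tuned values.
    """
    xs, ys = [], []
    for y, row in enumerate(pixels):
        for x, (r, g, b) in enumerate(row):
            # Blue must be bright and clearly stronger than red and green.
            if b >= min_blue and b - max(r, g) >= margin:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    # right and bottom are exclusive, as in most image-cropping APIs.
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


# Tiny synthetic "photo": a grey wall with a 3x2 blue plaque on it.
WALL, BLUE = (128, 128, 128), (20, 60, 200)
image = [[WALL] * 6 for _ in range(5)]
for y in (1, 2):
    for x in (2, 3, 4):
        image[y][x] = BLUE

print(blue_bounding_box(image))  # (2, 1, 5, 3)
```

The returned box could then be used to crop the photo before it is handed
to the OCR engine, so that brickwork and street furniture are not
recognised as text.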


I'm planning on presenting our results at an Open Day for the
OpenPlaques project at the end of September, so I'm hoping to put in
some time on the project in the next few weeks. Cleaning up the
recognised result (with e.g. Jimmy's n-gram suggestion) will soon be on
the agenda.

If anyone here is interested in contributing, there's an ongoing £25
monthly prize for the best open-source solution to the problem.

Cheers,
Ian.

On 7 July 2010 20:26, Jimmy O'Regan <[email protected]> wrote:
> On 6 July 2010 10:09, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
>> Hi Jimmy. Thanks for the ideas.
>>
>> The plaques do crop up everywhere - I got some (square stone ones) in
>> Barcelona last year. I want to focus on the blue English Heritage ones
>> first as they represent a large part of the corpus and are most common
>> in the UK.
>>
>> Re. the template phrases - absolutely. A lot of them appear to have
>> sensible phrase types along with good named entities. I like the idea
>> of adding geotags into the mix; I had been mulling the idea of
>> searching for the named entities in Freebase and perhaps voting on the
>> results.
>>
>> The Freebase/Wikipedia links are useful too as they'd provide extra
>> annotations for OpenPlaques in the final result.
>>
>> Re. 'crazy' - only crazy as in "the sysadmins' time is spent mostly
>> typing in plaques rather than working on ways of getting other
>> people/tools to enter the plaques in a more scalable fashion" :-)
>>
>> Re. statistical techniques - good idea but that's out of my world of
>> experience. OCR is also recent for me; I only started playing with it
>> this year. A part of the reason for putting up prizes behind this
>> challenge is to see how people crack this particular nut. I'll see if
>> I can do some reading via the link, cheers.
>>
>
> The statistical stuff is actually quite easy, in this case: very
> similar to the task of statistical spell checking. This article:
> http://norvig.com/spell-correct.html is a good introduction.
>
> Statistical post-editing for OCR usually only works with character
> n-grams: you can correct 'ehurch' to 'church' (as long as the
> corrector is set to look for 'e' where there should be 'c') simply
> because <START_OF_WORD> 'e' 'h' is an extremely unlikely combination,
> whereas <START_OF_WORD> 'c' 'h' will have a probability approaching
> 1.0. This only helps with ambiguous characters in unambiguous words,
> though. An n-gram language model does the same on a word level: 'are'
> and 'arc' are both valid words, so character statistics are no real
> help, but the statistics from the preceding words are: 'the are' is
> extremely unlikely, while 'the arc' is quite likely.
>
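
The two levels described above can be sketched in a few lines of Python.
This is a toy illustration under stated assumptions, not the Toronto
post-editing tool or IRSTLM: the training words and sentences are made up,
the helper names (train_char_bigrams, correct, pick_word) are hypothetical,
and add-one smoothing is simply the easiest choice to show. The confusion
table plays the role of "look for 'e' where there should be 'c'".

```python
import math
from collections import Counter
from itertools import product

START = "^"  # stands in for <START_OF_WORD>


def train_char_bigrams(words):
    """Count character bigrams, including the start-of-word marker."""
    pairs, firsts = Counter(), Counter()
    for word in words:
        chars = START + word
        for a, b in zip(chars, chars[1:]):
            pairs[(a, b)] += 1
            firsts[a] += 1
    return pairs, firsts


def log_prob(word, pairs, firsts, alphabet_size=27):
    """Add-one smoothed log-probability of a word under the bigram model."""
    chars = START + word
    return sum(
        math.log((pairs[(a, b)] + 1) / (firsts[a] + alphabet_size))
        for a, b in zip(chars, chars[1:])
    )


def correct(word, confusions, pairs, firsts):
    """Try every substitution the confusion table allows; keep the likeliest."""
    options = [[c] + confusions.get(c, []) for c in word]
    candidates = ("".join(chars) for chars in product(*options))
    return max(candidates, key=lambda w: log_prob(w, pairs, firsts))


def train_word_bigrams(sentences):
    """Count adjacent word pairs in whitespace-tokenised sentences."""
    pairs = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        pairs.update(zip(tokens, tokens[1:]))
    return pairs


def pick_word(previous, candidates, word_pairs):
    """Choose the candidate seen most often after the previous word."""
    return max(candidates, key=lambda w: word_pairs[(previous, w)])


# Toy data standing in for the corpus of existing transcriptions.
words = ["church", "chapel", "charles", "chief", "school",
         "the", "house", "here", "lived", "engineer"]
pairs, firsts = train_char_bigrams(words)
print(correct("ehurch", {"e": ["c"]}, pairs, firsts))  # church

sentences = ["the arc of the bridge", "they are buried here",
             "near the arc lamp"]
word_pairs = train_word_bigrams(sentences)
print(pick_word("the", ["are", "arc"], word_pairs))  # arc
```

A real corrector would need a much larger training set and a smoothed word
model, but the division of labour is the same: character statistics settle
'ehurch', word statistics settle 'are' versus 'arc'.
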
>> Re. dictionary - I'd overlooked that and was thinking of ways of using
>> the badly recognised text to vote on 'words that are more likely to be
>> good replacements'. I'll experiment with the dictionary using some of
>> the marked-up plaques.
>>
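
A first dictionary experiment could be as small as the sketch below. It
uses difflib.get_close_matches from the Python standard library to rank
dictionary words by similarity to a badly recognised word; the wrapper name
suggest, the sample word list and the 0.6 cutoff are assumptions for
illustration, and a real run would load the words from the marked-up
plaques.

```python
import difflib


def suggest(ocr_word, dictionary, n=3, cutoff=0.6):
    """Return up to n dictionary words ranked by similarity to the OCR output."""
    return difflib.get_close_matches(
        ocr_word.lower(), dictionary, n=n, cutoff=cutoff
    )


# Hypothetical word list; in practice this would come from the corpus.
dictionary = ["church", "churchill", "lived", "here",
              "engineer", "born", "winston"]

best = suggest("Churchlll", dictionary)
print(best[0])  # churchill
```

Words with no sufficiently close match return an empty list, which is a
useful signal that the token should be left alone or flagged for review.
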
>> Much obliged,
>> Ian.
>>
>> On 5 July 2010 22:24, Jimmy O'Regan <[email protected]> wrote:
>>> On 5 July 2010 17:06, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
>>>> I'd like to announce an OCR challenge that will start soon; it is for
>>>> an open source project and will include prizes.
>>>>
>>>> I'm a part of the http://OpenPlaques.org/ project. The site collects
>>>> Flickr images of commemorative plaques which are manually transcribed
>>>> and added to the site. Plaques generally have 20-100 words which
>>>> describe a historic situation - the words are clear, often in white on
>>>> a blue background. These plaques exist all over the UK (and in many
>>>> other countries). The goal of the project is to make these historic
>>>> locations easily searchable.
>>>>
>>>
>>> We have them in Ireland too. One aspect of them that you can use is
>>> that they tend to adhere to a set of 'template' phrases - Person was
>>> born here, site of the battle of X, etc.
>>>
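
One way to exploit the template phrases is to match the recognised text
against a handful of patterns and pull out the named parts. The sketch
below is only an illustration: the two patterns are taken from the examples
above, while the names TEMPLATES and match_template and the field names are
hypothetical. A real list would be derived from the existing
transcriptions.

```python
import re

# Two templates from the examples above; a real list would be much longer.
TEMPLATES = [
    ("born", re.compile(r"^(?P<person>.+?) was born here", re.IGNORECASE)),
    ("battle", re.compile(r"site of the battle of (?P<place>.+)",
                          re.IGNORECASE)),
]


def match_template(text):
    """Return (template_name, captured_fields) for the first match, or None."""
    cleaned = " ".join(text.split())  # collapse line breaks from the plaque
    for name, pattern in TEMPLATES:
        found = pattern.search(cleaned)
        if found:
            return name, found.groupdict()
    return None


print(match_template("John Smith\nwas born here in 1820"))
# ('born', {'person': 'John Smith'})
print(match_template("Site of the Battle of Clontarf"))
# ('battle', {'place': 'Clontarf'})
```

The captured fields are the parts most likely to be named entities, so they
are natural candidates for a lookup against a place-specific wordlist.
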
>>>> Here's an example entry for Sir Winston Churchill near me:
>>>> http://www.openplaques.org/plaques/990
>>>>
>>>> The project founders *manually* transcribe the plaque photos at
>>>> present - this is a crazy situation as they have several thousand
>>>> plaques outstanding and more are added every day. The project is now
>>>> international (it started in the UK less than a year ago) and an
>>>> automatic transcription system is sorely needed.
>>>>
>>>
>>> Not so crazy: if you already have a corpus of existing transcriptions,
>>> that puts you in a position to use statistical post-editing
>>> techniques.
>>>
>>> There is a tool for statistical post-editing here:
>>> http://www.cs.toronto.edu/~mreimer/tesseract.html
>>> If you have the option, I'd recommend changing it to output a word
>>> lattice, and feed that into an n-gram language model: IRSTLM is a good
>>> open source toolset for n-gram language modelling.
>>>
>>>> As a part of my play-time projects I've set up an Artificial
>>>> Intelligence Cookbook site where I'm building a community of
>>>> like-minded folk who like solving interesting challenges. I've already
>>>> documented a work-in-progress report on a manual solution to this
>>>> problem using tesseract 2:
>>>> http://blog.aicookbook.com/2010/06/optical-character-recognition-webservice-work-in-progress/
>>>> and I've just posted a software outline in Python for (bad!) automatic
>>>> recognition:
>>>> http://blog.aicookbook.com/2010/07/automatic-plaque-transcription-using-python-work-in-progress/
>>>>
>>>> The OpenPlaques project are building a corpus of images with
>>>> transcriptions for me; once we have a good set of images I'll begin
>>>> the challenge. This should be in the next two weeks.
>>>>
>>>> You can see my demo code and a suggested solution here:
>>>> http://aicookbook.com/wiki/Automatic_plaque_transcription
>>>> and I'm *very* open to feedback in our Google Group:
>>>> http://groups.google.com/group/aicookbook
>>>>
>>>
>>> It very much looks like you're still brainstorming: if absolute
>>> accuracy is your goal (and processing time, etc., are not so much of
>>> an issue) one thing that would work for plaques is this: many newer
>>> cameras add geotags to the EXIF tags. You can use the geotags to query
>>> DBpedia for a list of Wikipedia articles that pertain to a particular
>>> place, and extract a custom dictionary (and/or language model) for
>>> that place -- if there's a plaque commemorating something, then it's
>>> quite likely that Wikipedia mentions it. Names in particular tend to
>>> be quite problematic for OCR; this way you can generate custom
>>> wordlists that have a higher likelihood of containing those names. You
>>> should get good enough results from Tesseract (or, indeed, any OCR
>>> system) by passing such a list as the user dictionary, but if you
>>> decide to use statistical language models you would also need those
>>> words to be part of it, to avoid out-of-vocabulary errors. (IRSTLM
>>> supports interpolating from numerous individual models, so that's not
>>> a problem).
>>>
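
The two local steps of that pipeline can be sketched without touching the
network. The Python below converts an EXIF-style degrees/minutes/seconds
reading to decimal degrees (the form a location query would normally want)
and builds a wordlist from article text that is assumed to have been
fetched already. The DBpedia query itself is deliberately left out, and the
function names, the sample text and the one-word-per-line output are
assumptions; check the Tesseract documentation for the user dictionary
format of the version in use.

```python
import re


def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus N/S/E/W to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value


def build_wordlist(article_texts, min_length=3):
    """Collect the distinct words found in place-specific article text."""
    words = set()
    for text in article_texts:
        # Runs of letters, optionally joined by an apostrophe or hyphen.
        for token in re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text):
            if len(token) >= min_length:
                words.add(token)
    return sorted(words)


# Hypothetical coordinates and article text for one plaque location.
latitude = dms_to_decimal(51, 30, 0, "N")
longitude = dms_to_decimal(0, 7, 30, "W")
print(latitude, longitude)

articles = ["Isambard Kingdom Brunel lived in Chelsea."]
wordlist = build_wordlist(articles)
print("\n".join(wordlist))  # one word per line, ready to write to a file
```

Keeping the original capitalisation matters here, since the names are
exactly the words the general dictionary is likely to be missing.
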
>>>> I'll run the competition for several months with a prize for the best
>>>> solution each month. Solutions get open sourced and sooner or later a
>>>> good automatic solution will be created which can start automatically
>>>> transcribing the OpenPlaques corpus of images. Winners will also get
>>>> their name listed on the OpenPlaques site.
>>>>
>>>> If you'd like to test your skills with OCR then you'll find a good
>>>> range of images to work on - from simple clean shots to angled, dark,
>>>> smudged images of weather-beaten plaques taken at a distance.
>>>>
>>>> Cheers,
>>>> Ian.
>>>>
>>>> --
>>>> Ian Ozsvald (A.I. researcher, screencaster)
>>>> [email protected]
>>>>
>>>> http://IanOzsvald.com
>>>> http://MorConsulting.com/
>>>> http://blog.AICookbook.com/
>>>> http://TheScreencastingHandbook.com
>>>> http://FivePoundApp.com/
>>>> http://twitter.com/IanOzsvald
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups 
>>>> "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected].
>>>> To unsubscribe from this group, send email to 
>>>> [email protected].
>>>> For more options, visit this group at 
>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> <Leftmost> jimregan, that's because deep inside you, you are evil.
>>> <Leftmost> Also not-so-deep inside you.



-- 
Ian Ozsvald (A.I. researcher, screencaster)
[email protected]

http://IanOzsvald.com
http://MorConsulting.com/
http://blog.AICookbook.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald

