On 6 July 2010 22:41, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
> Thanks for the geo-location idea, I've updated the wiki having tried
> wikilocation.org
> http://aicookbook.com/wiki/Automatic_plaque_transcription
>
> I've noticed that some of the relevant pages on Wikipedia aren't
> geo-tagged but they're linked from geo-tagged pages, or direct
> searches (perhaps with two passes of tesseract) sometimes reveal
> useful pages. There's certainly good source material to use here.
>
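The "two passes" idea is easy to script, by the way. A minimal sketch in
Python - the 'tesseract image outbase' command-line usage is real (it
writes outbase.txt), but the file names and the crude word filter below
are illustrative only:

```python
import re

def tesseract_cmd(image, outbase):
    # tesseract writes its transcription to outbase + ".txt"
    return ["tesseract", image, outbase]

def plausible_words(text, min_len=4):
    # keep alphabetic tokens long enough to be worth feeding back in,
    # e.g. into a Wikipedia search or the user dictionary for pass two
    return sorted({w for w in re.findall(r"[A-Za-z]+", text)
                   if len(w) >= min_len})

# First pass (requires tesseract on the PATH):
#   import subprocess
#   subprocess.check_call(tesseract_cmd("plaque.png", "pass1"))
#   words = plausible_words(open("pass1.txt").read())
# ...then use 'words' to find candidate Wikipedia pages, and re-run
# tesseract with any harvested proper names added to eng.user-words.
print(plausible_words("S1R WINSTON CHUBCHILL lived here 1874"))
```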
Yes; that's why I mentioned DBpedia. DBpedia uses Wikipedia's structure
to extract semantic data (RDF) and provides a public SPARQL endpoint to
query - I'll assume that, as you've mentioned AI, you either know what
I'm talking about, or these will be easy concepts for you to pick up :)
There's also the GeoNames dataset, to allow reverse geo-lookup. I think
Freebase contains both GeoNames and DBpedia, so you're covered.

> I've tried your suggestion of adding some dictionary words but that
> didn't change the quality of recognition. I edited:
> /usr/local/share/tessdata/eng.user-words
> which had 925 lines of data already (I'm using tesseract 3 via svn,
> built last night with 'sudo make install' on my MacBook).
>
> I confirmed that TESSDATA_PREFIX points at this location (and made it
> point elsewhere just to check that tesseract reported an error).
>
> Having added:
> 1866
> Gold
> Albert
> Medal
> posthumously
> 1881
> when recognising a black and white, thresholded version of:
> http://www.flickr.com/photos/54145...@n00/4701399020/
> it still fails to recognise the above words (I added these after the
> first run of tesseract; they were the poorest-recognised words). There
> is no difference in the output file before/after adding these lines.
>
> Am I doing something silly?
>

Well, adding /those/ words shouldn't make much of a difference - they
should be in the normal dictionaries - the suggestion was intended more
for less common proper names.

Also, Tesseract uses an adaptive classifier, so concatenating multiple
images together (say, into a multi-page TIFF) should give much better
results than running on each page individually. It occurs to me now that
persisting the classifier's state would be useful in a variety of other
areas, such as business card scanning.

> Cheers,
> Ian.
>
> On 6 July 2010 10:09, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
>> Hi Jimmy. Thanks for the ideas.
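To make the DBpedia suggestion concrete, here's a rough sketch of
building such a geo query in Python. The endpoint URL and the
geo:lat/geo:long/foaf:isPrimaryTopicOf predicates are standard DBpedia
conventions, but verify them against the current dataset - treat this
as a starting point, not a tested query:

```python
from urllib.parse import urlencode  # used in the commented fetch below

DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"

def nearby_articles_query(lat, lon, box=0.05, limit=50):
    """SPARQL for Wikipedia pages geo-tagged within +/-box degrees."""
    return """
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?page WHERE {
  ?s geo:lat ?lat ; geo:long ?long ; foaf:isPrimaryTopicOf ?page .
  FILTER (?lat  > %f && ?lat  < %f &&
          ?long > %f && ?long < %f)
} LIMIT %d
""" % (lat - box, lat + box, lon - box, lon + box, limit)

# The query would then be fetched with urllib.request.urlopen:
#   url = DBPEDIA_ENDPOINT + "?" + urlencode(
#       {"query": nearby_articles_query(51.5, -0.12), "format": "json"})
```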
>>
>> The plaques do crop up everywhere - I got some (square stone ones) in
>> Barcelona last year. I want to focus on the blue English Heritage ones
>> first as they represent a large part of the corpus and are most common
>> in the UK.
>>
>> Re. the template phrases - absolutely. A lot of them appear to have
>> sensible phrase types along with good named entities. I like the idea
>> of adding geotags into the mix; I had been mulling the idea of
>> searching for the named entities in Freebase and perhaps voting on the
>> results.
>>
>> The Freebase/Wikipedia links are useful too as they'd provide extra
>> annotations for OpenPlaques in the final result.
>>
>> Re. 'crazy' - only crazy as in "the sysadmin's time is spent mostly
>> typing in plaques rather than working on ways of getting other
>> people/tools to enter the plaques in a more scalable fashion" :-)
>>
>> Re. statistical techniques - good idea, but that's outside my
>> experience. OCR is also recent for me; I only started playing with it
>> this year. Part of the reason for putting up prizes behind this
>> challenge is to see how people crack this particular nut. I'll see if
>> I can do some reading via the link, cheers.
>>
>> Re. dictionary - I'd overlooked that and was thinking of ways of using
>> the badly recognised text to vote on 'words that are more likely to be
>> good replacements'. I'll experiment with the dictionary using some of
>> the marked-up plaques.
>>
>> Much obliged,
>> Ian.
>>
>> On 5 July 2010 22:24, Jimmy O'Regan <[email protected]> wrote:
>>> On 5 July 2010 17:06, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
>>>> I'd like to announce an OCR challenge that will start soon; it is
>>>> for an open source project and will include prizes.
>>>>
>>>> I'm a part of the http://OpenPlaques.org/ project; the site collects
>>>> Flickr images of commemorative plaques which are manually
>>>> transcribed and added to the site.
>>>> Plaques generally have 20-100 words which describe a historic
>>>> situation - the words are clear, often in white on a blue
>>>> background. These plaques exist all over the UK (and in many other
>>>> countries). The goal of the project is to make these historic
>>>> locations easily searchable.
>>>>
>>>
>>> We have them in Ireland too. One aspect of them that you can use is
>>> that they tend to adhere to a set of 'template' phrases - "Person was
>>> born here", "site of the battle of X", etc.
>>>
>>>> Here's an example entry for Sir Winston Churchill near me:
>>>> http://www.openplaques.org/plaques/990
>>>>
>>>> The project founders *manually* transcribe the plaque photos at
>>>> present - this is a crazy situation as they have several thousand
>>>> plaques outstanding and more are added every day. The project is now
>>>> international (it started in the UK less than a year ago) and an
>>>> automatic transcription system is sorely needed.
>>>>
>>>
>>> Not so crazy: if you already have a corpus of existing
>>> transcriptions, that puts you in a position to use statistical
>>> post-editing techniques.
>>>
>>> There is a tool for statistical post-editing here:
>>> http://www.cs.toronto.edu/~mreimer/tesseract.html
>>> If you have the option, I'd recommend changing it to output a word
>>> lattice and feeding that into an n-gram language model: IRSTLM is a
>>> good open source toolset for n-gram language modelling.
>>>
>>>> As a part of my play-time projects I've set up an Artificial
>>>> Intelligence Cookbook site where I'm building a community of
>>>> like-minded folk who like solving interesting challenges. I've
>>>> already documented a work-in-progress report on a manual solution to
>>>> this problem using tesseract 2:
>>>> http://blog.aicookbook.com/2010/06/optical-character-recognition-webservice-work-in-progress/
>>>> and I've just posted a software outline in Python for (bad!)
>>>> automatic recognition:
>>>> http://blog.aicookbook.com/2010/07/automatic-plaque-transcription-using-python-work-in-progress/
>>>>
>>>> The OpenPlaques project is building a corpus of images with
>>>> transcriptions for me; once we have a good set of images I'll begin
>>>> the challenge. This should be in the next two weeks.
>>>>
>>>> You can see my demo code and a suggested solution here:
>>>> http://aicookbook.com/wiki/Automatic_plaque_transcription
>>>> and I'm *very* open to feedback in our Google Group:
>>>> http://groups.google.com/group/aicookbook
>>>>
>>>
>>> It very much looks like you're still brainstorming: if absolute
>>> accuracy is your goal (and processing time, etc., are not so much of
>>> an issue), one thing that would work for plaques is this: many newer
>>> cameras add geotags to the EXIF data. You can use the geotags to
>>> query DBpedia for a list of Wikipedia articles that pertain to a
>>> particular place, and extract a custom dictionary (and/or language
>>> model) for that place - if there's a plaque commemorating something,
>>> then it's quite likely that Wikipedia mentions it. Names in
>>> particular tend to be quite problematic for OCR; this way you can
>>> generate custom wordlists that have a higher likelihood of
>>> containing those names. You should get good enough results from
>>> Tesseract (or, indeed, any OCR system) by passing such a list as the
>>> user dictionary, but if you decide to use statistical language
>>> models you would also need those words to be part of the model, to
>>> avoid out-of-vocabulary errors. (IRSTLM supports interpolating
>>> numerous individual models, so that's not a problem.)
>>>
>>>> I'll run the competition for several months with a prize for the
>>>> best solution each month. Solutions get open sourced and sooner or
>>>> later a good automatic solution will be created which can start
>>>> automatically transcribing the OpenPlaques corpus of images.
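The custom-wordlist step can be prototyped in a few lines. This sketch
assumes you already have article text back from the geo query; it keeps
capitalised tokens (Tesseract's eng.user-words really is just one word
per line) and shows how the same list can snap a noisy OCR token to its
closest entry. The example text and the 0.8 cutoff are arbitrary:

```python
import difflib
import re

def proper_nouns(text):
    # capitalised tokens are a cheap stand-in for real NER
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]+", text)
    return sorted({t for t in tokens if t[0].isupper()})

def snap(token, wordlist, cutoff=0.8):
    # replace an OCR token with its closest wordlist entry, if any is
    # similar enough; otherwise leave it alone
    hits = difflib.get_close_matches(token, wordlist, n=1, cutoff=cutoff)
    return hits[0] if hits else token

# Stand-in for text fetched from Wikipedia via the DBpedia query:
article = ("Sir Winston Churchill, awarded the Nobel Prize, "
           "lived at Hyde Park Gate from 1945.")
words = proper_nouns(article)

# The list doubles as eng.user-words content (location depends on
# TESSDATA_PREFIX):
#   with open("/usr/local/share/tessdata/eng.user-words", "a") as f:
#       f.write("\n".join(words) + "\n")

print(snap("Churchili", words))  # snaps the OCR error to the real name
```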
>>>> Winners will also get their name listed on the OpenPlaques site.
>>>>
>>>> If you'd like to test your skills with OCR then you'll find a good
>>>> range of images to work on - from simple clean shots to angled,
>>>> dark, smudged images of weather-beaten plaques taken at a distance.
>>>>
>>>> Cheers,
>>>> Ian.
>>>>
>>>> --
>>>> Ian Ozsvald (A.I. researcher, screencaster)
>>>> [email protected]
>>>>
>>>> http://IanOzsvald.com
>>>> http://MorConsulting.com/
>>>> http://blog.AICookbook.com/
>>>> http://TheScreencastingHandbook.com
>>>> http://FivePoundApp.com/
>>>> http://twitter.com/IanOzsvald
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups
>>>> "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected].
>>>> To unsubscribe from this group, send email to
>>>> [email protected].
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>
>>>
>>
>

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

