On 5 July 2010 17:06, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
> I'd like to announce an OCR challenge that will start soon, it is for
> an open source project and will include prizes.
>
> I'm a part of the http://OpenPlaques.org/ project, the site collects
> flickr images of commemorative plaques which are manually transcribed
> and added to the site. Plaques generally have 20-100 words which
> describe a historic situation - the words are clear, often in white on
> a blue background. These plaques exist all over the UK (and in many
> other countries). The goal of the project is to make these historic
> locations easily searchable.
>
We have them in Ireland too. One aspect of them that you can use is that
they tend to adhere to a set of 'template' phrases - Person was born here,
site of the battle of X, etc.

> Here's an example entry for Sir Winston Churchill near me:
> http://www.openplaques.org/plaques/990
>
> The project founders *manually* transcribe the plaque photos at
> present - this is a crazy situation as they have several thousand
> plaques outstanding and more are added every day. The project is now
> international (it started in the UK less than a year ago) and an
> automatic transcription system is sorely needed.
>

Not so crazy: if you already have a corpus of existing transcriptions,
that puts you in a position to use statistical post-editing techniques.
There is a tool for statistical post-editing here:
http://www.cs.toronto.edu/~mreimer/tesseract.html

If you have the option, I'd recommend changing it to output a word
lattice, and feeding that into an n-gram language model: IRSTLM is a good
open source toolset for n-gram language modelling.

> As a part of my play-time projects I've set up an Artificial
> Intelligence Cookbook site where I'm building a community of
> like-minded folk who like solving interesting challenges. I've already
> documented a work-in-progress report on a manual solution to this
> problem using tesseract 2:
> http://blog.aicookbook.com/2010/06/optical-character-recognition-webservice-work-in-progress/
> and I've just posted a software outline in Python for (bad!) automatic
> recognition:
> http://blog.aicookbook.com/2010/07/automatic-plaque-transcription-using-python-work-in-progress/
>
> The OpenPlaques project are building a corpus of images with
> transcriptions for me, once we have a good set of images I'll begin
> the challenge. This should be in the next two weeks.
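
To make the lattice-plus-language-model suggestion a bit more concrete,
here is a toy sketch in plain Python. It is not IRSTLM and not the
post-editing tool linked above: the function names, the add-one smoothing,
the sample transcriptions and the "lattice" (simplified here to a list of
alternative words per position) are all assumptions made up for
illustration. The idea is only that a model trained on existing
transcriptions can pick the most plausible reading among OCR alternatives.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenised transcriptions."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed bigram log-probability (a deliberately crude choice)."""
    vocab = len(unigrams) + 1  # +1 reserves some mass for unseen words
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))


def best_path(lattice, unigrams, bigrams):
    """Viterbi search over a list of per-position candidate word lists."""
    # paths maps the last chosen word to (score, words chosen so far)
    paths = {"<s>": (0.0, [])}
    for candidates in lattice:
        new_paths = {}
        for word in candidates:
            w = word.lower()
            score, seq = max(
                ((s + log_prob(prev, w, unigrams, bigrams), sq)
                 for prev, (s, sq) in paths.items()),
                key=lambda item: item[0],
            )
            new_paths[w] = (score, seq + [w])
        paths = new_paths
    # Account for the end-of-sentence transition before picking a winner.
    _, seq = max(
        ((s + log_prob(prev, "</s>", unigrams, bigrams), sq)
         for prev, (s, sq) in paths.items()),
        key=lambda item: item[0],
    )
    return seq


# Hypothetical training transcriptions and OCR alternatives.
corpus = [
    "sir winston churchill lived here",
    "charles dickens lived here",
    "site of the battle of hastings",
    "john keats was born here",
]
unigrams, bigrams = train_bigram(corpus)
lattice = [
    ["sir"],
    ["winston", "whinston", "vvinston"],
    ["churchill", "churchi11"],
    ["lived", "1ived"],
    ["here", "hcre"],
]
print(" ".join(best_path(lattice, unigrams, bigrams)))
```

A real lattice would also carry the recogniser's own confidence for each
alternative, which should be combined with the language model score; this
sketch leaves that out, and a proper toolkit would use better smoothing
than add-one.
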
>
> You can see my demo code and a suggested solution here:
> http://aicookbook.com/wiki/Automatic_plaque_transcription
> and I'm *very* open to feedback in our Google Group:
> http://groups.google.com/group/aicookbook
>

It very much looks like you're still brainstorming: if absolute accuracy
is your goal (and processing time, etc., are not so much of an issue) one
thing that would work for plaques is this: many newer cameras add geotags
to the EXIF tags. You can use the geotags to query DBpedia for a list of
Wikipedia articles that pertain to a particular place, and extract a
custom dictionary (and/or language model) for that place -- if there's a
plaque commemorating something, then it's quite likely that Wikipedia
mentions it.

Names in particular tend to be quite problematic for OCR; this way you can
generate custom wordlists that have a higher likelihood of containing
those names. You should get good enough results from Tesseract (or,
indeed, any OCR system) by passing such a list as the user dictionary, but
if you decide to use statistical language models you would also need those
words to be part of the model, to avoid out-of-vocabulary errors. (IRSTLM
supports interpolating from numerous individual models, so that's not a
problem).

> I'll run the competition for several months with a prize for the best
> solution each month. Solutions get open sourced and sooner or later a
> good automatic solution will be created which can start automatically
> transcribing the OpenPlaques corpus of images. Winners will also get
> their name listed on the OpenPlaques site.
>
> If you'd like to test your skills with OCR then you'll find a good
> range of images to work on - from simple clean shots to angled, dark,
> smudged images of weather-beaten plaques taken at a distance.
>
> Cheers,
> Ian.
>
> --
> Ian Ozsvald (A.I.
researcher, screencaster)
> [email protected]
>
> http://IanOzsvald.com
> http://MorConsulting.com/
> http://blog.AICookbook.com/
> http://TheScreencastingHandbook.com
> http://FivePoundApp.com/
> http://twitter.com/IanOzsvald
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
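
P.S. A rough sketch of the geotag-to-wordlist idea from earlier in this
message, in plain Python. Everything here is an assumption for
illustration: the function names, the bounding-box size, and the use of
the `geo:lat` / `geo:long` and `dbo:abstract` properties to find articles
near a coordinate. Reading the EXIF geotag, sending the query to a DBpedia
SPARQL endpoint and parsing the response are all left out; the second
function just takes whatever article text comes back.

```python
import re

# Assumed query shape: places inside a small box around the photo's geotag.
SPARQL_TEMPLATE = """
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?place ?abstract WHERE {{
  ?place geo:lat ?lat ; geo:long ?long ; dbo:abstract ?abstract .
  FILTER (?lat > {south} && ?lat < {north} && ?long > {west} && ?long < {east})
  FILTER (lang(?abstract) = "en")
}}
LIMIT {limit}
"""

# Words of two or more characters; apostrophes and hyphens allowed inside.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]+")


def build_query(lat, lon, box=0.01, limit=50):
    """Return a SPARQL query string for a box of +/- `box` degrees."""
    return SPARQL_TEMPLATE.format(
        south=lat - box, north=lat + box,
        west=lon - box, east=lon + box,
        limit=limit,
    )


def wordlist_from_texts(texts, min_length=3):
    """Collect unique words, keeping case so proper names survive intact."""
    words = set()
    for text in texts:
        for word in WORD_RE.findall(text):
            if len(word) >= min_length:
                words.add(word)
    return sorted(words)


# Hypothetical coordinates and article text standing in for real results.
query = build_query(50.82, -0.14)
abstracts = ["Sir Winston Churchill lived in Brighton."]
for word in wordlist_from_texts(abstracts):
    print(word)
```

The output is one word per line, which is the general shape of a user word
list; the exact filename and how it gets loaded depend on the Tesseract
version, so check that against your install. The same words would need to
go into the language model's vocabulary if you take that route.
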

