On 6 July 2010 10:09, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
> Hi Jimmy. Thanks for the ideas.
>
> The plaques do crop up everywhere - I got some (square stone ones) in
> Barcelona last year. I want to focus on the blue English Heritage ones
> first as they represent a large part of the corpus and are most common
> in the UK.
>
> Re. the template phrases - absolutely. A lot of them appear to have
> sensible phrase types along with good named entities. I like the idea
> of adding geotags into the mix, I had been mulling the idea of
> searching for the named entities in Freebase and perhaps voting on the
> results.
>
> The Freebase/Wikipedia links are useful too as they'd provide extra
> annotations for OpenPlaques in the final result.
>
> Re. 'crazy' - only crazy as in "the sysadmins' time is spent mostly
> typing in plaques rather than working on ways of getting other
> people/tools to enter the plaques in a more scalable fashion" :-)
>
> Re. statistical techniques - good idea but that's out of my world of
> experience. OCR is also recent for me, I only started playing with it
> this year. A part of the reason for putting up prizes behind this
> challenge is to see how people crack this particular nut. I'll see if
> I can do some reading via the link, cheers.
>

The statistical stuff is actually quite easy in this case: it's very
similar to statistical spell checking. This article is a good
introduction: http://norvig.com/spell-correct.html

Statistical post-editing for OCR usually works only with character
n-grams. You can correct 'ehurch' to 'church' (as long as the corrector
is set to look for 'e' where there should be 'c') simply because
<START_OF_WORD> 'e' 'h' is an extremely unlikely combination, whereas
<START_OF_WORD> 'c' 'h' is a very common one. That only helps with
ambiguous characters in unambiguous words, though.

An n-gram language model does the same at the word level: 'are' and
'arc' are both valid words, so character statistics are no real help,
but the statistics from the preceding words are: 'the are' is extremely
unlikely, while 'the arc' is quite likely.

> Re. dictionary - I'd overlooked that and was thinking of ways of using
> the badly recognised text to vote on 'words that are more likely to be
> good replacements'. I'll experiment with the dictionary using some of
> the marked-up plaques.
>
> Much obliged,
> Ian.
>
> On 5 July 2010 22:24, Jimmy O'Regan <[email protected]> wrote:
>> On 5 July 2010 17:06, Ian Ozsvald (A.I. Cookbook) <[email protected]>
>> wrote:
>>> I'd like to announce an OCR challenge that will start soon, it is for
>>> an open source project and will include prizes.
>>>
>>> I'm a part of the http://OpenPlaques.org/ project, the site collects
>>> Flickr images of commemorative plaques which are manually transcribed
>>> and added to the site. Plaques generally have 20-100 words which
>>> describe a historic situation - the words are clear, often in white on
>>> a blue background. These plaques exist all over the UK (and in many
>>> other countries). The goal of the project is to make these historic
>>> locations easily searchable.
>>>
>>
>> We have them in Ireland too. One aspect of them that you can use is
>> that they tend to adhere to a set of 'template' phrases - Person was
>> born here, site of the battle of X, etc.
>>
>>> Here's an example entry for Sir Winston Churchill near me:
>>> http://www.openplaques.org/plaques/990
>>>
>>> The project founders *manually* transcribe the plaque photos at
>>> present - this is a crazy situation as they have several thousand
>>> plaques outstanding and more are added every day. The project is now
>>> international (it started in the UK less than a year ago) and an
>>> automatic transcription system is sorely needed.
>>>
>>
>> Not so crazy: if you already have a corpus of existing transcriptions,
>> that puts you in a position to use statistical post-editing
>> techniques.
>>
>> There is a tool for statistical post-editing here:
>> http://www.cs.toronto.edu/~mreimer/tesseract.html
>> If you have the option, I'd recommend changing it to output a word
>> lattice, and feeding that into an n-gram language model: IRSTLM is a
>> good open source toolset for n-gram language modelling.
>>
>>> As a part of my play-time projects I've set up an Artificial
>>> Intelligence Cookbook site where I'm building a community of
>>> like-minded folk who like solving interesting challenges. I've already
>>> documented a work-in-progress report on a manual solution to this
>>> problem using tesseract 2:
>>> http://blog.aicookbook.com/2010/06/optical-character-recognition-webservice-work-in-progress/
>>> and I've just posted a software outline in Python for (bad!) automatic
>>> recognition:
>>> http://blog.aicookbook.com/2010/07/automatic-plaque-transcription-using-python-work-in-progress/
>>>
>>> The OpenPlaques project are building a corpus of images with
>>> transcriptions for me, once we have a good set of images I'll begin
>>> the challenge. This should be in the next two weeks.
>>>
>>> You can see my demo code and a suggested solution here:
>>> http://aicookbook.com/wiki/Automatic_plaque_transcription
>>> and I'm *very* open to feedback in our Google Group:
>>> http://groups.google.com/group/aicookbook
>>>
>>
>> It very much looks like you're still brainstorming: if absolute
>> accuracy is your goal (and processing time, etc., are not so much of
>> an issue) one thing that would work for plaques is this: many newer
>> cameras add geotags to the EXIF tags. You can use the geotags to query
>> DBpedia for a list of Wikipedia articles that pertain to a particular
>> place, and extract a custom dictionary (and/or language model) for
>> that place -- if there's a plaque commemorating something, then it's
>> quite likely that Wikipedia mentions it. Names in particular tend to
>> be quite problematic for OCR, this way you can generate custom
>> wordlists that have a higher likelihood of containing those names. You
>> should get good enough results from Tesseract (or, indeed, any OCR
>> system) by passing such a list as the user dictionary, but if you
>> decide to use statistical language models you would also need those
>> words to be part of it, to avoid out-of-vocabulary errors. (IRST
>> supports interpolating from numerous individual models, so that's not
>> a problem).
>>
>>> I'll run the competition for several months with a prize for the best
>>> solution each month. Solutions get open sourced and sooner or later a
>>> good automatic solution will be created which can start automatically
>>> transcribing the OpenPlaques corpus of images. Winners will also get
>>> their name listed on the OpenPlaques site.
>>>
>>> If you'd like to test your skills with OCR then you'll find a good
>>> range of images to work on - from simple clean shots to angled, dark,
>>> smudged images of weather-beaten plaques taken at a distance.
>>>
>>> Cheers,
>>> Ian.
>>>
>>> --
>>> Ian Ozsvald (A.I. researcher, screencaster)
>>> [email protected]
>>>
>>> http://IanOzsvald.com
>>> http://MorConsulting.com/
>>> http://blog.AICookbook.com/
>>> http://TheScreencastingHandbook.com
>>> http://FivePoundApp.com/
>>> http://twitter.com/IanOzsvald
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups
>>> "tesseract-ocr" group.
>>> To post to this group, send email to [email protected].
>>> To unsubscribe from this group, send email to
>>> [email protected].
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>
>>
>> --
>> <Leftmost> jimregan, that's because deep inside you, you are evil.
>> <Leftmost> Also not-so-deep inside you.
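
To make the word-level point from the top of this reply concrete ('the
are' vs 'the arc'), here is a minimal sketch in Python. It is only an
illustration under stated assumptions: the three-sentence corpus stands
in for the existing plaque transcriptions, the `CONFUSIONS` table and
the function names `train_bigrams`, `variants` and `correct` are
hypothetical, and it uses raw bigram counts with a greedy left-to-right
choice rather than a smoothed language model over a word lattice, which
is what a toolkit such as IRSTLM would give you.

```python
from collections import Counter
from itertools import product

# Hypothetical confusion table: characters the OCR engine is assumed to mix up.
CONFUSIONS = {"e": "c", "c": "e"}


def train_bigrams(sentences):
    """Count word bigrams, with <s> marking the start of each sentence."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def variants(word):
    """All spellings reachable by swapping confusable characters."""
    options = [
        (ch, CONFUSIONS[ch]) if ch in CONFUSIONS else (ch,)
        for ch in word
    ]
    return {"".join(chars) for chars in product(*options)}


def correct(text, counts):
    """Greedily pick, for each word, the variant seen most often after the previous word."""
    previous, output = "<s>", []
    for word in text.lower().split():
        # On ties (including all-zero counts) the word is kept as the OCR engine read it.
        best = max(
            sorted(variants(word)),
            key=lambda v: (counts[(previous, v)], v == word),
        )
        output.append(best)
        previous = best
    return " ".join(output)


# Toy stand-in for a corpus of existing transcriptions (an assumption).
corpus = [
    "the arc of the bridge",
    "here stood the church",
    "they are remembered here",
]
counts = train_bigrams(corpus)
print(correct("the are of the ehurch", counts))  # the arc of the church
print(correct("they are remembered", counts))    # they are remembered
```

Note that 'are' is only changed to 'arc' when the preceding word makes
'arc' more likely; after 'they' it is left alone. With a real corpus you
would want smoothing so that unseen but valid word pairs are not treated
as impossible.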

