On 6 July 2010 22:41, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
> Thanks for the geo-location idea, I've updated the wiki having tried
> wikilocation.org
> http://aicookbook.com/wiki/Automatic_plaque_transcription
>
> I've noticed that some of the relevant pages on Wikipedia aren't
> geo-tagged but they're linked from geo-tagged pages, or direct
> searches (perhaps with two passes of tesseract) sometimes reveal
> useful pages. There's certainly good source material to use here.
>
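The "two passes" idea is easy to script, by the way. A minimal sketch in
Python - the 'tesseract image outbase' command-line usage is real (it
writes outbase.txt), but the file names and the crude word filter below
are illustrative only:

```python
import re

def tesseract_cmd(image, outbase):
    # tesseract writes its transcription to outbase + ".txt"
    return ["tesseract", image, outbase]

def plausible_words(text, min_len=4):
    # keep alphabetic tokens long enough to be worth feeding back in,
    # e.g. into a Wikipedia search or the user dictionary for pass two
    return sorted({w for w in re.findall(r"[A-Za-z]+", text)
                   if len(w) >= min_len})

# First pass (requires tesseract on the PATH):
#   import subprocess
#   subprocess.check_call(tesseract_cmd("plaque.png", "pass1"))
#   words = plausible_words(open("pass1.txt").read())
# ...then use 'words' to find candidate Wikipedia pages, and re-run
# tesseract with any harvested proper names added to eng.user-words.
print(plausible_words("S1R WINSTON CHUBCHILL lived here 1874"))
```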
Yes; that's why I mentioned DBpedia. DBpedia uses Wikipedia's structure
to extract semantic data (RDF) and provides a public SPARQL endpoint to
query - I'll assume that, as you've mentioned AI, you either know what
I'm talking about, or these will be easy concepts for you to pick up :)
There's also the GeoNames dataset, to allow reverse geo-lookup. I think
Freebase contains both GeoNames and DBpedia, so you're covered.

> I've tried your suggestion of adding some dictionary words but that
> didn't change the quality of recognition. I edited:
> /usr/local/share/tessdata/eng.user-words
> which had 925 lines of data already (I'm using tesseract 3 via svn,
> built last night with 'sudo make install' on my MacBook).
>
> I confirmed that TESSDATA_PREFIX points at this location (and made it
> point elsewhere just to check that tesseract reported an error).
>
> Having added:
> 1866
> Gold
> Albert
> Medal
> posthumously
> 1881
> when recognising a black and white, thresholded version of:
> http://www.flickr.com/photos/54145...@n00/4701399020/
> it still fails to recognise the above words (I added these after the
> first run of tesseract; they were the poorest-recognised words). There
> is no difference in the output file before/after adding these lines.
>
> Am I doing something silly?
>

Well, adding /those/ words shouldn't make much of a difference - they
should be in the normal dictionaries - the suggestion was intended more
for less common proper names.

Also, Tesseract uses an adaptive classifier, so concatenating multiple
images together (say, into a multi-page TIFF) should give much better
results than running on each page individually. It occurs to me now that
persisting the classifier's state would be useful in a variety of other
areas, such as business card scanning.

> Cheers,
> Ian.
>
> On 6 July 2010 10:09, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
>> Hi Jimmy. Thanks for the ideas.
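To make the DBpedia suggestion concrete, here's a rough sketch of
building such a geo query in Python. The endpoint URL and the
geo:lat/geo:long/foaf:isPrimaryTopicOf predicates are standard DBpedia
conventions, but verify them against the current dataset - treat this
as a starting point, not a tested query:

```python
from urllib.parse import urlencode  # used in the commented fetch below

DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"

def nearby_articles_query(lat, lon, box=0.05, limit=50):
    """SPARQL for Wikipedia pages geo-tagged within +/-box degrees."""
    return """
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?page WHERE {
  ?s geo:lat ?lat ; geo:long ?long ; foaf:isPrimaryTopicOf ?page .
  FILTER (?lat  > %f && ?lat  < %f &&
          ?long > %f && ?long < %f)
} LIMIT %d
""" % (lat - box, lat + box, lon - box, lon + box, limit)

# The query would then be fetched with urllib.request.urlopen:
#   url = DBPEDIA_ENDPOINT + "?" + urlencode(
#       {"query": nearby_articles_query(51.5, -0.12), "format": "json"})
```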
>>
>> The plaques do crop up everywhere - I got some (square stone ones) in
>> Barcelona last year. I want to focus on the blue English Heritage ones
>> first as they represent a large part of the corpus and are most common
>> in the UK.
>>
>> Re. the template phrases - absolutely. A lot of them appear to have
>> sensible phrase types along with good named entities. I like the idea
>> of adding geotags into the mix; I had been mulling the idea of
>> searching for the named entities in Freebase and perhaps voting on the
>> results.
>>
>> The Freebase/Wikipedia links are useful too as they'd provide extra
>> annotations for OpenPlaques in the final result.
>>
>> Re. 'crazy' - only crazy as in "the sysadmin's time is spent mostly
>> typing in plaques rather than working on ways of getting other
>> people/tools to enter the plaques in a more scalable fashion" :-)
>>
>> Re. statistical techniques - good idea, but that's outside my
>> experience. OCR is also recent for me; I only started playing with it
>> this year. Part of the reason for putting up prizes behind this
>> challenge is to see how people crack this particular nut. I'll see if
>> I can do some reading via the link, cheers.
>>
>> Re. dictionary - I'd overlooked that and was thinking of ways of using
>> the badly recognised text to vote on 'words that are more likely to be
>> good replacements'. I'll experiment with the dictionary using some of
>> the marked-up plaques.
>>
>> Much obliged,
>> Ian.
>>
>> On 5 July 2010 22:24, Jimmy O'Regan <[email protected]> wrote:
>>> On 5 July 2010 17:06, Ian Ozsvald (A.I. Cookbook) <[email protected]> wrote:
>>>> I'd like to announce an OCR challenge that will start soon; it is
>>>> for an open source project and will include prizes.
>>>>
>>>> I'm a part of the http://OpenPlaques.org/ project; the site collects
>>>> Flickr images of commemorative plaques which are manually
>>>> transcribed and added to the site.
>>>> Plaques generally have 20-100 words which describe a historic
>>>> situation - the words are clear, often in white on a blue
>>>> background. These plaques exist all over the UK (and in many other
>>>> countries). The goal of the project is to make these historic
>>>> locations easily searchable.
>>>>
>>>
>>> We have them in Ireland too. One aspect of them that you can use is
>>> that they tend to adhere to a set of 'template' phrases - "Person was
>>> born here", "site of the battle of X", etc.
>>>
>>>> Here's an example entry for Sir Winston Churchill near me:
>>>> http://www.openplaques.org/plaques/990
>>>>
>>>> The project founders *manually* transcribe the plaque photos at
>>>> present - this is a crazy situation as they have several thousand
>>>> plaques outstanding and more are added every day. The project is now
>>>> international (it started in the UK less than a year ago) and an
>>>> automatic transcription system is sorely needed.
>>>>
>>>
>>> Not so crazy: if you already have a corpus of existing
>>> transcriptions, that puts you in a position to use statistical
>>> post-editing techniques.
>>>
>>> There is a tool for statistical post-editing here:
>>> http://www.cs.toronto.edu/~mreimer/tesseract.html
>>> If you have the option, I'd recommend changing it to output a word
>>> lattice and feeding that into an n-gram language model: IRSTLM is a
>>> good open source toolset for n-gram language modelling.
>>>
>>>> As a part of my play-time projects I've set up an Artificial
>>>> Intelligence Cookbook site where I'm building a community of
>>>> like-minded folk who like solving interesting challenges. I've
>>>> already documented a work-in-progress report on a manual solution to
>>>> this problem using tesseract 2:
>>>> http://blog.aicookbook.com/2010/06/optical-character-recognition-webservice-work-in-progress/
>>>> and I've just posted a software outline in Python for (bad!)
>>>> automatic recognition:
>>>> http://blog.aicookbook.com/2010/07/automatic-plaque-transcription-using-python-work-in-progress/
>>>>
>>>> The OpenPlaques project is building a corpus of images with
>>>> transcriptions for me; once we have a good set of images I'll begin
>>>> the challenge. This should be in the next two weeks.
>>>>
>>>> You can see my demo code and a suggested solution here:
>>>> http://aicookbook.com/wiki/Automatic_plaque_transcription
>>>> and I'm *very* open to feedback in our Google Group:
>>>> http://groups.google.com/group/aicookbook
>>>>
>>>
>>> It very much looks like you're still brainstorming: if absolute
>>> accuracy is your goal (and processing time, etc., are not so much of
>>> an issue), one thing that would work for plaques is this: many newer
>>> cameras add geotags to the EXIF data. You can use the geotags to
>>> query DBpedia for a list of Wikipedia articles that pertain to a
>>> particular place, and extract a custom dictionary (and/or language
>>> model) for that place - if there's a plaque commemorating something,
>>> then it's quite likely that Wikipedia mentions it. Names in
>>> particular tend to be quite problematic for OCR; this way you can
>>> generate custom wordlists that have a higher likelihood of
>>> containing those names. You should get good enough results from
>>> Tesseract (or, indeed, any OCR system) by passing such a list as the
>>> user dictionary, but if you decide to use statistical language
>>> models you would also need those words to be part of the model, to
>>> avoid out-of-vocabulary errors. (IRSTLM supports interpolating
>>> numerous individual models, so that's not a problem.)
>>>
>>>> I'll run the competition for several months with a prize for the
>>>> best solution each month. Solutions get open sourced and sooner or
>>>> later a good automatic solution will be created which can start
>>>> automatically transcribing the OpenPlaques corpus of images.
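The custom-wordlist step can be prototyped in a few lines. This sketch
assumes you already have article text back from the geo query; it keeps
capitalised tokens (Tesseract's eng.user-words really is just one word
per line) and shows how the same list can snap a noisy OCR token to its
closest entry. The example text and the 0.8 cutoff are arbitrary:

```python
import difflib
import re

def proper_nouns(text):
    # capitalised tokens are a cheap stand-in for real NER
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]+", text)
    return sorted({t for t in tokens if t[0].isupper()})

def snap(token, wordlist, cutoff=0.8):
    # replace an OCR token with its closest wordlist entry, if any is
    # similar enough; otherwise leave it alone
    hits = difflib.get_close_matches(token, wordlist, n=1, cutoff=cutoff)
    return hits[0] if hits else token

# Stand-in for text fetched from Wikipedia via the DBpedia query:
article = ("Sir Winston Churchill, awarded the Nobel Prize, "
           "lived at Hyde Park Gate from 1945.")
words = proper_nouns(article)

# The list doubles as eng.user-words content (location depends on
# TESSDATA_PREFIX):
#   with open("/usr/local/share/tessdata/eng.user-words", "a") as f:
#       f.write("\n".join(words) + "\n")

print(snap("Churchili", words))  # snaps the OCR error to the real name
```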
>>>> Winners will also get their name listed on the OpenPlaques site.
>>>>
>>>> If you'd like to test your skills with OCR then you'll find a good
>>>> range of images to work on - from simple clean shots to angled,
>>>> dark, smudged images of weather-beaten plaques taken at a distance.
>>>>
>>>> Cheers,
>>>> Ian.
>>>>
>>>> --
>>>> Ian Ozsvald (A.I. researcher, screencaster)
>>>> [email protected]
>>>>
>>>> http://IanOzsvald.com
>>>> http://MorConsulting.com/
>>>> http://blog.AICookbook.com/
>>>> http://TheScreencastingHandbook.com
>>>> http://FivePoundApp.com/
>>>> http://twitter.com/IanOzsvald
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups
>>>> "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected].
>>>> To unsubscribe from this group, send email to
>>>> [email protected].
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>
>>>
>>
>

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

