Note: The links on the external add-ons page are broken due to spurious asterisks.
This is a follow-up on my effort to improve Tesseract's accuracy using language models.

After working with Tesseract a little, I noticed that many of the errors look like they should be easy to correct with a dictionary. I played with larger dictionaries and different NON_WERD and GARBAGE_STRING values, but that tends to work only at the expense of overall accuracy. So I set out to find a better way to use the dictionary, and implemented it as a post-processor written in Python. In short, it uses word frequency estimates and a Simple Good-Turing estimator (see "Good-Turing frequency estimation without tears", W. A. Gale & G. Sampson, Journal of Quantitative Linguistics, 1995) to assign reasonable probabilities to all words, whether or not they're in the dictionary.

The final result is a small but, I think, significant accuracy gain on the UNLV test set. Unmodified tess2.0 gives me 15733 word errors, and my post-processor gets that down to 15078 (655 fewer, a relative reduction of about 4.2%). The cost is a lot of training and a very small increase in character errors. Two mitigating factors are that running separately from Tesseract meant not having access to the original character probabilities, and that the post-processor was trained on British English as a matter of convenience. So it ought to be possible to do better than this, perhaps quite a bit better.

More details, code, and instructions for use are on my web site: http://www.cs.toronto.edu/~mreimer/tesseract.html#postprocessor

I'm writing this mainly to bring my result to the attention of the developers and to get some feedback on whether and how we can build something like this into the Tesseract runtime. You'll probably need more information in order to comment on that, but I don't want to spam this list with a lot of technical info that won't be of interest to most readers, so if you're interested, perhaps you could send me a personal e-mail.
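For readers who haven't met Good-Turing estimation, here is a rough sketch of the idea of giving every word a probability, seen or unseen. It is not the code from the post-processor: the function name and the toy counts are made up for illustration, and it is simplified from Gale & Sampson's procedure in that it always uses the log-linear smoothed estimate and skips their rule for switching between raw Turing estimates and smoothed ones.

```python
import math
from collections import Counter


def simple_good_turing(counts):
    """Return (p_unseen, probs) for a dict mapping word -> observed count.

    Simplified sketch: every adjusted count comes from the log-linear fit;
    the Turing/smoothed switching rule of the full method is omitted.
    """
    total = sum(counts.values())
    freq_of_freq = Counter(counts.values())  # N_r: number of words seen r times
    rs = sorted(freq_of_freq)
    if len(rs) < 2:
        raise ValueError("need at least two distinct counts to fit a line")

    # Z_r spreads each N_r over the gap to the neighbouring non-zero counts.
    zs = []
    for i, r in enumerate(rs):
        q = rs[i - 1] if i > 0 else 0
        t = rs[i + 1] if i + 1 < len(rs) else 2 * r - q
        zs.append(freq_of_freq[r] / (0.5 * (t - q)))

    # Least-squares fit of log Z_r = a + b * log r.
    xs = [math.log(r) for r in rs]
    ys = [math.log(z) for z in zs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x

    def smoothed(r):
        return math.exp(a + b * math.log(r))

    # Probability mass reserved for all unseen words together: N_1 / N.
    p_unseen = freq_of_freq.get(1, 0) / total

    # Adjusted counts r* = (r + 1) * S(r + 1) / S(r), renormalised so the
    # seen words share the remaining 1 - p_unseen.
    r_star = {r: (r + 1) * smoothed(r + 1) / smoothed(r) for r in rs}
    norm = sum(freq_of_freq[r] * r_star[r] for r in rs)
    probs = {w: (1 - p_unseen) * r_star[c] / norm for w, c in counts.items()}
    return p_unseen, probs


# Toy counts, invented for this example.
counts = {"the": 5, "of": 3, "and": 3, "cat": 2, "dog": 2, "bird": 2,
          "tesseract": 1, "ocr": 1, "page": 1, "scan": 1}
p_unseen, probs = simple_good_turing(counts)
```

A post-processor could then compare candidate corrections for a suspicious word by these probabilities, with p_unseen shared among the candidates that aren't in the dictionary. Gale & Sampson note that the fitted slope b should come out below -1 for the method to behave sensibly, so real training data would need that check.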
> > > On Mon, Feb 2, 2009 at 3:04 PM, Michael Reimer <[email protected]> wrote:
> > > > Hello all. I'm a computational linguistics graduate student, and I'd
> > > > like to do some work on Tesseract for credit in a software engineering
> > > > course. My area of interest is word order modelling and I believe
> > > > this can and has been used to improve the accuracy of other OCR
> > > > systems. As far as I can tell, Tesseract has nothing similar
> > > > currently, so I'm interested in adding it. Any feedback on that idea
> > > > would be appreciated.
> > > >
> > > > I'm also completely new to open source, so provided that my general
> > > > goal appeals, I would appreciate any and all advice on how best to get
> > > > involved with your community, get myself up to speed technically (I am
> > > > reading the "Hacking Tesseract" manual currently), avoid stepping on
> > > > toes, and so on. Thanks.

