Note: The links on the external add-ons page are broken due to
spurious asterisks.

This is a follow-up on my effort to improve Tesseract's accuracy using
language models:

After working with Tesseract a little, I noticed that many of the
errors look like they should be easy to correct with a dictionary.  I
played with larger dictionaries and different NON_WERD and
GARBAGE_STRING values, but that tends to work only at the expense of
overall accuracy.  So I set out to find a better way to use the
dictionary, and implemented it as a post-processor written in Python.
In short, it uses word frequency estimates and a Simple Good-Turing
Estimator (see Good-Turing frequency estimation without tears, WA Gale
& G Sampson, Journal of Quantitative Linguistics, 1995) to assign
reasonable probabilities to all words whether or not they're in the
dictionary.
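
To make the idea concrete, here is a minimal sketch of Simple Good-Turing
smoothing along the lines of Gale & Sampson.  It is not the code from my
post-processor: the function name simple_good_turing and its return
shape are made up for this illustration, and it assumes the input is a
plain mapping from word to count with at least two distinct count
values.

```python
import math
from collections import Counter


def simple_good_turing(counts):
    """Hypothetical helper: return (p0, probs).

    p0 is the total probability mass reserved for unseen words;
    probs maps each seen word to its smoothed probability.
    """
    total = sum(counts.values())
    freq_of_freq = Counter(counts.values())  # N_r: number of words seen r times
    rs = sorted(freq_of_freq)
    if len(rs) < 2:
        raise ValueError("need at least two distinct counts to fit the line")

    # Average each N_r over the gap to its neighbouring counts (the Z_r values).
    log_r, log_z = [], []
    for i, r in enumerate(rs):
        prev_r = rs[i - 1] if i > 0 else 0
        next_r = rs[i + 1] if i + 1 < len(rs) else 2 * r - prev_r
        log_r.append(math.log(r))
        log_z.append(math.log(2.0 * freq_of_freq[r] / (next_r - prev_r)))

    # Least-squares slope of log Z_r against log r.
    mean_x = sum(log_r) / len(log_r)
    mean_y = sum(log_z) / len(log_z)
    sxx = sum((x - mean_x) ** 2 for x in log_r)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_r, log_z))
    slope = sxy / sxx

    # Adjusted counts r*: use the raw Turing estimate while it differs
    # significantly from the smoothed one, then switch to the smoothed
    # estimate for all larger r.
    r_star = {}
    use_smoothed = False
    for r in rs:
        smoothed = r * (1.0 + 1.0 / r) ** (slope + 1.0)
        if not use_smoothed:
            n_r = freq_of_freq[r]
            n_r1 = freq_of_freq.get(r + 1, 0)
            if n_r1 == 0:
                use_smoothed = True
            else:
                turing = (r + 1) * n_r1 / n_r
                sd = math.sqrt((r + 1) ** 2 * n_r1 / n_r ** 2 * (1 + n_r1 / n_r))
                if abs(turing - smoothed) <= 1.96 * sd:
                    use_smoothed = True
                else:
                    r_star[r] = turing
                    continue
        r_star[r] = smoothed

    # Reserve N_1 / N for unseen words and renormalise the rest.
    p0 = freq_of_freq.get(1, 0) / total
    norm = sum(freq_of_freq[r] * r_star[r] for r in rs)
    probs = {w: (1.0 - p0) * r_star[c] / norm for w, c in counts.items()}
    return p0, probs
```

Note that p0 is the mass shared by all unseen words together; turning it
into a probability for one particular out-of-dictionary word needs a
further assumption about how many unseen words there are.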

The final result is a small but, I think, significant accuracy gain on
the UNLV test set.  Unmodified tess2.0 gives me 15733 word errors, and
my post-processor gets that down to 15078 (655 fewer, a relative
error reduction of about 4.2%).  The cost is
a lot of training and a very small increase in character errors. Two
mitigating factors are that running separately from Tesseract meant
not having access to the original character probabilities, and that
the post-processor was trained on British English as a matter of
convenience. So it ought to be possible to do better than this,
perhaps quite a bit better.

More details, code, and instructions for use are on my web site:
http://www.cs.toronto.edu/~mreimer/tesseract.html#postprocessor

I'm writing this mainly to bring my result to the attention of the
developers and get some feedback on if/how we can build something like
this into the Tesseract runtime.  You'll probably need more
information in order to comment on that, but I don't want to spam this
list with a lot of technical info that won't be of interest to most
readers, so if you're interested then perhaps you could send me a
personal e-mail.


> > > On Mon, Feb 2, 2009 at 3:04 PM, Michael Reimer <[email protected]> wrote:
>
> > > > Hello all.  I'm a computational linguistics graduate student, and I'd
> > > > like to do some work on Tesseract for credit in a software engineering
> > > > course.  My area of interest is word order modelling and I believe
> > > > this can and has been used to improve the accuracy of other OCR
> > > > systems.  As far as I can tell, Tesseract has nothing similar
> > > > currently, so I'm interested in adding it.  Any feedback on that idea
> > > > would be appreciated.
>
> > > > I'm also completely new to open source, so provided that my general
> > > > goal appeals, I would appreciate any and all advice on how best to get
> > > > involved with your community, get myself up to speed technically (I am
> > > > reading the "Hacking Tesseract" manual currently), avoid stepping on
> > > > toes, and so on.  Thanks.