Re: Image pre-processing for good OCR results

Dmitry Silaev Mon, 21 Feb 2011 23:12:31 -0800

Jon,

You will certainly need to implement most of the steps that Cong Nguyen
suggests. However, complications arise if you want the pre-processing
to be fully automatic. You are going to process real photographic
images, so fonts, backgrounds, lighting conditions, etc. will vary a
lot. That is why a "one-size-fits-all" method (particularly for ROI
detection and background removal) won't work. You will find that your
fixed pipeline works fine with the first and second images but fails
on the third.

There are two possible ways to solve this. If you still want to do it
automatically, you'll need to choose several algorithms for every
pipeline stage and implement logic that decides for each image, based
on some metric, which algorithm will work (or has worked) best. Or you
can give up the automatic approach and switch to manually selecting a
pre-processing scenario for each image, based on your experience.
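
A minimal sketch of the automatic variant, for one pipeline stage
(binarization) only. Everything here is an assumption made for
illustration: the image is a plain list of rows of grayscale values,
the two thresholding functions are toy candidates, and the "ink ratio"
metric (how close the share of dark pixels is to a guessed text
density of 15%) is just a placeholder for whatever metric turns out to
work on real headstone photos.

```python
from statistics import mean

# An "image" here is a list of rows of grayscale values (0-255).
# Both candidate algorithms and the metric are hypothetical examples.

def threshold_fixed(img, t=128):
    """Binarize with a fixed global threshold."""
    return [[255 if p >= t else 0 for p in row] for row in img]

def threshold_mean(img):
    """Binarize with the image's own mean intensity as the threshold."""
    t = mean(p for row in img for p in row)
    return [[255 if p >= t else 0 for p in row] for row in img]

def ink_ratio_score(binary, target=0.15):
    """Assumed metric: higher is better when the share of dark pixels
    is close to a guessed typical text density."""
    pixels = [p for row in binary for p in row]
    ink = sum(1 for p in pixels if p == 0) / len(pixels)
    return -abs(ink - target)

def best_binarization(img, candidates, score=ink_ratio_score):
    """Run every candidate on the image and keep the best-scoring result."""
    scored = []
    for name, func in candidates.items():
        result = func(img)
        scored.append((score(result), name, result))
    best_score, best_name, best_result = max(scored, key=lambda s: s[0])
    return best_name, best_result

CANDIDATES = {"fixed": threshold_fixed, "mean": threshold_mean}

# A dark photo: the fixed threshold turns everything black, while the
# mean threshold keeps only the darkest pixel as "ink".
dark_photo = [[40, 40, 40, 10],
              [40, 40, 40, 40]]
chosen_name, chosen_image = best_binarization(dark_photo, CANDIDATES)
```

The same pattern (a dictionary of candidates plus a scoring function)
can be repeated for each stage; the hard part is finding a metric that
actually correlates with OCR quality.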

The next complication is getting results from Tesseract. Since the
quality of text in photographic images is really low, you usually
can't rely on Tesseract's top-choice recognition results to represent
the actual text. IMHO the best approach here is to get all of
Tesseract's choices for every character and then resolve the
uncertainty using a language model (bigram and trigram statistics).
This is the best you can do, because a dictionary won't help you much,
at least for last names.
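
To illustrate the idea, here is a rough sketch that combines
per-character alternatives with bigram statistics using a Viterbi
search. It does not call Tesseract: the "lattice" of alternatives and
their probabilities is hand-written, the bigram table is a made-up
fragment, and the names decode, lattice, bigram_logp and floor are all
assumptions. A real version would fill the lattice from Tesseract's
character choices and estimate the bigrams from a corpus of names.

```python
import math

def decode(lattice, bigram_logp, floor=-10.0):
    """Pick the most likely string from per-character alternatives.

    lattice:     list of positions, each a list of (char, probability)
    bigram_logp: dict mapping (prev_char, char) to a log-probability
    floor:       log-probability assumed for unseen bigrams
    """
    # best[c] = (score, text) of the best path that ends in character c
    best = {c: (math.log(p), c) for c, p in lattice[0]}
    for alternatives in lattice[1:]:
        nxt = {}
        for c, p in alternatives:
            # Best predecessor for c, judged by path score plus bigram score
            score, text = max(
                (s + bigram_logp.get((prev, c), floor), t)
                for prev, (s, t) in best.items()
            )
            nxt[c] = (score + math.log(p), text + c)
        best = nxt
    return max(best.values())[1]

# Hand-written alternatives: the top choices alone would read "5M1TH".
lattice = [
    [("5", 0.6), ("S", 0.4)],
    [("M", 1.0)],
    [("1", 0.55), ("I", 0.45)],
    [("T", 1.0)],
    [("H", 1.0)],
]

# Made-up bigram log-probabilities; anything missing gets the floor.
bigram_logp = {
    ("S", "M"): math.log(0.05),
    ("M", "I"): math.log(0.10),
    ("I", "T"): math.log(0.10),
    ("T", "H"): math.log(0.20),
}

top_choice = "".join(max(alts, key=lambda a: a[1])[0] for alts in lattice)
decoded = decode(lattice, bigram_logp)
```

Here the language model overrides the slightly more confident but
implausible "5" and "1". Extending this to trigrams means keeping the
last two characters in the search state instead of one.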

And then you'll have to locate the names within the recognition
results. The first problem here is that there can be several of them
per headstone. The second is that Tesseract will try to recognize
everything it sees in the image as text, including noise left over
from pre-processing. So this task can also pose some difficulties. But
this seems to be mainly a question of engineering, not of research...
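
As a starting point for that engineering, a first-pass noise filter
might look like the sketch below. It assumes the recognition results
are already available as (text, confidence) pairs with confidence on a
0-100 scale; the function name, the thresholds and the sample data are
all invented for illustration, and the "letters only" rule is a crude
guess that will need tuning.

```python
import re

# Letters, optionally with an inner apostrophe or hyphen (O'Neil, Smith-Jones)
NAME_LIKE = re.compile(r"[A-Za-z][A-Za-z'\-]*")

def plausible_name_tokens(words, min_conf=60.0, min_len=2):
    """Keep tokens that look like name parts and drop likely noise.

    words: iterable of (text, confidence) pairs, confidence in 0-100
    """
    kept = []
    for text, conf in words:
        cleaned = text.strip(".,;:'\"-")  # drop stray edge punctuation
        if conf < min_conf or len(cleaned) < min_len:
            continue
        if NAME_LIKE.fullmatch(cleaned):
            kept.append(cleaned)
    return kept

# Invented sample of recognition output, including pre-processing noise
sample = [
    ("JOHN", 91.0),
    ("~|", 35.0),
    ("ANDERSEN,", 84.5),
    ("1902", 88.0),
    ("il.", 20.0),
]
names = plausible_name_tokens(sample)
```

Dates and epitaph words would still need separate handling, and
several names on one headstone would have to be grouped by position in
the image.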

To conclude, it all depends on how serious you are about investing
your time and effort in your project ))

HTH

Warm regards,
Dmitry Silaev

On Mon, Feb 21, 2011 at 6:45 PM, Jon Andersen <[email protected]> wrote:
> Whoops, sorry - links were broken for a bit.  I just fixed the image links,
> they should work now.
> Thanks!!
> -Jon
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>