[tesseract-ocr] General strategies for dealing with problem images

gl00637 Mon, 18 Mar 2019 10:59:22 -0700

I would like some advice on the general use of tesseract, because my
experience with it falls into two extremes. Either tesseract performs
flawlessly, with no prior modification of the image beyond cropping to
the text and (most significantly) enlarging the image by a factor of 2
or 4; or its output is riddled with errors.
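
The enlarge-then-recognise step described above can be sketched as follows.
This is only a minimal sketch, assuming ImageMagick's convert and the
tesseract command-line tool are both on the PATH. The helper names, the file
names, and the choice of the Lanczos filter are illustrative assumptions, not
settings taken from any particular recipe. Cropping is left out.

```python
import subprocess


def upscale_command(src, dst, factor):
    """Build an ImageMagick command that enlarges src by a whole-number factor."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    # -filter is a setting, so it has to come before the -resize operator.
    return ["convert", src, "-filter", "Lanczos",
            "-resize", f"{factor * 100}%", dst]


def ocr_command(image):
    """Build a tesseract command that prints the recognised text to stdout."""
    return ["tesseract", image, "stdout"]


def enlarge_and_ocr(src, dst, factor=2):
    """Run both steps and return tesseract's text (requires both tools installed)."""
    subprocess.run(upscale_command(src, dst, factor), check=True)
    result = subprocess.run(ocr_command(dst), check=True,
                            capture_output=True, text=True)
    return result.stdout
```

Calling enlarge_and_ocr("page.png", "page_x2.png", factor=2) and then again
with factor=4 would reproduce the two enlargements mentioned above, so the two
outputs can be compared side by side.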

Following the usual advice to improve image quality (Fred's textcleaner
script, or applying the ImageMagick functions it uses individually)
usually produces a significant improvement in the *human readability* of
the image. For tesseract, however, it rarely helps: at best there is no
improvement, and most often performance actually deteriorates.

So I am looking for another explanation of tesseract's difficulty with
certain images. I thought its performance might depend on its trying to
identify the particular font used, but
https://github.com/tesseract-ocr/docs/blob/master/tesseracticdar2007.pdf
seems to say not.

The only other possibility I can think of is that either the size or the
aspect ratio of the text in the image has been subtly deformed. If so, it
is not apparent to my eye, but tesseract is certainly very sensitive to
changes in size, because, when it works, resizing the image makes such a
dramatic improvement.
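
One way to rule out distortion introduced by my own preprocessing is to
compare the pixel dimensions of an image before and after a processing step.
The following is a minimal sketch: the function names and the 1% tolerance
are assumptions chosen for illustration. It only detects distortion between
two versions of the same image; it cannot tell whether the text was already
deformed in the source.

```python
def scale_factors(original_size, processed_size):
    """Return the (horizontal, vertical) scale factors between two (width, height) sizes."""
    orig_w, orig_h = original_size
    proc_w, proc_h = processed_size
    return proc_w / orig_w, proc_h / orig_h


def aspect_distortion(original_size, processed_size):
    """Return horizontal scale divided by vertical scale; 1.0 means uniform scaling."""
    sx, sy = scale_factors(original_size, processed_size)
    return sx / sy


def is_distorted(original_size, processed_size, tolerance=0.01):
    """True if the aspect ratio changed by more than the (assumed) tolerance."""
    return abs(aspect_distortion(original_size, processed_size) - 1.0) > tolerance
```

For example, is_distorted((100, 50), (200, 100)) reports a clean 2x
enlargement, while is_distorted((100, 50), (210, 100)) flags a 5% horizontal
stretch.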

Does anyone have other suggestions as to the nature of the problem? I'm not
asking for detailed advice here, which is why I've given no image samples,
but for general lines of attack, strategy rather than tactics. Thank you.

