Re: [tesseract-ocr] General strategies for dealing with problem images

Jonathan Muller Mon, 18 Mar 2019 22:03:26 -0700

I don't really agree with your statement. There were a lot of things we had
to consider in image processing before tesseract finally gave us accurate
results. But it all makes sense. Here is our actual pipeline:


 1 - Clean up the image: remove any artifacts from the camera or scanning
device, crop to the paper accurately, remove noise, binarize
 2 - Deskew the image: make the text lines as horizontal as possible
 3 - Cut out the zone of interest: extract the text zones of interest from
the document, using a DNN to recognize the zones
 4 - Clean the text zone: remove any irrelevant parts of the image (like
lines, tables, stamps)
 5 - Create a whitelist of probable characters based on the zone (this one
improves accuracy a lot!)
 6 - Submit to tesseract with appropriate settings for the language
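
To illustrate the "binarize" part of step 1, here is a minimal sketch using Otsu's method in pure Python. This is only an assumption about how binarization could be done, not necessarily what our pipeline uses; the function names `otsu_threshold` and `binarize` are made up for the example, and a real pipeline would more likely call an image library.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 8-bit gray values (0-255)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the gray values of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels in the dark class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold=None):
    """image is a list of rows of gray values; returns rows of 0 (ink) or 255 (paper)."""
    if threshold is None:
        threshold = otsu_threshold(p for row in image for p in row)
    return [[0 if p <= threshold else 255 for p in row] for row in image]


if __name__ == "__main__":
    sample = [[30, 40, 200], [210, 35, 220]]  # tiny made-up grayscale image
    for row in binarize(sample):
        print(row)
```

A single global threshold like this assumes fairly even lighting; photos with shadows usually need a local (adaptive) threshold instead.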

1: it is understandable that noise or poor image quality can affect recognition
2: tesseract expects lines of text to be straight
3: this reduces the processing time and allows us to focus on the zone for
further cleaning (next steps) or custom parameters before submitting
4: lines, tables, and other things can alter recognition, because a piece
of a line is sometimes recognised as |, -, _, l, or 1. It can also affect
nearby characters, especially when working with Chinese-based characters
5: whitelisting based on the content helps recognition a lot. A simple
example: if you are looking for numbers, whitelist "1234567890", because 0
is close to O. Even humans make that mistake, which is why we banned O from
Wifi passwords :laugh:
6: tesseract settings can improve recognition a lot when working
with non-English scripts or when the image is not perfect (tesseract works
best at 300 dpi)
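
As a rough sketch of steps 5 and 6, the snippet below builds a tesseract command line with a per-zone whitelist. The zone names, the `ZONE_WHITELISTS` table and the `build_tesseract_command` helper are hypothetical, invented for this example; the flags `-l`, `--psm`, `--dpi` and `-c tessedit_char_whitelist=...` are tesseract's own, but whether the whitelist takes effect depends on the version and engine (some 4.0 builds with the LSTM engine reportedly ignore it), so check it on your setup.

```python
import shlex

# Hypothetical mapping from a zone type to the characters expected in it.
ZONE_WHITELISTS = {
    "digits": "0123456789",
    "amount": "0123456789.,",
}


def build_tesseract_command(image_path, zone_type=None, lang="eng", psm=6, dpi=300):
    """Return the argument list for one tesseract run on a cropped text zone."""
    cmd = [
        "tesseract", image_path, "stdout",  # write the recognized text to stdout
        "-l", lang,                         # language / script model
        "--psm", str(psm),                  # page segmentation mode
        "--dpi", str(dpi),                  # resolution hint for the input image
    ]
    whitelist = ZONE_WHITELISTS.get(zone_type)
    if whitelist:
        # Restrict the output to the characters expected in this zone.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


if __name__ == "__main__":
    cmd = build_tesseract_command("zone.png", zone_type="digits")
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To actually run it (requires tesseract to be installed):
    #   import subprocess
    #   text = subprocess.run(cmd, capture_output=True, text=True).stdout
```

Keeping the whitelist per zone rather than global is the point: a date field, an amount and a free-text field each get their own character set.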

We went from 10% accuracy to nearly 95% now. Each image is different and
each may require different processing or parameters. Making a solution
that fits all is very complex, but I still think it is possible if the
application is specific enough. I guess that is why it is not included in
tesseract. Making it work very well for a specific use-case would break
others.

I guess you just have to find the right pre-processing for your kind of
image.

Hope it helps

On Mon, 18 Mar 2019 at 18:59, <[email protected]> wrote:

> I would like some advice concerning the general use of tesseract, because
> my experience with it tends to two extremes: either tesseract performs
> flawlessly, with no prior modification of the image necessary except
> cropping to the text and (most significant) enlarging the image by a factor
> of 2 or 4; or tesseract's output is riddled with errors.
>
> Following advice to improve the quality of the image (Fred's textcleaner
> script, or applying the Imagemagick functions it uses individually),
> usually produces significant improvement in *human readability* of the
> image, but as regards tesseract they usually produce no improvement, and
> most often actual deterioration in its performance.
>
> So I am looking for another reason to explain tesseract's difficulty with
> certain images. I thought perhaps its performance may be dependent on its
> trying to identify the particular font used, but
> https://github.com/tesseract-ocr/docs/blob/master/tesseracticdar2007.pdf
> seems to say not.
>
> The only other possibility I can think of is either the size or the aspect
> ratio of the text in the image has been subtly deformed. If so, it is not
> apparent to my eye, but certainly tesseract is very sensitive to size
> change, because, when it works, resizing the image makes such a dramatic
> improvement.
>
> Does anyone have other suggestions as to the nature of the problem? I'm
> not asking for detailed advice here, which is why I've given no image
> samples, but for general lines of attack, strategy rather than tactics.
> Thank you.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/15dcee7c-0815-47c3-9c74-29f8e90a7ca2%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/15dcee7c-0815-47c3-9c74-29f8e90a7ca2%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>


-- 
Jonathan
06.49.32.74.55

