I don't really agree with your statement. There are a lot of things we had to consider in image processing before tesseract finally gave us accurate results, but it all makes sense. Here is our actual pipeline:

1 - Clean up the image: remove any artifacts of the camera or scanning device, crop the paper accurately, remove noise, binarize
2 - Deskew the image: make the text lines as horizontal as possible
3 - Cut out the zone of interest: extract the text zone of interest from the document, using a DNN to recognize the zones
4 - Clean the text zone: remove any irrelevant parts of the image (like lines, tables, stamps)
5 - Create a whitelist of probable characters based on the zone (this one improves accuracy a lot!)
6 - Submit to tesseract with appropriate settings for the language

1: it is understandable how noise or image quality could affect recognition
2: tesseract expects lines of text to be straight
3: this reduces the processing time and allows us to focus on the zone for further cleaning (next steps) or custom parameters before submitting
4: lines, tables, and other things can alter recognition, because a piece of a line is sometimes recognised as |, -, _, l, `1`. It can also affect nearby characters, especially when working with Chinese-based characters
5: whitelisting based on the content helps recognition a lot. A simple example: if you are looking for numbers, whitelist "1234567890", because 0 is close to O. Even humans make that mistake, that's why we banned O from Wifi passwords :laugh:
6: tesseract's settings can improve recognition a lot when working with non-English scripts or when the image is not perfect (tesseract works best at 300 dpi)

We went from 10% accuracy to nearly 95% now. Each image is different and each may require different processing or parameters. Making one solution that fits all is very complex, but I still think it is possible if the application is specific enough. I guess that is why it is not included in tesseract: making it work very well for a specific use-case would break others.

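To make steps 5 and 6 a bit more concrete, here is a minimal Python sketch of how a per-zone whitelist could be turned into a tesseract command line. It is not our production code: the `ZONE_WHITELISTS` table, the zone names and the `build_tesseract_command` helper are hypothetical names made up for this example. The `-l`, `--psm`, `--dpi` and `-c tessedit_char_whitelist=...` options are tesseract command-line options, but whether the whitelist is honoured depends on the tesseract version and OCR engine you run, so check that against your own install.

```python
# Hypothetical mapping from zone type to the characters expected in that zone.
ZONE_WHITELISTS = {
    "digits": "1234567890",
    "date": "1234567890/-.",
}


def build_tesseract_command(image_path, zone_type=None, lang="eng", dpi=300, psm=6):
    """Build the argument list for one tesseract CLI call on a cropped text zone.

    Assumptions for this sketch: the zone was already cleaned, deskewed and
    cropped (steps 1 to 4), and psm 6 (a single uniform block of text) suits it.
    """
    cmd = [
        "tesseract", image_path, "stdout",  # "stdout" prints the text instead of writing a file
        "-l", lang,
        "--dpi", str(dpi),
        "--psm", str(psm),
    ]
    whitelist = ZONE_WHITELISTS.get(zone_type)
    if whitelist:
        # Restrict recognition to the characters we expect in this zone.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_tesseract_command("zone.png", zone_type="digits")))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` to get the recognised text. Keeping the whitelist in a table per zone type makes it easy to tune one kind of zone without touching the others.
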
I guess you just have to find the right pre-processing for your kind of image.

Hope it helps

On Mon, 18 Mar 2019 at 18:59, <[email protected]> wrote:

> I would like some advice concerning the general use of tesseract, because
> my experience with it tends to two extremes: either tesseract performs
> flawlessly, with no prior modification of the image necessary except
> cropping to the text and (most significant) enlarging the image by a factor
> of 2 or 4; or tesseract's output is riddled with errors.
>
> Following advice to improve the quality of the image (Fred's textcleaner
> script, or applying the Imagemagick functions it uses individually),
> usually produces significant improvement in *human readability* of the
> image, but as regards tesseract they usually produce no improvement, and
> most often actual deterioration in its performance.
>
> So I am looking for another reason to explain tesseract's difficulty with
> certain images. I thought perhaps its performance may be dependent on its
> trying to identify the particular font used, but
> https://github.com/tesseract-ocr/docs/blob/master/tesseracticdar2007.pdf
> seems to say not.
>
> The only other possibility I can think of is either the size or the aspect
> ratio of the text in the image has been subtly deformed. If so, it is not
> apparent to my eye, but certainly tesseract is very sensitive to size
> change, because, when it works, resizing the image makes such a dramatic
> improvement.
>
> Does anyone have other suggestions as to the nature of the problem? I'm
> not asking for detailed advice here, which is why I've given no image
> samples, but for general lines of attack, strategy rather than tactics.
> Thank you.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/15dcee7c-0815-47c3-9c74-29f8e90a7ca2%40googlegroups.com
> For more options, visit https://groups.google.com/d/optout.

--
Jonathan

