Hi,

last week I did a lot of experimentation with binarization in my preprocessing. If you feed Tesseract unbinarized data, it uses Otsu thresholding. Although the documentation states that it uses adaptive (local) Otsu, I cannot find that in the code (but I am not an expert in C). And if you apply _global_ Otsu outside of Tesseract to input that is difficult because of low, varying contrast, and then feed the result to Tesseract, the output matches what I expected. So I guess Tesseract 4 defaults to global Otsu, which is really a bad idea given that local binarization should give better results in almost every case.
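To illustrate why a single global threshold struggles with varying contrast, here is a minimal pure-Python sketch of Otsu's method. The toy `row` of pixels and the per-tile variant are my own assumptions for illustration; this is not Tesseract's or Leptonica's actual code.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold (0-255) for a sequence of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_between = 0, -1.0
    w_dark = 0    # number of pixels with value <= t
    sum_dark = 0  # sum of their gray values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


def binarize(pixels, t):
    """Map values <= t to 0 (ink) and everything else to 255 (background)."""
    return [0 if p <= t else 255 for p in pixels]


# Hypothetical scan line: bright half (background 200, text 120)
# followed by a dim half (background 100, text 40).
row = [200] * 6 + [120] * 2 + [100] * 6 + [40] * 2

t_global = otsu_threshold(row)
global_result = binarize(row, t_global)

# Crude "local" variant: run Otsu separately on each half (tile).
left, right = row[:8], row[8:]
tiled_result = (binarize(left, otsu_threshold(left))
                + binarize(right, otsu_threshold(right)))

print("global threshold:", t_global)
print("global:", global_result)
print("tiled: ", tiled_result)
```

With the global threshold, the whole dim half falls below the cut, so its text merges with its background. Thresholding each half separately keeps both text runs apart from their backgrounds.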
So I used Fiji/ImageJ to check other binarization methods. The one that gave the best OCR results with my data was the Phansalkar method. I advise you to test different binarizations with your data, feed each result into Tesseract 4, and compare the output. Fiji plugins are quite simple to modify, so you can even test your own binarization method. I tried a variation of Phansalkar that modifies the k variable depending on the local mean, to check whether I was "on a line of text"; so far it has given pretty good results in very problematic regions with low contrast between text and background.

P.S.: Despeckle and skew-correction filters might also help a lot; just play with them on a "typical" set of your data.

On Fri, Feb 1, 2019 at 14:06, Lorenzo Bolzani <[email protected]> wrote:

> Yes, old OCR solutions use binarized content but I see this as a legacy
> limitation. It was probably done to speed up the processing and also, I
> suppose, because the algorithms used would not benefit from the extra gray
> details anyway. Old OCR tech was also print-oriented, so the text was
> already near binary.
>
> With a neural network there is no extra time cost in processing grayscale
> or binary text; they are just float values in both cases. Binarization
> throws away a lot of data, especially with noisy images, complex
> backgrounds, etc. (ID documents, smartphone pictures, etc.).
>
> Binarization may improve OCR performance, but I doubt it: a CNN should
> easily be able to learn to binarize the image itself if this can improve
> the results.
>
> The only advantage I see for binarization is that synthetic training data
> is binary, so you try to match the real input data to the data used for
> training. Of course you could corrupt this data to make it grayscale.
>
> But I would expect full grayscale training and prediction to give
> slightly better results, especially for complex cases.
>
> I fine-tuned my models using grayscale data (real-world crops, not
> synthetic) and, if possible, I'd like to try to disable the binarization
> step to see if I get an improvement. Maybe there are some parameters
> controlling this step.
>
> Thanks
>
> Lorenzo
>
> On Thu, Jan 31, 2019 at 20:42, Zdenko Podobny <[email protected]> wrote:
>
>> see inline comments.
>>
>> On Wed, Jan 30, 2019 at 15:17, Lorenzo Bolzani <[email protected]> wrote:
>>
>>> I suppose this means that the image is always binarized, is this correct?
>>
>> Yes
>>
>>> Is there any way to avoid it?
>>
>> Why? IMO OCR engines run on binarized images, see e.g.
>> https://www.abbyy.com/en-eu/ocr-sdk/key-features/image-processing/
>>
>>> Does this binarization happen by default during training too?
>>
>> I do not know. I did not have time to play with training in v 4.0
>>
>>> I fine-tuned a few models using grayscale images. Do you think the
>>> neural network received binary black/white pixels or the gray ones?
>>
>> I do not know.
>>
>>> Thanks, bye
>>>
>>> Lorenzo
>>>
>>> On Wed, Jan 30, 2019 at 13:28, Zdenko Podobny <[email protected]> wrote:
>>>
>>>> try:
>>>> tesseract image - get.images
>>>> which calls GetThresholdedImage()
>>>> <https://github.com/tesseract-ocr/tesseract/blob/12c1abcb6b4ef90cfafe316a3b40753ee5e9b9ef/src/api/baseapi.cpp#L638>
>>>>
>>>> Zdenko
>>>>
>>>> On Wed, Jan 30, 2019 at 11:17, Lorenzo Bolzani <[email protected]> wrote:
>>>>
>>>>> Zdenko, are you 100% sure that the image is binarized before being fed
>>>>> to the neural network? It looks like a big waste of information to me.
>>>>>
>>>>> On Wed, Jan 30, 2019 at 07:56, Zdenko Podobny <[email protected]> wrote:
>>>>>
>>>>>> That is not true: you do not need to transform the image to grayscale.
>>>>>> Any image is binarized in the end (if the input image is not already
>>>>>> binarized) by tesseract (Otsu).
>>>>>>
>>>>>> BUT: preprocessing the image (e.g. custom binarization) will help. See
>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>> On Wed, Jan 30, 2019 at 7:36, <[email protected]> wrote:
>>>>>>
>>>>>>> Not a solution, but the image needs to be transformed into
>>>>>>> grayscale etc. (using OpenCV), since OCR works best with grayed
>>>>>>> images and images which have a resolution of 300 dpi
>>>>>>>
>>>>>>> On Tuesday, June 12, 2018 at 12:44:21 PM UTC+5:30, Vidur Malhotra
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I tried running tesseract on the attached image, but I am not
>>>>>>>> getting the desired output. My sample code:
>>>>>>>>
>>>>>>>> import PIL
>>>>>>>> from PIL import Image
>>>>>>>> import pytesseract
>>>>>>>>
>>>>>>>> text = pytesseract.image_to_string(Image.open('test3.jpg'),
>>>>>>>> lang='eng')
>>>>>>>> print(text)
>>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/43a5a754-4227-43b6-aec1-0261403b2029%40googlegroups.com
>>>>>>> For more options, visit https://groups.google.com/d/optout.
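For anyone who wants to try the Phansalkar method mentioned at the top of this thread without Fiji, here is a minimal pure-Python sketch. It uses the commonly cited formula T = mean * (1 + p * exp(-q * mean) + k * (std / r - 1)) on pixel values scaled to [0, 1], with the usual defaults k = 0.25, r = 0.5, p = 2, q = 10. The square window, the clipping at the image border, and the toy `page` are my own assumptions, not the Fiji plugin's exact behaviour.

```python
import math


def phansalkar(image, radius=2, k=0.25, r=0.5, p=2.0, q=10.0):
    """Binarize an 8-bit grayscale image given as a list of rows.

    Returns a new image with 0 for ink and 255 for background.
    """
    h, w = len(image), len(image[0])
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Square window around (x, y), clipped at the image border
            # (an assumption; other border policies are possible).
            window = [image[j][i] / 255.0
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mean = sum(window) / len(window)
            std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
            # Local threshold: the exp() term raises it in dark regions.
            t = mean * (1 + p * math.exp(-q * mean) + k * (std / r - 1))
            if image[y][x] / 255.0 <= t:
                out[y][x] = 0
    return out


# Hypothetical page: bright half (background 200, stroke 120 in column 2)
# next to a dim half (background 100, stroke 40 in column 7).
page = [[200, 200, 120, 200, 200, 100, 100, 40, 100, 100] for _ in range(5)]
result = phansalkar(page)

for row in result:
    print(row)
```

On this toy page both strokes come out as ink even though the dim half's background is darker than the bright half's text. Background pixels right at the artificial brightness step can still be misclassified, so treat the parameters (and `radius`) as things to tune on your own data.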

