Hi,

Last week I experimented a lot with binarization in my preprocessing. If
you feed Tesseract unbinarized data, it applies Otsu thresholding. The
documentation states that it uses adaptive (local) Otsu, but I cannot find
that in the code (though I am no expert in C). And if you run _global_ Otsu
outside of Tesseract on input that is difficult because of low, varying
contrast, and feed the result to Tesseract, the output is just what I would
expect if Tesseract were doing global Otsu internally. So I guess
Tesseract 4 defaults to global Otsu, which really is a bad idea, given that
local binarization should give better results in almost every case.
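
To illustrate why a single global threshold struggles with varying contrast,
here is a minimal sketch in plain Python (no dependencies). It is not
Tesseract's code. The function names and the toy image are made up for this
example, and the "tiled" variant is only the crudest possible local scheme,
chosen because it is short.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark = sum_dark = 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_bright = total - w_dark
        if w_dark == 0:
            continue
        if w_bright == 0:
            break
        # Between-class variance (up to a constant factor) for a split at t.
        diff = sum_dark / w_dark - (sum_all - sum_dark) / w_bright
        var = w_dark * w_bright * diff * diff
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(rows, t):
    """Mark pixels at or below the threshold as text (1), others as 0."""
    return [[1 if p <= t else 0 for p in row] for row in rows]


def binarize_global(rows):
    """One Otsu threshold for the whole image."""
    return binarize(rows, otsu_threshold(p for row in rows for p in row))


def binarize_tiled(rows, tile_w):
    """Hypothetical local variant: a separate Otsu threshold per vertical tile."""
    out = [[] for _ in rows]
    for x0 in range(0, len(rows[0]), tile_w):
        tile = [row[x0:x0 + tile_w] for row in rows]
        t = otsu_threshold(p for row in tile for p in row)
        for out_row, tile_row in zip(out, binarize(tile, t)):
            out_row.extend(tile_row)
    return out


# Toy image: bright left half (background 200, text 120) and
# dark right half (background 100, text 40).
ROW = [200, 120, 200, 200, 100, 100, 40, 100]
IMAGE = [ROW[:] for _ in range(4)]
TRUTH = [[0, 1, 0, 0, 0, 0, 1, 0] for _ in range(4)]
```

On this toy image no single threshold can work: the text in the bright half
(120) is lighter than the background in the dark half (100).
binarize_global(IMAGE) turns the whole right half into "text", while
binarize_tiled(IMAGE, 4) reproduces TRUTH. For real images you would use a
proper local method from OpenCV, scikit-image or Fiji rather than fixed tiles.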

So I used Fiji/ImageJ to check other binarization methods. The one that
gave the best OCR results on my data was the Phansalkar method. I advise
you to test different binarizations on your own data, feed each result into
Tesseract 4, and compare the output. Fiji plugins are quite simple to
modify, so you can even test your own binarization method. I tried a
variation of Phansalkar that modifies the k parameter depending on the local
mean, to check whether I was "on a line of text", so far with pretty good
results in very problematic regions with low contrast between text and
background.
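
For reference, a minimal sketch of a Phansalkar-style threshold in plain
Python. The formula and the default values for k, r, p and q follow the
description in the ImageJ Auto Local Threshold documentation as far as I
know it. The square window (ImageJ uses a circular one), the tiny radius and
the function name are assumptions made to keep the example short, and my own
k variation is not included.

```python
import math


def phansalkar(rows, radius=1, k=0.25, r=0.5, p=2.0, q=10.0):
    """Binarize 8-bit gray rows; returns 1 for dark (text) pixels, else 0.

    Each pixel is compared against a threshold computed from the mean and
    standard deviation of its neighbourhood, with values scaled to [0, 1]:
        t = mean * (1 + p * exp(-q * mean) + k * (std / r - 1))
    """
    h, w = len(rows), len(rows[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Square window, clipped at the image border (an assumption).
            window = [rows[j][i] / 255.0
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mean = sum(window) / len(window)
            var = sum((v - mean) ** 2 for v in window) / len(window)
            std = math.sqrt(var)
            t = mean * (1 + p * math.exp(-q * mean) + k * (std / r - 1))
            out[y][x] = 1 if rows[y][x] / 255.0 < t else 0
    return out


# Toy low-contrast image: dark background (60) with a slightly darker
# one-pixel-wide vertical stroke (40) in the middle column.
IMAGE = [[60, 60, 40, 60, 60] for _ in range(5)]
TRUTH = [[0, 0, 1, 0, 0] for _ in range(5)]
```

On this toy input phansalkar(IMAGE) returns TRUTH, i.e. the stroke is
separated from the background although both are dark and only 20 gray levels
apart. On real scans you would use a much larger radius, and k is the
parameter to experiment with first.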

P.S.: Despeckle and skew-correction filters might also help a lot; just
play with them on a "typical" set of your data.
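
As a rough idea of what a despeckle step does: ImageJ's Despeckle command is
a 3x3 median filter. The sketch below is a plain Python version under that
assumption. The function name and the border handling (clipped windows,
upper median) are my own choices for the example.

```python
def despeckle(rows):
    """Replace each pixel with the median of its 3x3 neighbourhood."""
    h, w = len(rows), len(rows[0])
    out = []
    for y in range(h):
        out_row = []
        for x in range(w):
            # Neighbourhood clipped at the image border.
            window = sorted(rows[j][i]
                            for j in range(max(0, y - 1), min(h, y + 2))
                            for i in range(max(0, x - 1), min(w, x + 2)))
            # Upper median, so the result stays an integer gray value.
            out_row.append(window[len(window) // 2])
        out.append(out_row)
    return out


# Toy image: uniform background (200) with a single dark speck in the centre.
SPECKLED = [[200] * 5 for _ in range(5)]
SPECKLED[2][2] = 0
```

despeckle(SPECKLED) gives back a uniform image of 200s: the isolated speck is
removed, whereas a stroke several pixels wide would largely survive.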



On Fri, Feb 1, 2019 at 14:06, Lorenzo Bolzani <
[email protected]> wrote:

>
> Yes, old OCR solutions use binarized content but I see this as a legacy
> limitation. It was probably done to speed up the processing and also, I
> suppose, because the algorithms used would not benefit from the extra gray
> details anyway. Old OCR tech was also print oriented, so the text was
> already near binary.
>
> With a neural network there is no extra time cost in processing grayscale
> or binary text, they are just float values in both cases. Binarization
> throws away a lot of data, especially with noisy images, complex
> backgrounds, etc. (ID documents, smartphone pictures, etc.).
>
> Binarization may improve OCR performance, but I doubt it: a CNN should
> easily be able to learn to binarize the image itself if this improves the
> results.
>
> The only advantage I see for binarization is that synthetic training data
> is binary, so you try to match the real input data to the data used for
> training. Of course, you could corrupt the synthetic data to make it
> grayscale.
>
> But I would expect a full grayscale training and prediction to give
> slightly better results, especially for complex cases.
>
> I fine tuned my models using grayscale data (real world crops, not
> synthetic) and, if possible, I'd like to try to disable the binarization
> step to see if I get an improvement. Maybe there are some parameters
> controlling this step.
>
>
> Thanks
>
> Lorenzo
>
> On Thu, Jan 31, 2019 at 20:42, Zdenko Podobny <[email protected]>
> wrote:
>
>> see inline comments.
>>
>> On Wed, Jan 30, 2019 at 15:17, Lorenzo Bolzani <[email protected]> wrote:
>>
>>>
>>> I suppose this means that the image is always binarized, is this correct?
>>>
>> Yes
>>
>>>
>>> Is there any way to avoid it?
>>>
>>
>> Why? IMO OCR engines run on binarized images; see e.g.
>> https://www.abbyy.com/en-eu/ocr-sdk/key-features/image-processing/
>>
>>
>>> Does this binarization happen by default during training too?
>>>
>>
>> I do not know. I did not have time to play with training in v4.0.
>>
>>>
>>> I fine-tuned a few models using grayscale images. Do you think the
>>> neural network received binary black/white pixels or the gray ones?
>>>
>> I do not know.
>>
>>>
>>> Thanks, bye
>>>
>>> Lorenzo
>>>
>>> On Wed, Jan 30, 2019 at 13:28, Zdenko Podobny <
>>> [email protected]> wrote:
>>>
>>>> try:
>>>>  tesseract image - get.image
>>>> which calls GetThresholdedImage()
>>>> <https://github.com/tesseract-ocr/tesseract/blob/12c1abcb6b4ef90cfafe316a3b40753ee5e9b9ef/src/api/baseapi.cpp#L638>
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Wed, Jan 30, 2019 at 11:17, Lorenzo Bolzani <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>> Zdenko, are you 100% sure that the image is binarized before being fed
>>>>> to the neural network? It looks like a big waste of information to me.
>>>>>
>>>>>
>>>>> On Wed, Jan 30, 2019 at 07:56, Zdenko Podobny <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> That is not true: you do not need to convert the image to grayscale.
>>>>>> In the end, Tesseract binarizes (with Otsu) any input image that is
>>>>>> not already binarized.
>>>>>>
>>>>>> BUT: preprocessing the image (e.g. custom binarization) will help. See
>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 30, 2019 at 7:36, <[email protected]> wrote:
>>>>>>
>>>>>>> Not a solution, but the image needs to be converted to grayscale,
>>>>>>> etc. (using OpenCV), since OCR works best with grayscale images and
>>>>>>> images with a resolution of 300 dpi.
>>>>>>>
>>>>>>> On Tuesday, June 12, 2018 at 12:44:21 PM UTC+5:30, Vidur Malhotra
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I tried running Tesseract on the attached image, but I am not
>>>>>>>> getting the desired output. My sample code:
>>>>>>>>
>>>>>>>>
>>>>>>>> from PIL import Image
>>>>>>>> import pytesseract
>>>>>>>>
>>>>>>>> # Run OCR on the image with the English language model.
>>>>>>>> text = pytesseract.image_to_string(Image.open('test3.jpg'), lang='eng')
>>>>>>>> print(text)
>>>>>>>>
>>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/43a5a754-4227-43b6-aec1-0261403b2029%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/43a5a754-4227-43b6-aec1-0261403b2029%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
