I think the reason is that your input is poor, so the model gets confused
and a few pixels are enough for it to see an extra letter.

Your input is "bad" because it differs from the images used to train the
neural network. The difference between your two images is small, but both
of them are far from the training data.

If you clean up your image (no borders, less noise, much stronger contrast,
maybe even straightened text), this kind of problem should become much less
common.
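
As a rough illustration of two of these steps (stronger contrast and
removing the border), here is a minimal sketch in plain Python. It treats
the image as a list of rows of grayscale values from 0 to 255. The function
names, the threshold of 128, and the tiny sample image are made up for
illustration; in practice you would probably do this with an image library
and tune the values for your own images.

```python
def stretch_contrast(gray):
    """Linearly rescale so the darkest pixel becomes 0 and the brightest 255."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:
        # Flat image: nothing to stretch, return a copy.
        return [row[:] for row in gray]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in gray]


def trim_border(gray, ink_threshold=128):
    """Crop to the bounding box of 'ink' pixels (values below ink_threshold)."""
    rows = [y for y, row in enumerate(gray) if any(p < ink_threshold for p in row)]
    cols = [x for x in range(len(gray[0]))
            if any(row[x] < ink_threshold for row in gray)]
    if not rows or not cols:
        # No ink found: leave the image untouched.
        return gray
    return [row[cols[0]:cols[-1] + 1] for row in gray[rows[0]:rows[-1] + 1]]


# Made-up 4x5 image: light-gray background (200) with a dark-gray stroke (90).
image = [
    [200, 200, 200, 200, 200],
    [200,  90,  90, 200, 200],
    [200, 200,  90, 200, 200],
    [200, 200, 200, 200, 200],
]

cleaned = trim_border(stretch_contrast(image))
print(cleaned)
```

The contrast stretch runs first so that the fixed ink threshold used for
cropping is applied to a full-range image.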

If you want to understand a little more about why this is possible, read up
on how LSTM-based OCR works. The cause is likely in the step that decides
the letters from the neural network output (beam search, CTC). It is not a
bug, just how it works.
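
To make the CTC part concrete, here is a simplified sketch in Python. It
uses greedy CTC collapsing (merge repeated labels, then drop blanks), which
is a simpler stand-in for the beam search mentioned above and is not
Tesseract's actual decoder. The per-frame labels are invented for
illustration, not real network output, and "-" is an arbitrary choice for
the blank symbol.

```python
from itertools import groupby

BLANK = "-"  # assumed symbol for the CTC blank label


def ctc_greedy_collapse(frame_labels):
    """Merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# Invented best label per time step (one step is a narrow vertical slice).
good_frames = list("EE-L-HH-AA-DD--JJ")
bad_frames = list("EE-L-HH-AA-DD-IJJ")  # a single blank frame flipped to "I"

print(ctc_greedy_collapse(good_frames))
print(ctc_greedy_collapse(bad_frames))
```

The two sequences differ in exactly one frame, yet the decoded strings
differ by a whole letter. A thin vertical stroke such as "I" needs very few
frames, so a small pixel difference near the "J" can be enough to tip one
frame over.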

I do not think there is much you can do with parameters and the like, other
than improve your image or improve Tesseract itself. Sometimes this happens
even with fine-tuned models.



Lorenzo




On Wed, 15 Jul 2020 at 20:55, MysteriousGuy <gyt...@gmail.com> wrote:

> This seems like an ad-hoc approach. I am already converting images to
> grayscale. If I apply blurring, binarisation, etc., I will solve this
> case but cause another case to fail as a result. Something in Tesseract
> fails to generalize across clearly near-identical images, and I am
> interested in what it is.
>
> On Wednesday, July 15, 2020 at 12:08:33 UTC+3, Tuan Ardouin wrote:
>>
>> You need to apply some pre-processing to your image.
>>
>> On Wednesday, July 15, 2020 at 9:01:14 AM UTC+2, MysteriousGuy wrote:
>>>
>>> Hi. The latest stable version (4.1.1) produces the same error.
>>>
>>> On Tuesday, July 14, 2020 at 17:13:40 UTC+3, zdenop wrote:
>>>>
>>>> Try to use the latest version of tesseract.
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Tue, 14 Jul 2020 at 16:04, MysteriousGuy <gyt...@gmail.com> wrote:
>>>>
>>>>> I am using Tesseract to extract text from images attached. For some
>>>>> reason, even though the images are nearly identical, tesseract makes a
>>>>> mistake in one of them: for 'bad.png' the output is ELHADIJ, whereas for
>>>>> 'good.png' it is ELHADJ
>>>>>
>>>>> Here is what I have and what I have done:
>>>>>
>>>>>    - tesseract version: 4.0.0-beta.1
>>>>>    - leptonica version: 1.75.3
>>>>>    - I use the English .traineddata file from here:
>>>>>    https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata
>>>>>    - I tried these page segmentation modes: 3, 7, 8, 13 - the mistake
>>>>>    is always there.
>>>>>
>>>>> So the commands I ran were
>>>>>
>>>>> tesseract good.png output1 -l eng --psm 8
>>>>> tesseract bad.png output2 -l eng --psm 8
>>>>>
>>>>> and similarly for other PSMs
>>>>>
>>>>>
>>>>> My question is: how do I make tesseract more robust? Why does it make
>>>>> a mistake in one case but not in the other?
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesser...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/81a83479-b266-4686-a2d8-fae2d5916831o%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/81a83479-b266-4686-a2d8-fae2d5916831o%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>

