I've also noticed inconsistencies depending on where I crop.
I created a simple image in 10-point DejaVu Sans Mono
(code_10_dejavu_sans_mono.png) that contains *6X279SWKF*.
I pre-process it in 2 ways:
- Scale it up by 4 (scaled_up_only.png):

      img = cv2.resize(img, None, fx=4, fy=4,
                       interpolation=cv2.INTER_CUBIC)

- Crop it first and then scale it up by 4 as above
  (cropped_then_scaled_up_only.png):

      x, y = 10, 10    # top-left corner of the crop
      w, h = 110, 20   # crop width and height
      img = img[y:y + h, x:x + w]

I get different results.

    tesseract --psm 13 -c tessedit_char_whitelist=-ABCDEFGHIJKLMNOPQRSTUVWXY1234567890 scaled_up_only.png out

(using
https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata)
- cropped_then_scaled_up_only gives the correct value *6X279SWKF*
- scaled_up_only gives the incorrect value *6X2795WKF*
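
For anyone who wants to reproduce the comparison, here is a rough sketch that runs the same settings on both files from Python. The helper names (tesseract_command, run_ocr, compare) are made up for illustration, and it assumes the tesseract binary is on PATH with the eng.traineddata model installed in its default tessdata directory. It writes to stdout instead of an "out" file, so the arguments are ordered as image, output base, then options.

```python
import subprocess

# Same whitelist as in the command above.
WHITELIST = "-ABCDEFGHIJKLMNOPQRSTUVWXY1234567890"


def tesseract_command(image_path, psm=13, whitelist=WHITELIST):
    """Build the argument list for one tesseract run that prints to stdout."""
    return [
        "tesseract", image_path, "stdout",
        "--psm", str(psm),
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image_path):
    """Run tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def compare(image_paths):
    """Return a dict mapping each image path to its OCR result."""
    return {path: run_ocr(path) for path in image_paths}
```

Calling compare(["scaled_up_only.png", "cropped_then_scaled_up_only.png"]) should make it easy to see the two results side by side.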
Any insight on this and possible solutions to overcome it? I am experimenting
with different preprocessing steps, but I keep seeing this kind of behavior,
where the results differ even though the only difference between 2 images is
that one has an extra top row of white pixels.
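
One way to probe this sensitivity to white margins is to generate variants of the same image that differ only in the amount of white border and run each one through OCR. Below is a minimal sketch using NumPy only. The function names are hypothetical, and it assumes the image is a uint8 array as returned by cv2.imread, with a white (255) background.

```python
import numpy as np


def add_white_border(img, top=0, bottom=0, left=0, right=0):
    """Return a copy of img with white (255) pixels added on each side.

    Works for grayscale (H, W) and color (H, W, C) uint8 arrays.
    """
    pad = [(top, bottom), (left, right)]
    # Never pad the channel axis of a color image.
    pad += [(0, 0)] * (img.ndim - 2)
    return np.pad(img, pad, mode="constant", constant_values=255)


def border_variants(img, max_margin=10):
    """Yield (margin, padded_image) pairs with a uniform white margin."""
    for margin in range(max_margin + 1):
        yield margin, add_white_border(img, margin, margin, margin, margin)
```

Each variant can then be scaled up with cv2.resize as above, saved, and passed to tesseract to see at which margins the S/5 confusion appears.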
On Tuesday, October 22, 2019 at 5:32:37 AM UTC-7, zdenop wrote:
>
> I am afraid that such a small fraction of text (consisting only of characters
> that are commonly misinterpreted, like S or 5) cannot be recognized with 100%
> accuracy. Try to use it in some context (a line).
>
> Zdenko
>
>
> On Mon, Oct 21, 2019 at 20:22, Ast <[email protected]> wrote:
>
>> I've spent a good amount of time looking into how to resolve this issue. I
>> came across this unanswered post
>> <https://groups.google.com/forum/?fromgroups#!searchin/tesseract-ocr/2s%7Csort:date/tesseract-ocr/uDxMr-65_nk/csA6aYaLCwAJ>
>>
>> from 2017. I tried it and it is still reproducible today. There are 2 images:
>> one with the letter S, one with 2S. As a single character, the letter S
>> is detected successfully, but 2S is detected as 25.
>>
>> From what I've been able to learn, this issue stems from the combination
>> of alphanumeric characters (common in receipts or codes) and how Tesseract
>> tries to use dictionary words.
>>
>> *Environment:*
>>
>> tesseract 4.1.0
>> leptonica-1.76.0
>> libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 :
>> libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>> Found AVX2
>> Found AVX
>> Found SSE
>>
>> Debian 10 64bit
>>
>> I've tried changing some configuration options such as *load_system_dawg=0*
>> and *load_freq_dawg=0*, but without luck.
>>
>> I am fairly new to OCR so any input and feedback is greatly appreciated.
>> Thank you.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected] <javascript:>.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/9e8203e6-fbd5-47dc-8b2b-0327fe1e2e0a%40googlegroups.com
>>
>