
Ast Mon, 28 Oct 2019 12:47:11 -0700

Thanks for the insight!
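
The size observation in the reply quoted below can be checked against the 4x upscaling used earlier in the thread. This is only a rough sketch: the 30pt figure is recalled from memory in that reply, point size maps to pixels only at a fixed DPI, and the helper names `scaled_point_size` and `largest_scale_within` are hypothetical, not part of Tesseract or OpenCV.

```python
def scaled_point_size(font_pt, factor):
    """Nominal point size after scaling an image by `factor`."""
    return font_pt * factor


def largest_scale_within(font_pt, limit_pt=30.0):
    """Largest scale factor that keeps the nominal size at or below limit_pt.

    limit_pt defaults to the ~30pt figure mentioned in the thread, which is
    an assumption, not a documented Tesseract threshold.
    """
    if font_pt <= 0:
        raise ValueError("font_pt must be positive")
    return limit_pt / font_pt


if __name__ == "__main__":
    font_pt = 10  # the test image uses a 10-point font
    for factor in (1, 2, 3, 4):
        size = scaled_point_size(font_pt, factor)
        note = "over limit" if size > 30 else "ok"
        print(f"x{factor}: ~{size}pt ({note})")
    print("largest factor within limit:", largest_scale_within(font_pt))
```

Under these assumptions, scaling the 10-point image by 4 puts the text at a nominal 40pt, above the recalled limit. That would be consistent with the unscaled image being read correctly.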

On Wednesday, October 23, 2019 at 11:45:53 PM UTC-7, zdenop wrote:
>
> When I run:
> tesseract code_10_dejavu_sans_mono.png -
> I get the result *6X279SWKF*, i.e. no preprocessing is needed.
> Also, someone posted an analysis to the forum in the past which showed 
> (AFAIR) that increasing the size of letters above 30pt causes problems 
> for Tesseract 4.
>
> Zdenko
>
>
> On Wed, 23 Oct 2019 at 3:11, Ast <[email protected]> wrote:
>
>> I've also noticed inconsistencies depending on where I crop.
>>
>> I created a simple image in 10-point DejaVu Sans Mono 
>> (code_10_dejavu_sans_mono.png) which contains *6X279SWKF*.
>>
>> I pre-process it 2 ways:
>>
>>    - Scale it up by a factor of 4 (scaled_up_only.png) using:
>>    
>> img = cv2.resize(img,
>>                  None,  # output size is derived from fx and fy
>>                  fx=4,
>>                  fy=4,
>>                  interpolation=cv2.INTER_CUBIC)
>>
>>    - Crop it first and then scale it up by 4 as above 
>>    (cropped_then_scaled_up_only.png)
>>    
>>         x = 10
>>         y = 10
>>         h = 20
>>         w = 110
>>
>>         img = img[y:y + h, x:x + w]
>>
>> I get different results. 
>>
>> *tesseract --psm 13 -c 
>> tessedit_char_whitelist=-ABCDEFGHIJKLMNOPQRSTUVWXY1234567890 
>> scaled_up_only.png out*
>>
>> (using 
>> https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata
>> )
>>
>>    - cropped_then_scaled_up_only gives the correct value *6X279SWKF*
>>    - scaled_up_only gives the incorrect value *6X2795WKF*
>>    
>> Any insight on this and possible solutions to overcome it? I am playing 
>> with different ways to preprocess, but there seems to be this kind of 
>> behavior where the only difference between 2 images is that one has an 
>> extra top row of white pixels.
>>
>> On Tuesday, October 22, 2019 at 5:32:37 AM UTC-7, zdenop wrote:
>>>
>>> I am afraid that such a small fragment of text (consisting only of 
>>> characters that are commonly confused, like S and 5) cannot be recognized 
>>> with 100% accuracy. Try to use it in some context (a line).
>>>
>>> Zdenko
>>>
>>>
>>> On Mon, 21 Oct 2019 at 20:22, Ast <[email protected]> wrote:
>>>
>>>> I've spent a good amount of time looking into how to resolve this issue. 
>>>> I came across this unanswered post 
>>>> <https://groups.google.com/forum/?fromgroups#!searchin/tesseract-ocr/2s%7Csort:date/tesseract-ocr/uDxMr-65_nk/csA6aYaLCwAJ>
>>>>  
>>>> from 2017. I tried it and it is still reproducible today. There are 2 
>>>> images: one with the letter S, one with 2S. As a single character, the 
>>>> letter S is detected successfully, but 2S is detected as 25.
>>>>
>>>> From what I've been able to learn, this issue stems from the 
>>>> combination of alphanumeric characters (common in receipts or codes) and 
>>>> how Tesseract tries to use dictionary words. 
>>>>
>>>> *Environment:*
>>>>
>>>> tesseract 4.1.0
>>>>  leptonica-1.76.0
>>>>   libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : 
>>>> libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>>>  Found AVX2
>>>>  Found AVX
>>>>  Found SSE
>>>>
>>>> Debian 10 64bit
>>>>
>>>> I've tried changing some configuration variables, such as 
>>>> *load_system_dawg=0* and *load_freq_dawg=0*, but without luck.
>>>>
>>>> I am fairly new to OCR so any input and feedback is greatly 
>>>> appreciated. Thank you. 
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/9e8203e6-fbd5-47dc-8b2b-0327fe1e2e0a%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/9e8203e6-fbd5-47dc-8b2b-0327fe1e2e0a%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>
>
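
For anyone reproducing the comparison in the quoted messages, the two Tesseract runs can be scripted so that both images get identical options. This is a minimal sketch using only the flags and config variables that appear in the thread (`--psm 13`, `tessedit_char_whitelist`, `load_system_dawg`, `load_freq_dawg`). The helper name `build_tesseract_cmd` and its parameters are hypothetical, and the whitelist is copied verbatim from the quoted command (note that it contains no Z).

```python
import shlex

# Whitelist copied verbatim from the command quoted in the thread.
WHITELIST = "-ABCDEFGHIJKLMNOPQRSTUVWXY1234567890"


def build_tesseract_cmd(image_path, out_base, psm=13, whitelist=WHITELIST,
                        disable_dawgs=False):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    cmd = ["tesseract", "--psm", str(psm),
           "-c", f"tessedit_char_whitelist={whitelist}"]
    if disable_dawgs:
        # The two dictionary switches mentioned in the first message.
        cmd += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    cmd += [image_path, out_base]
    return cmd


if __name__ == "__main__":
    for name in ("scaled_up_only", "cropped_then_scaled_up_only"):
        cmd = build_tesseract_cmd(name + ".png", name, disable_dawgs=True)
        # Print the command; pass `cmd` to subprocess.run to execute it
        # on a machine where tesseract is installed.
        print(shlex.join(cmd))
```

Building the argument list in one place keeps the only difference between the two runs in the input image, which is what the comparison is meant to isolate.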



Re: [tesseract-ocr] Accuracy with non-standard words consisting of random combinations/mix of digits + letters/characters
