When I run:

    tesseract code_10_dejavu_sans_mono.png -

I get the result *6X279SWKF*, i.e. no preprocessing is needed. Also, someone posted an analysis to the forum in the past which showed (AFAIR) that increasing the size of letters above 30pt causes problems for Tesseract 4.
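To make the size point concrete, here is a rough sketch of the arithmetic. It assumes the image DPI stays unchanged, so scaling the image by a factor multiplies the effective point size by that factor. The 30pt limit is only the recollection mentioned above, not a documented Tesseract constant, and the helper names (`effective_point_size`, `max_safe_scale`, `tesseract_command`) are made up for illustration. The last helper only assembles the command line used in this thread; it does not run Tesseract.

```python
import shlex

# Assumed threshold, taken from the "above 30pt" recollection in this thread.
# It is not a documented Tesseract limit.
MAX_POINT_SIZE = 30.0


def effective_point_size(font_pt, scale):
    """Approximate text size after scaling an image by `scale` at unchanged DPI."""
    return font_pt * scale


def max_safe_scale(font_pt, limit_pt=MAX_POINT_SIZE):
    """Largest scale factor that keeps the text at or below `limit_pt`."""
    return limit_pt / font_pt


def tesseract_command(image, output="-", psm=None, whitelist=None):
    """Build an argv list for the tesseract CLI ("-" as output base means stdout)."""
    argv = ["tesseract", image, output]
    if psm is not None:
        argv += ["--psm", str(psm)]
    if whitelist is not None:
        argv += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return argv


if __name__ == "__main__":
    # A 10pt font scaled up by 4 ends up at roughly 40pt, above the assumed limit.
    print(effective_point_size(10, 4))
    # Under the same assumption, a factor of 3 would be the largest safe scale.
    print(max_safe_scale(10))
    # The plain invocation without any preprocessing, as a shell-ready string.
    print(shlex.join(tesseract_command("code_10_dejavu_sans_mono.png")))
```

Under these assumptions, the 4x upscaling described in the quoted message below pushes the 10pt text to about 40pt, which may explain why the unscaled image is recognized correctly.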
Zdenko

On Wed, 23 Oct 2019 at 3:11, Ast <[email protected]> wrote:

> I've also noticed inconsistencies depending on where I crop.
>
> I created a simple image with a 10-point DejaVu Sans Mono font
> (code_10_dejavu_sans_mono.png) which contains *6X279SWKF*
>
> I pre-process it in 2 ways:
>
> - Scale it up by 4 (scaled_up_only.png) using
>
>       cv2.resize(img,
>                  None,
>                  fx=4,
>                  fy=4,
>                  interpolation=cv2.INTER_CUBIC)
>
> - Crop it first and then scale it up by 4 as above
>   (cropped_then_scaled_up_only.png)
>
>       x = 10
>       y = 10
>       h = 20
>       w = 110
>
>       img = img[y:y + h, x:x + w]
>
> I get different results.
>
>     tesseract --psm 13 -c tessedit_char_whitelist=-ABCDEFGHIJKLMNOPQRSTUVWXY1234567890 scaled_up_only.png out
>
> (using
> https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata
> )
>
> - cropped_then_scaled_up_only gives the correct value *6X279SWKF*
> - scaled_up_only gives the incorrect value *6X2795WKF*
>
> Any insight on this and possible solutions to overcome it? I am playing
> with different ways to preprocess, but there seems to be this kind of
> behavior where the only difference between 2 images is that one has an
> extra top row of white pixels.
>
> On Tuesday, October 22, 2019 at 5:32:37 AM UTC-7, zdenop wrote:
>>
>> I am afraid that such a small fraction of text (containing just
>> characters that are commonly misinterpreted, like S or 5 or ?) cannot
>> be recognized with 100% accuracy. Try to use it in some context (a line).
>>
>> Zdenko
>>
>> On Mon, 21 Oct 2019 at 20:22, Ast <[email protected]> wrote:
>>
>>> I've spent a good amount of time looking into how to resolve this
>>> issue. I came across this unanswered post
>>> <https://groups.google.com/forum/?fromgroups#!searchin/tesseract-ocr/2s%7Csort:date/tesseract-ocr/uDxMr-65_nk/csA6aYaLCwAJ>
>>> from 2017. I tried it and it is still reproducible today. There are 2
>>> images: one with the letter S, one with 2S. As a single character, the
>>> letter S is detected successfully, but 2S is detected as 25.
>>>
>>> From what I've been able to learn, this issue stems from the
>>> combination of alphanumeric characters (common in receipts or codes)
>>> and how Tesseract tries to use dictionary words.
>>>
>>> *Environment:*
>>>
>>> tesseract 4.1.0
>>>  leptonica-1.76.0
>>>   libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 :
>>>   libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>>  Found AVX2
>>>  Found AVX
>>>  Found SSE
>>>
>>> Debian 10 64bit
>>>
>>> I've tried changing some configuration variables such as
>>> *load_system_dawg=0* and *load_freq_dawg=0*, but without luck.
>>>
>>> I am fairly new to OCR, so any input and feedback is greatly
>>> appreciated. Thank you.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8yuuvDj5773mhGxWy5snuc_1ZoJYFjkPzAPTKJYYAZ-Wg%40mail.gmail.com.

