Thanks for the insight!

On Wednesday, October 23, 2019 at 11:45:53 PM UTC-7, zdenop wrote:
>
> When I run:
>     tesseract code_10_dejavu_sans_mono.png -
> I get the result *6X279SWKF*, i.e. no preprocessing is needed.
> Also, someone posted an analysis to the forum in the past which showed
> (AFAIR) that increasing the size of letters over 30pt causes problems
> for Tesseract 4.
>
> Zdenko
>
>
> On Wed, 23 Oct 2019 at 3:11, Ast <[email protected]> wrote:
>
>> I've also noticed inconsistencies depending on where I crop.
>>
>> I created a simple image with a 10 point DejaVu Sans Mono font
>> (code_10_dejavu_sans_mono.png) which contains *6X279SWKF*
>>
>> I pre-process it 2 ways:
>>
>> - Scale it up by 4 (scaled_up_only.png) using
>>
>>     cv2.resize(img,
>>                None,
>>                fx=4,
>>                fy=4,
>>                interpolation=cv2.INTER_CUBIC)
>>
>> - Crop it first and then scale it up by 4 as above
>>   (cropped_then_scaled_up_only.png)
>>
>>     x = 10
>>     y = 10
>>     h = 20
>>     w = 110
>>
>>     img = img[y:y + h, x:x + w]
>>
>> I get different results.
>>
>> *tesseract --psm 13 -c
>> tessedit_char_whitelist=-ABCDEFGHIJKLMNOPQRSTUVWXY1234567890
>> scaled_up_only.png out*
>>
>> (using
>> https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata
>> )
>>
>> - cropped_then_scaled_up_only gives the correct value *6X279SWKF*
>> - scaled_up_only gives the incorrect value *6X2795WKF*
>>
>> Any insight on this and possible solutions to overcome it? I am trying
>> different ways to preprocess, but I keep seeing this kind of behavior,
>> where the only difference between 2 images is that one has an extra
>> top row of white pixels.
>>
>> On Tuesday, October 22, 2019 at 5:32:37 AM UTC-7, zdenop wrote:
>>>
>>> I am afraid that such a small fraction of text (containing only
>>> characters that are commonly misinterpreted, like S or 5 or ?) cannot
>>> be recognized with 100% accuracy. Try to use it in some context (a line).
>>>
>>> Zdenko
>>>
>>>
>>> On Mon, 21 Oct 2019 at 20:22, Ast <[email protected]> wrote:
>>>
>>>> I've spent a good amount of time looking into how to resolve this issue.
>>>> I came across this unanswered post
>>>> <https://groups.google.com/forum/?fromgroups#!searchin/tesseract-ocr/2s%7Csort:date/tesseract-ocr/uDxMr-65_nk/csA6aYaLCwAJ>
>>>> from 2017. I tried it and it is still reproducible today. There are 2
>>>> images - one with the letter S, one with 2S. As a single character, the
>>>> letter S is detected successfully, but 2S is detected as 25.
>>>>
>>>> From what I've been able to learn, this issue stems from the
>>>> combination of alphanumeric characters (common in receipts or codes) and
>>>> how Tesseract tries to use dictionary words.
>>>>
>>>> *Environment:*
>>>>
>>>> tesseract 4.1.0
>>>> leptonica-1.76.0
>>>> libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 :
>>>> libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>>> Found AVX2
>>>> Found AVX
>>>> Found SSE
>>>>
>>>> Debian 10 64bit
>>>>
>>>> I've tried changing some configuration variables such as
>>>> *load_system_dawg=0* and *load_freq_dawg=0*, but without luck.
>>>>
>>>> I am fairly new to OCR so any input and feedback is greatly
>>>> appreciated. Thank you.
>>>>
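For anyone reproducing the comparison above, here is a minimal sketch of how the same Tesseract settings could be applied to both images from Python, so that the only variable is the preprocessing. The helper names `build_tesseract_cmd` and `run_ocr` are hypothetical and not part of Tesseract. The flags and config variables (`--psm 13`, `tessedit_char_whitelist`, `load_system_dawg`, `load_freq_dawg`) and the `-` output target are the ones quoted in this thread. Whether disabling the dictionaries changes the S/5 result is not established here.

```python
import shutil
import subprocess

# Whitelist copied verbatim from the command quoted in this thread.
WHITELIST = "-ABCDEFGHIJKLMNOPQRSTUVWXY1234567890"


def build_tesseract_cmd(image_path, psm=13, whitelist=WHITELIST,
                        use_dictionaries=False):
    """Return the argv list for one tesseract run (hypothetical helper)."""
    # "-" as the output base asks tesseract to print the text to stdout.
    cmd = ["tesseract", image_path, "-", "--psm", str(psm)]
    cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    if not use_dictionaries:
        # Config variables mentioned in the thread for switching off
        # the word lists.
        cmd += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return cmd


def run_ocr(image_path, **kwargs):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(build_tesseract_cmd(image_path, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Calling `run_ocr("scaled_up_only.png")` and `run_ocr("cropped_then_scaled_up_only.png")` one after the other would then show whether the two preprocessing variants still disagree under identical settings.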
-- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6b0d8903-1a47-437f-973d-5be5a8932434%40googlegroups.com.

