[tesseract-ocr] [4.00] Extra symbols produced

estel4ever Fri, 01 Mar 2019 01:07:36 -0800

Gday.

I'm using Tesseract 4.00, compiled from the release source on Linux, with 
the LSTM engine.


My pages are rendered from PDFs with Ghostscript at 300 dpi, then converted 
to greyscale using OpenCV.
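
For reference, a minimal Python sketch of a pipeline like the one described 
above. The Ghostscript device (png16m), the file naming and the helper names 
are assumptions for illustration; the post does not say which options were 
actually used.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(pdf_path, out_pattern, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a PNG."""
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",              # assumed device: 24-bit colour PNG
        f"-r{dpi}",                     # render resolution in dpi
        f"-sOutputFile={out_pattern}",  # e.g. page-%03d.png, one file per page
        str(pdf_path),
    ]


def to_greyscale(src_png, dst_png):
    """Convert one rendered page to greyscale with OpenCV."""
    import cv2  # imported lazily so the command builder works without OpenCV

    img = cv2.imread(str(src_png), cv2.IMREAD_COLOR)
    if img is None:
        raise FileNotFoundError(src_png)
    grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    cv2.imwrite(str(dst_png), grey)


def render_and_grey(pdf_path, work_dir):
    """Render all pages of a PDF, then write a greyscale copy of each page."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    pattern = work_dir / "page-%03d.png"
    subprocess.run(ghostscript_cmd(pdf_path, pattern), check=True)
    for page in sorted(work_dir.glob("page-*.png")):
        # page-001.png -> grey-001.png (hypothetical naming scheme)
        to_greyscale(page, page.with_name(page.name.replace("page-", "grey-")))
```

Calling render_and_grey("input.pdf", "pages") would leave grey-001.png, 
grey-002.png, ... in the pages directory, ready to be passed to tesseract.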

I found an issue where the OCR output for some specific regions contains 
more symbols than are present in the image.

Example: there is a standalone "word" containing "15" (it is actually part 
of a date, like "15 OCT", detected as two words - which is correct).
The box coords are correct and no other symbols fall inside the box, but 
the output from running 
tesseract .. --psm 11 --dpi 300 is "156" (instead of "15").
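
One way to check the word box and its confidence is to ask tesseract for TSV 
output and read the word-level rows. The sketch below assumes the standard 
"tsv" config file shipped with tessdata; the sample rows are made-up values 
for illustration, not real output from the page in question.

```python
import csv
import io


def tsv_cmd(image_path, psm=11, dpi=300):
    """Build a Tesseract CLI call that prints TSV results to stdout."""
    return ["tesseract", str(image_path), "stdout",
            "--psm", str(psm), "--dpi", str(dpi), "tsv"]


def parse_tsv_words(tsv_text):
    """Return word-level rows (level 5) from Tesseract TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        if row["level"] != "5":  # 5 = word; lower levels are page/block/line
            continue
        words.append({
            "text": row["text"] or "",
            "conf": float(row["conf"]),
            "box": (int(row["left"]), int(row["top"]),
                    int(row["width"]), int(row["height"])),
        })
    return words


# Hypothetical sample in the TSV column layout; all numbers are invented.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t2480\t3508\t-1\t\n"
    "5\t1\t3\t1\t1\t1\t410\t220\t38\t26\t71\t156\n"
    "5\t1\t3\t1\t1\t2\t462\t220\t61\t26\t93\tOCT\n"
)

for word in parse_tsv_words(SAMPLE_TSV):
    print(word["text"], word["conf"], word["box"])
```

The box is (left, top, width, height) in pixels of the input image, so it can 
be compared directly against the region where "15" is printed.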

If I cut that part of the image out and save it as a separate file, then 
OCR it with psm=6 (or 7), the output is "15" (correct).
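
The crop-and-rerun check could be scripted roughly as follows. This is only a 
sketch: the 10 px padding around the box and the helper names are 
assumptions, and the post does not say how the region was actually cut out.

```python
import subprocess


def padded_crop(box, image_size, pad=10):
    """Expand (left, top, width, height) by pad pixels, clamped to the image.

    Returns (x0, y0, x1, y1) with x1/y1 exclusive, ready for array slicing.
    """
    left, top, width, height = box
    img_w, img_h = image_size
    x0 = max(left - pad, 0)
    y0 = max(top - pad, 0)
    x1 = min(left + width + pad, img_w)
    y1 = min(top + height + pad, img_h)
    return x0, y0, x1, y1


def tesseract_cmd(image_path, psm, dpi=300):
    """Build a Tesseract CLI call that prints recognised text to stdout."""
    return ["tesseract", str(image_path), "stdout",
            "--psm", str(psm), "--dpi", str(dpi)]


def ocr_region(page_png, box, crop_png, psm=7):
    """Cut one word box out of the page, save it, and OCR the crop alone."""
    import cv2  # imported lazily so the helpers above work without OpenCV

    img = cv2.imread(str(page_png), cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(page_png)
    img_h, img_w = img.shape[:2]
    x0, y0, x1, y1 = padded_crop(box, (img_w, img_h))
    cv2.imwrite(str(crop_png), img[y0:y1, x0:x1])  # rows are y, columns are x
    result = subprocess.run(tesseract_cmd(crop_png, psm), check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()
```

Running ocr_region once with psm=7 and once with psm=6 on the same box makes 
it easy to compare both results with the full-page --psm 11 output.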

I have seen this behavior only with a few symbol combinations, such as 
"15"->"156" and "08"->"0O8". It looks as if, when the confidence levels of 
the top two candidate symbols are very close, both symbols go to the output 
instead of just one.

Has anyone else run into the same issue?

