G'day. I'm using Tesseract 4.00, compiled from the release source, on Linux, with the LSTM engine.
My pages are rendered from PDFs with Ghostscript at 300 dpi and then converted to greyscale with OpenCV. I found an issue where the OCR output for some specific regions contains more symbols than are present in the image.

Example: there is a standalone "word" containing "15" (it is actually part of a date, like "15 OCT", which is detected as two words, and that is correct). The box coordinates are correct and no other symbols fall inside the box, but the output from running tesseract .. --psm 11 --dpi 300 is "156" instead of "15". If I cut that part of the image out, save it as a separate file, and then OCR it with psm=6 (or 7), the output is "15" (correct).

I have seen this behavior only for a few symbol combinations, such as "15"->"156" and "08"->"0O8". It looks like when the confidence levels of the top two candidate symbols are very close, both symbols go to the output instead of just one.

Has anyone else had the same issue?
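
In case it helps anyone reproduce this, here is a minimal sketch of how the page-level and cropped results could be compared automatically. It assumes the full-page run is saved in Tesseract's TSV format and that each word box is cropped and re-run separately (cropping itself is not shown). The helper names (parse_tsv_words, has_extra_symbols, recheck_command) are hypothetical, and the "extra symbol" test is only a heuristic of mine: it flags a word when the cropped result can be obtained from the page result by deleting characters.

```python
import csv
import io


def parse_tsv_words(tsv_text):
    """Return (text, left, top, width, height) for each word row of TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; lower levels are pages, blocks, paragraphs, lines.
        if row.get("level") == "5" and text:
            words.append((text, int(row["left"]), int(row["top"]),
                          int(row["width"]), int(row["height"])))
    return words


def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting characters."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def has_extra_symbols(page_text, crop_text):
    """Heuristic: the page-level result is the cropped result plus inserted symbols."""
    return len(page_text) > len(crop_text) and is_subsequence(crop_text, page_text)


def recheck_command(crop_path, psm=7, dpi=300):
    """Command line for re-running a cropped word image as a single text line."""
    return ["tesseract", crop_path, "stdout", "--psm", str(psm), "--dpi", str(dpi)]


if __name__ == "__main__":
    # The two cases from this report: page-level output vs. cropped output.
    for page_text, crop_text in [("156", "15"), ("0O8", "08")]:
        print(page_text, crop_text, has_extra_symbols(page_text, crop_text))
```

The command list returned by recheck_command can be passed to subprocess.run for each cropped box. Any word for which has_extra_symbols returns True is a candidate for the behavior described above.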

