[Sorry if double posting, but it seems that Google ate my first posting]
I'm working on document recognition for scanned bank statements. The
statements that I have are organized by lines, such as the one attached.
Because Tesseract does such a good job at detecting the areas of text, it
breaks the lines in the middle. I'm assuming this is because of the large
white space between the first block in the line (blurred for privacy
reasons) and the next one ('EUR' or 'COURS').
In the hOCR file, the bboxes of all the elements in a line agree to within
2px or so, so I could potentially rebuild the lines myself. However, this
seems more like a hack. Is there a way to tell Tesseract that lines should
be as wide as the document itself? Or would there be another way to go
about it? I've tried playing with the psm (page segmentation mode) option,
but with no luck.
<https://lh5.googleusercontent.com/-h8GUpWGzgjs/U0aGcH22_xI/AAAAAAAAOCY/8Dd6lYGC74o/s1600/tmp.png>
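
For reference, the "rebuild the lines myself" workaround would look roughly like the sketch below. It is only an illustration under a few assumptions: the hOCR words are `ocrx_word` spans with `class` before `title` (as in the files I get), words on the same printed line have vertical centres within a few pixels of each other, and the page is not skewed. The names `rebuild_lines` and `tolerance` are made up for this example and are not part of Tesseract.

```python
import re
from html import unescape

# Matches hOCR word spans of the form:
#   <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>EUR</span>
# Assumption: the class attribute appears before the title attribute.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"][^'\"]*?bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>"
    r"(.*?)</span>",
    re.S,
)


def rebuild_lines(hocr, tolerance=3):
    """Group hOCR words into full-width lines by vertical position.

    `tolerance` (hypothetical parameter) is the maximum distance in pixels
    between a word's vertical centre and the centre of the first word of
    the current line.
    """
    words = []
    for x0, y0, x1, y1, inner in WORD_RE.findall(hocr):
        # Drop nested markup such as <strong> or <em> and decode entities.
        text = unescape(re.sub(r"<[^>]+>", "", inner)).strip()
        if text:
            y_centre = (int(y0) + int(y1)) / 2
            words.append((y_centre, int(x0), text))

    # Walk the words from top to bottom, starting a new line whenever
    # the vertical centre moves further than `tolerance` from the line start.
    words.sort()
    lines = []  # each entry: [reference_y, [(x0, text), ...]]
    for y_centre, x0, text in words:
        if lines and abs(y_centre - lines[-1][0]) <= tolerance:
            lines[-1][1].append((x0, text))
        else:
            lines.append([y_centre, [(x0, text)]])

    # Within each line, order the words from left to right.
    return [" ".join(t for _, t in sorted(ws)) for _, ws in lines]


if __name__ == "__main__":
    import sys

    # Usage: python rebuild_lines.py statement.hocr
    with open(sys.argv[1], encoding="utf-8") as f:
        for line in rebuild_lines(f.read()):
            print(line)
```

This ignores Tesseract's own `ocr_line` grouping entirely and relies only on the word boxes, which is why it still feels like a hack to me: a skewed scan or a line with very tall and very short words could be split or merged incorrectly.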