[tesseract-ocr] Defining what is a line in Tesseract (or forcing a minimum width for a line)

Sébastien Cuendet Thu, 10 Apr 2014 06:33:46 -0700

[Sorry if double posting, but it seems that Google ate my first posting]

I'm working on document recognition for scanned bank statements. The 
statements that I have are organized by lines, such as the one attached. 
Because Tesseract does such a good job at detecting the areas of text, it 
breaks the lines in the middle. I'm assuming this is because of the large 
white space between the first block in the line (blurred for privacy 
reasons) and the next one ('EUR' or 'COURS').


In the hOCR file, the bboxes of all the elements in the line are aligned 
to within 2px or so, so I could potentially rebuild a line myself. However, 
this seems more like a hack. Is there a way to tell Tesseract that lines 
should be as wide as the document itself? Or would there be another way to 
go about it? I've tried playing with the psm (page segmentation mode) 
option, but with no luck.
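
For illustration, here is a minimal sketch of the "rebuild a line myself" workaround described above. It is only one possible approach and rests on several assumptions: that the hOCR output contains `ocrx_word` spans whose `title` attribute holds `bbox x0 y0 x1 y1`, that word spans contain no nested spans, and that words belong to the same visual row when their vertical centers differ by at most a few pixels. The names `WordCollector`, `rebuild_lines` and `tolerance` are hypothetical, not part of Tesseract.

```python
import re
from html.parser import HTMLParser

# Assumed hOCR convention: title="bbox x0 y0 x1 y1[; ...]"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collects (x0, y0, x1, y1, text) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: a word span never contains another span.
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((*self._bbox, text))
            self._bbox = None


def rebuild_lines(hocr, tolerance=5):
    """Group words into rows by vertical center, ignoring Tesseract's own lines."""
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()

    def center(word):
        return (word[1] + word[3]) / 2

    rows = []
    for word in sorted(parser.words, key=center):
        if rows:
            row_center = sum(center(w) for w in rows[-1]) / len(rows[-1])
            if abs(center(word) - row_center) <= tolerance:
                rows[-1].append(word)
                continue
        rows.append([word])

    # Within each row, order words left to right by x0.
    return [" ".join(w[4] for w in sorted(row, key=lambda w: w[0]))
            for row in rows]
```

Calling `rebuild_lines(open("page.hocr").read())` (the file name is a placeholder) would return one string per visual row. The `tolerance` value has to be tuned to the scan resolution, and skewed pages would need deskewing first for this grouping to hold.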


<https://lh5.googleusercontent.com/-h8GUpWGzgjs/U0aGcH22_xI/AAAAAAAAOCY/8Dd6lYGC74o/s1600/tmp.png>


