Tesseract's space handling needs a total rewrite - it's not me saying
so, it's Ray Smith in a previous post in this forum.

Specifically, after the digit '1' Tesseract appears to struggle more
than usual with spaces, probably because it attaches too much
importance to the width of the last character instead of just
considering the average width of digits on that line. In ScanBizCards
we run our own code after Tesseract to fix up the spaces:

Just Tesseract 3.0:
http://www.scanbizcards.com/tess.jpg

After we check spaces:
http://www.scanbizcards.com/tess_corrected.jpg

A simple approach as follows should work:
- compute the average width of digits, excluding the digit '1' because
it has an unusually narrow width
- after a '1' where Tess chose to have a space, compare the gap
between the '1' and the next digit to the average digit width; if it is
less than what you would visually consider a space (take your pick what
that threshold is), remove the space
- NOTE: this doesn't work so well for italic lines; either abstain in
that case (you can detect italic based on the space between digits
being negative or very small) or adjust for it.
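
To make the steps above concrete, here is a rough Python sketch. It is
not our production code. The CharBox type, the function names, the 0.5
space threshold and the 0.05 "very small gap" cutoff for italics are all
made up for illustration, and it assumes you already have per-character
bounding boxes for one text line from Tesseract.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CharBox:
    """Hypothetical per-character box for one text line (pixels)."""
    text: str
    left: int
    right: int
    space_after: bool = False  # True if the OCR put a space after this char

    @property
    def width(self) -> int:
        return self.right - self.left


def average_digit_width(boxes):
    """Mean width of the digits on the line, skipping the narrow '1'."""
    widths = [b.width for b in boxes if b.text.isdigit() and b.text != "1"]
    return mean(widths) if widths else None


def looks_italic(boxes, avg_width, small_fraction=0.05):
    """Guess italics from negative or very small gaps inside words.

    small_fraction is an arbitrary illustrative cutoff.
    """
    gaps = [nxt.left - cur.right
            for cur, nxt in zip(boxes, boxes[1:]) if not cur.space_after]
    return bool(gaps) and mean(gaps) <= small_fraction * avg_width


def remove_spaces_after_one(boxes, threshold=0.5):
    """Drop a space after '1' when the gap is too narrow to be a space.

    threshold is the assumed fraction of the average digit width that
    counts as a visible space; tune it for your own images.
    """
    avg = average_digit_width(boxes)
    if avg is None or looks_italic(boxes, avg):
        return boxes  # nothing to measure against, or italic: abstain
    for cur, nxt in zip(boxes, boxes[1:]):
        if cur.text == "1" and cur.space_after and nxt.text.isdigit():
            if nxt.left - cur.right < threshold * avg:
                cur.space_after = False
    return boxes
```

With digits about 40px wide and a 12px gap after the '1', the space is
removed; with a 30px gap it is kept.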

If you care to use our empirical ratio, this is what we do:
- calculate the "real gap" as the gap minus the average spacing on
that line
- if "real gap" divided by the average digit width is less than 0.695
we consider it a bad space
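
Spelled out as a small Python function, the test looks roughly like
this. The 0.695 ratio is the one quoted above; the function and
parameter names are only for illustration, and all measurements are
assumed to be in pixels for a single line.

```python
BAD_SPACE_RATIO = 0.695  # empirical ratio quoted above


def is_bad_space(gap, avg_spacing, avg_digit_width, ratio=BAD_SPACE_RATIO):
    """Return True if a space reported by the OCR is probably spurious.

    gap             -- distance between the two characters around the space
    avg_spacing     -- average inter-character spacing on that line
    avg_digit_width -- average digit width on that line (excluding '1')
    """
    if avg_digit_width <= 0:
        raise ValueError("avg_digit_width must be positive")
    real_gap = gap - avg_spacing
    return real_gap / avg_digit_width < ratio
```

For instance, with a 12px gap, 8px average spacing and 40px average
digit width, the real gap is 4px and 4/40 = 0.1, well under 0.695, so
the space would be dropped.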

Patrick

On Jul 29, 11:43 am, KAH <[email protected]> wrote:
> I have an image here: http://dl.dropbox.com/u/1531272/pg1-CROP-OCR.jpg
> This image when run through the tesseract renders out three words...
>
> 05
> 04571
> 6
>
> I have adjusted tosp_table_xht_sp_ratio to no avail... I cannot
> understand why 6 is not included in the 04571 word. In looking at the
> characters that are returned the height of 1 is 69px and the space to
> the next character 6 is 12px. Even using the default value for
> tosp_table_xht_sp_ratio of .33 should yield a space of 69*.33 = 23px
> for spacing - which would make this 6 come into the same grouping.
>
> Can anyone offer a view into this that helps me understand why the 6
> is not read as part of the 045716 word?
>
> Thanks
