Tesseract's space handling needs a total rewrite - it's not me saying so, it's Ray Smith in a previous post in this forum.
Specifically, after the digit '1' Tesseract appears to struggle more than usual with spaces, probably because it attaches too much importance to the width of the last character instead of just considering the average width of digits on that line.

We have specific post-Tesseract code in ScanBizCards that handles spaces:

Just Tesseract 3.0: http://www.scanbizcards.com/tess.jpg
After we check spaces: http://www.scanbizcards.com/tess_corrected.jpg

A simple approach as follows should work:

- compute the average width of digits, excluding the digit '1' because it has an unusually narrow width
- after a '1' where Tess chose to have a space, compare the spacing between '1' and the next digit to the average digit width, and if it is less than what you visually consider a space (take your pick what that threshold is), remove the space
- NOTE: this doesn't work so well for italic lines; either abstain in that case (you can detect italic based on the space between digits being negative or very small) or adjust for it

If you care to use our empirical ratio, we do this:

- calculate the "real gap" as the gap minus the average spacing on that line
- if "real gap" divided by the average digit width is less than 0.695, we consider it a bad space

Patrick

On Jul 29, 11:43 am, KAH <[email protected]> wrote:
> I have an image here: http://dl.dropbox.com/u/1531272/pg1-CROP-OCR.jpg
> This image when run through the tesseract renders out three words...
>
> 05
> 04571
> 6
>
> I have adjusted tosp_table_xht_sp_ratio to no avail... I cannot
> understand why 6 is not included in the 04571 word. In looking at the
> characters that are returned the height of 1 is 69px and the space to
> the next character 6 is 12px. Even using the default value for
> tosp_table_xht_sp_ratio of .33 should yield a space of 69*.33 = 23px
> for spacing - which would make this 6 come into the same grouping.
>
> Can anyone offer a view into this that helps me understand why the 6
> is not read as part of the 045716 word?
>
> Thanks
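For anyone who wants to try the space check described in the reply, here is a rough Python sketch of it. It is not the ScanBizCards code: the `Glyph` class, its fields, and the function name are made up for illustration, and it assumes you already have per-character boxes for one text line from your OCR output. Two points are my own reading of the description and may need adjusting: "average spacing on that line" is taken as the mean gap between neighbouring characters that have no space between them, and a line is treated as italic (and left alone) when that mean gap is zero or negative. Only the 0.695 ratio comes from the post.

```python
from dataclasses import dataclass
from statistics import mean

BAD_SPACE_RATIO = 0.695  # empirical ratio quoted in the post


@dataclass
class Glyph:  # hypothetical container for one recognized character
    char: str
    left: int                  # x of the left edge, in pixels
    right: int                 # x of the right edge, in pixels
    space_after: bool = False  # True if the OCR put a space after it


def remove_bad_spaces_after_one(line):
    """Return the text of one line, dropping suspicious spaces after '1'."""
    # Average digit width, excluding the unusually narrow '1'.
    widths = [g.right - g.left for g in line
              if g.char.isdigit() and g.char != "1"]
    # Assumption: "average spacing" = mean gap between neighbours that the
    # OCR did NOT separate with a space.
    inner_gaps = [b.left - a.right
                  for a, b in zip(line, line[1:]) if not a.space_after]
    # Abstain if there is nothing to measure, or the line looks italic
    # (assumed here to mean a zero or negative average gap).
    abstain = not widths or not inner_gaps or mean(inner_gaps) <= 0
    if not abstain:
        avg_width = mean(widths)
        avg_gap = mean(inner_gaps)

    out = []
    for i, g in enumerate(line):
        out.append(g.char)
        if not g.space_after or i + 1 == len(line):
            continue
        nxt = line[i + 1]
        keep = True
        if not abstain and g.char == "1" and nxt.char.isdigit():
            real_gap = (nxt.left - g.right) - avg_gap
            keep = real_gap / avg_width >= BAD_SPACE_RATIO
        if keep:
            out.append(" ")
    return "".join(out)


# Made-up boxes loosely modelled on the "04571 6" case: 40px-wide digits,
# 6px gaps inside the word, and a 12px gap after the '1'.
line = [
    Glyph("0", 0, 40), Glyph("4", 46, 86), Glyph("5", 92, 132),
    Glyph("7", 138, 178), Glyph("1", 184, 199, space_after=True),
    Glyph("6", 211, 251),
]
print(remove_bad_spaces_after_one(line))  # prints 045716
```

With these invented numbers the real gap is 12 - 6 = 6px, and 6 / 40 = 0.15 is well under 0.695, so the space is dropped. A gap of 60px after the '1' would give 54 / 40 = 1.35 and the space would be kept.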

