This is a great example of a serious problem Tesseract has when analyzing any image with fonts of varying sizes, such as a street sign, flyer, or business card. Tesseract's adaptive classifier makes assumptions about letter heights and uses that knowledge when recognizing the characters that follow. This is right and useful when parsing a word or (to a lesser degree) a sentence of words separated by spaces, because in that case it makes sense to assume uniformity. However, it is dead wrong when dealing with different blocks. In your case, the tall bar is separated by enough space that it should be treated as a different block, and that letter should NOT cause Tesseract to assume ANYTHING about letter height when it tackles the next block with the phone number.
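Until that is handled inside Tesseract, one possible workaround is to split the image into blocks yourself wherever the horizontal gap is much wider than a normal space, and recognize each block on its own (for instance in a separate Tesseract run), so the bar cannot influence the phone number. The sketch below only shows the splitting step. It assumes you have already counted the dark pixels in each pixel column; `split_into_blocks`, `column_ink`, and `min_gap` are hypothetical names chosen for illustration, not part of Tesseract or tessnet.

```python
from typing import List, Tuple


def split_into_blocks(column_ink: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, for each block.

    column_ink[x] is the number of dark pixels in pixel column x.
    Two inked columns belong to different blocks when at least
    `min_gap` completely empty columns lie between them.
    """
    blocks: List[Tuple[int, int]] = []
    start = None     # first inked column of the block being built
    last_ink = None  # most recent column that contained ink

    for x, ink in enumerate(column_ink):
        if ink == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The empty run is wide enough: close the current block.
            blocks.append((start, last_ink + 1))
            start = x
        last_ink = x

    if start is not None:
        blocks.append((start, last_ink + 1))
    return blocks


# A tall bar (columns 1-2), a wide gap, then two characters one space apart.
profile = [0, 9, 9, 0, 0, 0, 0, 0, 3, 4, 0, 3, 5, 0]
blocks = split_into_blocks(profile, min_gap=4)
```

Here the bar and the text end up in separate ranges, which you could then crop and feed to the OCR engine one at a time. The right `min_gap` depends on the image; a few times the width of an ordinary inter-word space is a reasonable starting guess.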
The good news is that the fix required in Tesseract is really not that hard: it is essentially a matter of resetting the adaptive classifier between blocks (separated by a space larger than a blank, either vertically or, as in your example, horizontally). Even better news: Jimmy is working on it ...

On Jul 18, 11:40 pm, KAH <[email protected]> wrote:
> I have two files....
>
> http://dl.dropbox.com/u/1531272/pg1-CROP.jpg
> and http://dl.dropbox.com/u/1531272/pg1-CROP-Lines.jpg
>
> Note on the "Lines" file there are dark lines on the left and right
> side of this image.
> I am trying to understand why the tessnet dll would render such
> different readings for this image.
>
> Can anyone offer some help or understanding regarding how this product
> reads that would cause this? Additionally if there are any variables
> I would set that would help I would love to have some direction on
> them.
>
> Thank you for your help.
> KAH

