On 19 July 2010 13:20, patrickq <[email protected]> wrote:
> This is a great example of a serious problem with Tesseract when
> analyzing any image with fonts of variable sizes such as a street
> sign, flyer, business card etc. What happens is that Tesseract's
> adaptive classifier makes assumptions about letter heights and uses
> that knowledge when recognizing the next characters. This is right and
> useful when parsing a word or (to a lesser degree but still) a
> sentence with words separated by spaces because in that case it makes
> sense to assume uniformity. However it is dead wrong when dealing with
> different blocks. In your case, the tall bar is separated by enough
> space that it should be treated as a different block and that letter
> should NOT cause Tesseract to assume ANYTHING about letter height when
> it tackles the next block with the phone number.
>
> The good news is that the fix required in Tesseract is really not that
> hard, it's essentially about resetting the adaptive classifier between
> blocks (separated by space larger than a blank vertically or like your
> example, horizontally). Even better news: Jimmy is working on it ...

Well, it won't do him any good because he's using tessnet2, so he won't
get the fix if/when I find it.

Actually, my current thought is that setting segmentation to line mode
might be enough to solve this problem, but I haven't gotten around to
checking. I'm a little too wrapped up in internationalising Tesseract
(which is an issue a little closer to my own interests).

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
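For anyone who wants to experiment before a fix lands, here is a minimal sketch of the "separate blocks" idea from the quoted message: split a line wherever the horizontal gap is wider than a normal blank, so each piece can be recognized on its own. This is not Tesseract code. The function name split_into_blocks, the column_has_ink input (one flag per pixel column, true where the column contains ink) and the max_gap threshold are all assumptions made for illustration.

```python
def split_into_blocks(column_has_ink, max_gap):
    """Group pixel columns into blocks separated by wide blank runs.

    column_has_ink: sequence of truthy/falsy flags, one per pixel column
                    (hypothetical input, e.g. from a column projection).
    max_gap:        widest blank run, in columns, that still counts as an
                    ordinary space inside a block (an assumed threshold).
    Returns a list of (start, end) column ranges, end exclusive.
    """
    blocks = []
    start = None      # first inked column of the current block
    last_ink = None   # most recent inked column seen
    for x, ink in enumerate(column_has_ink):
        if not ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 > max_gap:
            # The blank run is wider than a normal space: close the block.
            blocks.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        blocks.append((start, last_ink + 1))
    return blocks


# A tall bar (columns 0-1), a wide gap, then two glyphs one column apart.
profile = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
blocks = split_into_blocks(profile, max_gap=2)
```

Each returned range could then be cropped and passed to the OCR engine as a separate image. Whether that actually stops the adaptive classifier from carrying letter-height assumptions from one block to the next is untested here, as is the line-mode idea above.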

