On 19 July 2010 13:20, patrickq <[email protected]> wrote:
> This is a great example of a serious problem with Tesseract when
> analyzing any image with fonts of variable sizes such as a street
> sign, flyer, business card etc. What happens is that Tesseract's
> adaptive classifier makes assumptions about letter heights and uses
> that knowledge when recognizing the next characters. This is right and
> useful when parsing a word or (to a lesser degree) a sentence with
> words separated by spaces, because in that case it makes sense to
> assume uniformity. However, it is dead wrong when dealing with
> different blocks. In your case, the tall bar is separated by enough
> space that it should be treated as a different block and that letter
> should NOT cause Tesseract to assume ANYTHING about letter height when
> it tackles the next block with the phone number.
>
> The good news is that the fix required in Tesseract is really not that
> hard: it is essentially about resetting the adaptive classifier between
> blocks (separated by a space larger than a blank, either vertically or,
> as in your example, horizontally). Even better news: Jimmy is working on it ...
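
To make the "blocks separated by more than a blank" idea concrete, here
is a rough sketch. It is not Tesseract code: the function name, the
(left, right) box format and the blank_width threshold are all made up
for illustration, and it only covers the horizontal case. A real fix
would reset the adaptive classifier at each boundary it finds.

```python
def split_into_blocks(boxes, blank_width):
    """Group character boxes into blocks separated by large horizontal gaps.

    boxes       -- list of (left, right) x-extents (hypothetical format)
    blank_width -- assumed width of an ordinary inter-word space
    """
    blocks = []
    prev_right = None
    for left, right in sorted(boxes):
        if prev_right is None or left - prev_right > blank_width:
            # Gap is wider than a blank: start a new block. This is the
            # point where adaptive state would be reset in a real fix.
            blocks.append([])
            prev_right = right
        else:
            # Same block; track the furthest right edge seen so far.
            prev_right = max(prev_right, right)
        blocks[-1].append((left, right))
    return blocks
```

With a tall bar at x = 0..10 and digits starting at x = 60, a
blank_width of 12 would put the bar in one block and the digits in
another.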

Well, it won't do him any good because he's using tessnet2, so he
won't get the fix if/when I find it.

Actually, my current thought is that setting the page segmentation mode
to single-line mode might be enough to solve this problem, but I haven't
gotten around to checking. I'm a little too wrapped up in
internationalising Tesseract (which is a little closer to my own
interests).
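
For anyone who wants to try the line-mode idea themselves, a minimal
sketch follows. It assumes a tesseract command-line binary on the PATH
that accepts "--psm 7" (treat the image as a single text line); older
3.x builds spell the option "-psm 7". The helper names and the file name
are made up, and whether this actually fixes the problem is untested.

```python
import subprocess


def line_mode_command(image_path, psm=7):
    """Build a tesseract command line that uses single-line segmentation.

    Assumes the "--psm" spelling of the option; "stdout" as the output
    base makes tesseract print the text instead of writing a file.
    """
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def ocr_single_line(image_path):
    """Run tesseract in line mode and return the recognised text."""
    result = subprocess.run(
        line_mode_command(image_path),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout.strip()
```

Cropping the image to just the phone number before calling
ocr_single_line("phone_number.png") would keep the tall bar out of the
picture entirely.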

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
