From time to time, Patrick points out to me some really odd errors he
gets that are caused by the fixxht code (the code that second-guesses
case). Part of this is a missing feature I mentioned a while back:
Tesseract lacks resegmentation of blocks, and so tries to normalise
to an unreasonable baseline on text with different heights. But
that's not the only thing that's happening.

X-height fixing happens quite late in processing, and then only if
nothing reasonable has been seen. So switching off x-height fixing
will only get you results that are still crap, but in a less
surprising way. In the 'John Doe' image, after it was turned off, the
result was 'Jo |ih Dob', or something like it. The spacing error is
another one that needs fixing; the misread 'h' and 'n' are
understandable, and probably would not have happened had the spacing
error not been there. The one that really surprises me is 'b' instead
of 'e' (possibly a speck being thresholded into something larger).

So, the moral is: if you're getting uppercase gibberish instead of
lowercase output, it's the fixxht code, but something else went wrong
in recognition to get you there.
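
If you want to spot that symptom automatically in a batch of results,
something like the following rough Python sketch might do. To be
clear, this is not anything in Tesseract: the function name, the 0.6
threshold and the minimum letter count are all made up for
illustration, and it will happily flag text that really is set in
capitals, so treat a hit as "go and look at this one", nothing more.

```python
def looks_like_xheight_misfire(text, threshold=0.6, min_letters=4):
    """Hypothetical heuristic: flag OCR output dominated by uppercase letters.

    The threshold and min_letters values are assumptions, not tuned figures.
    """
    # Only letters carry case information; ignore digits, spaces, punctuation.
    letters = [c for c in text if c.isalpha()]
    if len(letters) < min_letters:
        return False  # too little evidence either way
    upper = sum(1 for c in letters if c.isupper())
    return upper / len(letters) >= threshold
```

Fed the output for an image you know to be mixed case, a True result
suggests the case second-guessing kicked in, which in turn means it is
worth looking for the earlier recognition problem (spacing, baseline,
thresholding) that triggered it.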

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
