From time to time, Patrick points out to me some really odd errors he gets that are caused by the fixxht code (the code that second-guesses case). Part of this is down to a missing feature I mentioned a little while back: Tesseract lacks resegmentation of blocks, and so tries to normalise text of different heights to an unreasonable baseline. But that's not the only thing that's happening.
X-height fixing happens quite late in processing, and then only if nothing reasonable has been seen. So switching off x-height fixing will only get you results that are still crap, but in a less surprising way. In the 'John Doe' image, after it was turned off, the result was 'Jo |ih Dob', or something like it. The spacing error is another problem that needs fixing; the misread 'h' and 'n' are understandable, and probably would not have happened had the spacing error not been there. The one that really surprises me is 'b' instead of 'e' (possibly a speck being thresholded into something larger). So the moral is: if you're getting uppercase gibberish instead of lowercase output, it's the fixxht code, but something else went wrong in recognition to get you there.

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

