That's a well-known x-height issue in Tesseract.
As far as I know, there is currently no easy way to fix it.
If you know for sure that your input is all lowercase, you might want to
set the blacklist and whitelist accordingly.
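
A rough sketch of what I mean by the whitelist, in Python, calling the
tesseract command line. The helper names (build_tesseract_command,
ocr_lowercase_line) and the file name com.png are made up for illustration.
tessedit_char_whitelist is the Tesseract parameter for this (there is a
matching tessedit_char_blacklist). The -c and --psm flags are an assumption
about your version: older 3.x releases spell the second one -psm or expect the
variable in a config file instead, and the LSTM engine in 4.0 may ignore the
whitelist, so check against your install.

```python
import shutil
import string
import subprocess

# Assumption: the input is known to contain only a-z.
LOWERCASE = string.ascii_lowercase


def build_tesseract_command(image_path, whitelist=LOWERCASE, psm=7):
    """Build a tesseract CLI call that only allows characters in `whitelist`.

    psm=7 asks Tesseract to treat the image as a single text line.
    """
    return [
        "tesseract", image_path, "stdout",
        "--psm", str(psm),
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def ocr_lowercase_line(image_path):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Only prints the command; call ocr_lowercase_line("com.png") to run it.
    print(" ".join(build_tesseract_command("com.png")))
```

This only stops uppercase letters from being returned. It does not change how
Tesseract estimates the x-height, so results on all-lowercase lines may still
be poor.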

On Mar 7, 12:36 am, patrickq <[email protected]> wrote:
> When Tesseract gets an image with a single line (as well as lines
> within a page - but keeping it simple) of uniform height (all
> characters have the same height), Tesseract assumes this is an all
> uppercase line. This is a somewhat reasonable assumption because most
> lowercase lines will have some characters that are taller (e.g. k,l,f)
> and some going below the base line (e.g. p,q,j).
>
> However, this has a very bad effect on lines that are actually all
> lowercase. For example, scanning "com" on its own will return
> something like "COITI" because "ITI" is the best match for 'm' if you
> are forced to assume it's a tall/uppercase pattern.
>
> Does anyone have insights on handling this issue? Is there some
> parameter within Tesseract providing a better handling for this
> situation?
>
> Thanks,
> Patrick
