When Tesseract is given an image of a single line of text in which
all characters have the same height, it assumes the line is all
uppercase. (The same happens for lines within a page, but a single
line keeps the example simple.) This is a somewhat reasonable
assumption, because most lowercase lines contain some characters with
ascenders (e.g. k, l, f) and some with descenders that go below the
baseline (e.g. p, q, j).
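
To make the condition concrete, here is a rough Python sketch of what
I mean by a "uniform height" line. It is not Tesseract's code: Box,
looks_uniform_height, the tolerance value, and the pixel coordinates
are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of one character in pixels (y grows downward)."""
    top: int
    bottom: int


def looks_uniform_height(boxes, tolerance=0.1):
    """Return True when every box shares roughly the same top and bottom.

    Such a line has no ascenders or descenders, so its shape alone does
    not tell an all-uppercase line apart from one like "com".
    The tolerance is a hypothetical fraction of the tallest box height.
    """
    if not boxes:
        return False
    tallest = max(b.bottom - b.top for b in boxes)
    if tallest <= 0:
        return False
    slack = tolerance * tallest
    tops = [b.top for b in boxes]
    bottoms = [b.bottom for b in boxes]
    return (max(tops) - min(tops) <= slack
            and max(bottoms) - min(bottoms) <= slack)


# Illustrative (invented) boxes for "com": c, o, m all sit between the
# x-height line and the baseline.
com = [Box(10, 30), Box(10, 30), Box(10, 30)]

# Illustrative (invented) boxes for "pack": p has a descender, k an ascender.
pack = [Box(10, 38), Box(10, 30), Box(10, 30), Box(0, 30)]

print(looks_uniform_height(com))
print(looks_uniform_height(pack))
```

With these made-up boxes, "com" counts as uniform and "pack" does not,
which is the case where the uppercase assumption goes wrong.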

However, this badly hurts recognition of lines that really are all
lowercase. For example, scanning "com" on its own returns something
like "COITI", because "ITI" is the best match for "m" once the
recognizer is forced to treat it as a tall/uppercase pattern.

Does anyone have insight into handling this issue? Is there a
parameter in Tesseract that handles this situation better?

Thanks,
Patrick
