When Tesseract is given an image of a single line of uniform height, meaning all characters have the same height, it assumes the line is all uppercase. (The same applies to lines within a page, but I'll keep it simple.) This is a fairly reasonable assumption, because most lowercase lines contain some taller characters (e.g. k, l, f) and some that descend below the baseline (e.g. p, q, j).
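To make the behaviour described above concrete, here is a toy sketch of that kind of heuristic. It is not Tesseract's actual code: the `GlyphBox` type, the function names, and the 15% tolerance are all hypothetical, chosen only to illustrate the reasoning.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GlyphBox:
    """Hypothetical glyph bounding box; y grows downward as in image coordinates."""
    top: float
    bottom: float


def looks_uniform_height(boxes: List[GlyphBox], tolerance: float = 0.15) -> bool:
    """Return True if every glyph has roughly the same top and bottom edge.

    The tolerance is an assumed fraction of the median glyph height.
    """
    if not boxes:
        return False
    heights = sorted(b.bottom - b.top for b in boxes)
    median_height = heights[len(heights) // 2]
    if median_height <= 0:
        return False
    slack = tolerance * median_height
    tops = [b.top for b in boxes]
    bottoms = [b.bottom for b in boxes]
    # No ascenders (tops aligned) and no descenders (bottoms aligned).
    return (max(tops) - min(tops) <= slack) and (max(bottoms) - min(bottoms) <= slack)


def guess_case(boxes: List[GlyphBox]) -> str:
    """Toy version of the assumption: uniform height is read as uppercase."""
    return "uppercase" if looks_uniform_height(boxes) else "mixed-or-lowercase"


# "com": three x-height glyphs with identical tops and bottoms.
com = [GlyphBox(10, 20), GlyphBox(10, 20), GlyphBox(10, 20)]
# "kep": 'k' has an ascender, 'p' has a descender.
kep = [GlyphBox(0, 20), GlyphBox(10, 20), GlyphBox(10, 28)]

com_guess = guess_case(com)
kep_guess = guess_case(kep)
```

With these made-up boxes, "kep" is recognised as containing lowercase letters because of its ascender and descender, while "com" gives the heuristic nothing to go on and is treated as uppercase.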
However, this assumption has a very bad effect on lines that really are all lowercase. For example, scanning "com" on its own returns something like "COITI", because "ITI" is the best match for 'm' if you are forced to assume a tall, uppercase pattern. Does anyone have insight into handling this issue? Is there a parameter in Tesseract that handles this situation better?

Thanks,
Patrick

