That's a well-known x-height issue in Tesseract. As far as I know, there is currently no easy way to fix it. If you know for sure that your input is all lowercase, you might want to set the character blacklist and whitelist accordingly.
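A minimal sketch of the whitelist idea, in case it helps. It assumes your Tesseract build accepts `-c name=value` on the command line and honours the `tessedit_char_whitelist` variable (some versions and engine modes ignore it, so check yours). The file names `com.png` and `com_out` and both helper functions are made up for illustration.

```python
import string
import subprocess


def build_tesseract_command(image_path, output_base,
                            whitelist=string.ascii_lowercase):
    """Build an argv list that restricts recognition to `whitelist`.

    Assumption: the installed tesseract supports `-c` and the
    tessedit_char_whitelist variable.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        "tessedit_char_whitelist=" + whitelist,
    ]


def run_ocr(image_path, output_base):
    """Run tesseract; it writes the text to <output_base>.txt."""
    subprocess.run(build_tesseract_command(image_path, output_base),
                   check=True)


if __name__ == "__main__":
    # Hypothetical file names; only prints the command, does not run it.
    print(" ".join(build_tesseract_command("com.png", "com_out")))
```

With only lowercase letters allowed, the engine cannot emit "COITI" for "com", though whether the x-height guess still hurts the match quality is something you would have to test on your own images.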
On Mar 7, 12:36 am, patrickq wrote:
> When Tesseract gets an image with a single line (as well as lines
> within a page - but keeping it simple) of uniform height (all
> characters have the same height), Tesseract assumes this is an all
> uppercase line. This is a somewhat reasonable assumption because most
> lowercase lines will have some characters that are taller (e.g. k,l,f)
> and some going below the base line (e.g. p,q,j).
>
> However, this has a very bad effect on lines that are actually all
> lowercase. For example, scanning "com" on its own will return
> something like "COITI" because "ITI" is the best match for 'm' if you
> are forced to assume it's a tall/uppercase pattern.
>
> Does anyone have insights on handling this issue? Is there some
> parameter within Tesseract providing a better handling for this
> situation?
>
> Thanks,
> Patrick

