I discovered https://code.google.com/p/tesseract-ocr/wiki/FAQ#Output_without_result_or_bad_output shortly after posting my question. I then modified my image to include 20px of padding around the word, and my tesseract test for that image now passes. The padding seems to have improved reliability across my current set of test images, but the results are still far from perfect. Also, most of my images actually contain more than 4 characters.
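For anyone wanting to try the same thing, here is a minimal sketch of the padding step. It is not the code from my application: the function name `pad_image`, the row-major list-of-rows representation, and the white (255) fill are all assumptions made for illustration, and the fill value only makes sense for dark text on a light background. In a real pipeline an imaging library would do this on the actual image before it is handed to tesseract.

```python
def pad_image(pixels, pad=20, fill=255):
    """Return a copy of a grayscale image with a border added on every side.

    pixels: list of equal-length rows of 0-255 ints (hypothetical format).
    pad:    border width in pixels (20 is the value that worked for me).
    fill:   border value; 255 assumes dark text on a white background.
    """
    if pad < 0:
        raise ValueError("pad must be non-negative")
    width = len(pixels[0]) if pixels else 0
    new_width = width + 2 * pad

    # Blank rows above the text.
    padded = [[fill] * new_width for _ in range(pad)]
    # Original rows with blank columns on the left and right.
    for row in pixels:
        padded.append([fill] * pad + list(row) + [fill] * pad)
    # Blank rows below the text.
    padded.extend([fill] * new_width for _ in range(pad))
    return padded


if __name__ == "__main__":
    word = [[0, 0, 0], [0, 255, 0]]           # tiny 3x2 stand-in for a cropped word
    padded = pad_image(word, pad=20)
    print(len(padded), len(padded[0]))        # 42 43
```

The input is left unmodified and a new image is returned, so the original crop can still be used for comparison runs with and without padding.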
On Tuesday, 7 January 2014 08:47:33 UTC-8, sventech wrote:
>
> Hi Michael,
> This is a known issue -- tesseract does not handle very small isolated
> text well by default. usually one needs 4 or more characters. Have you
> tried different page segmentation modes (PSM)?
> --Sven
>
>
> On Mon, Jan 6, 2014 at 5:53 PM, Michael Beauregard <
> [email protected]> wrote:
>
>> Hey everyone,
>>
>> I hesitate to post this as I'm likely just making rookie mistakes, but
>> perhaps this particular test image will prove to be useful for learning
>> about tesseract.
>>
>> My application uses domain specific constraints to pre-segment the blocks
>> of interest and each image passed to tesseract will always contain a single
>> line of text. The attached input image containing 'AB' is a good example of
>> the type of images I expect to have after segmentation. Several images with
>> phone numbers or addresses are correctly recognized by tesseract, but I was
>> surprised to see that the output for the 'AB' image was completely wrong.
>>
>> Although I'm using the api in my application, I was able to reproduce the
>> exact same results with the command line using the following command:
>>
>> tesseract AB.png AB-output -psm 6
>>
>>
>> the resulting 'AB-output.txt' contains:
>>
>> Eā-3
>>
>>
>> Having read through many past messages in the group, I'm worried that the
>> only way to get reliable results from tesseract is to train it with my
>> input images. However, considering that many other fields from this same
>> label are interpreted correctly, I feel that there must be something else
>> going on. Any help understanding what is going on here would be wonderful.
>>
>> Cheers,
>>
>> Michael
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
---
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.

