Dmitri gave the detailed answer. A possible short-cut: try higher-resolution images.
Another short-cut: pre-process the images with GraphicsMagick to get a photocopy-like effect, so that Tesseract can choose a correct threshold value. My previous posts might help.

On Saturday, May 30, 2015 at 10:17:56 AM UTC-4, S Kirkwood wrote:
> Hi, I am working on a project that requires OCR. I have not used Tesseract much before, aside from using it on some basic examples with the command line tool. My goal is to use OCR on insurance cards to get all of the characters and then find certain information, such as the ID of the cardholder, from the output. In this, accuracy is critical, as a single misread character messes up the entire ID.
>
> My concern stems from this need for extreme accuracy, which, judging from this discussion thread <https://groups.google.com/forum/#!topic/tesseract-ocr/YO9XhsAWW_k>, appears to be possible only by running the character recognition on each individual character on the card. The following quote is where I draw most of my worries from:
>
>> But if accuracy is critical in your app, in the long run I would absolutely avoid using any parts of Tesseract except char classifier. I.e. crop every single char out of your source image and run Tess in the single char PSM. I think it should be easy as long as location of every character is quite stable among your source images. ImageMagick/shell scripts would suffice.
>
> However, the images I will be processing differ vastly in layout - not stable like the example I linked to.
> Some examples of how the format may differ follow:
>
> <https://lh3.googleusercontent.com/-mPGe6BSmfSU/VWiQQMzkD8I/AAAAAAAAAA8/1WwUjQpPRkE/s1600/Sample_Card_2.jpg>
> <https://lh3.googleusercontent.com/-ovzD1qb6x8g/VWiQWG6zP-I/AAAAAAAAABE/Sb6vNLozPoY/s1600/Sample_Card_3.jpg>
> <https://lh3.googleusercontent.com/-K78wt72YzXA/VWiQinq_wiI/AAAAAAAAABM/wcYKEzXBYdI/s1600/Sample_Card_4.jpg>
>
> I have run Tesseract on samples, and while it works for most of the characters, there are cases where it misreads a single character (such as registering an "H" when the character is a "W") or, even worse, an entire phrase (such as registering "No New Rum" when the phrase is actually "No Referral Required"). Because of errors like this, I would not be able to use the output that Tesseract currently gives me.
>
> Is there a realistic way to use Tesseract for this kind of endeavor?
>
> Thanks for taking the time to read,
> Scott
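
To make the pre-processing short-cut at the top of this reply a bit more concrete, here is a minimal sketch in Python. It assumes the `gm` (GraphicsMagick) and `tesseract` binaries are on the PATH. The function names, the 200% scale factor, the output file name and the particular clean-up steps (grayscale, upscale, contrast stretch) are illustrative assumptions, not the settings from the earlier posts mentioned above.

```python
import shutil
import subprocess


def build_preprocess_cmd(src, dst, scale_percent=200):
    """Build a GraphicsMagick command: grayscale, upscale, stretch contrast.

    The option choice is an assumption about what a "photocopy-like"
    clean-up could look like; tune it on your own card images.
    """
    return [
        "gm", "convert", str(src),
        "-colorspace", "Gray",           # drop colour so backgrounds flatten out
        "-resize", f"{scale_percent}%",  # more pixels per character
        "-normalize",                    # stretch contrast before binarization
        str(dst),
    ]


def build_ocr_cmd(image):
    """Build a Tesseract command that writes the recognized text to stdout."""
    return ["tesseract", str(image), "stdout"]


def ocr_card(src, cleaned="card_clean.png"):
    """Clean up one image with GraphicsMagick, then OCR it with Tesseract."""
    for tool in ("gm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    subprocess.run(build_preprocess_cmd(src, cleaned), check=True)
    result = subprocess.run(
        build_ocr_cmd(cleaned), check=True, capture_output=True, text=True
    )
    return result.stdout
```

Calling `ocr_card("Sample_Card_2.jpg")` would return the recognized text as a string. Comparing the output with and without the clean-up step on a handful of cards is probably the quickest way to see whether it helps with misreads like "H" for "W".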

