I am working on getting Tesseract to recognize VINs for an application I am developing. I have a clean VIN image (preprocessed to be black text on a white background). As a first attempt, I have a traineddata file built from the fonts Courier, HelveticaNeue, LatoBold, LatoLight, OpenSans, and RobotoSlab. I've also limited the unicharset to A-Z (except I and O) and 0-9.
The results are not very good. Tesseract returns many more characters than the 17 actually present. Is there a way to limit Tesseract to detecting a single 17-character word on one line? I'd also like Tesseract to prefer, but not require, that the last 5 characters be digits. A few other preferences might help too, but I want to start with these, and I'm not sure how to set them up. Any further suggestions for cleaning up the OCR so it reads more accurately would also be helpful. I can't post the full data and image here (they're VINs, so I'd need permission to do so), but I can say that in one instance WM is coming back as 6W6M, and in another the digits 67258 are coming back as 572S5. Any guidance would be appreciated. I'll provide whatever information I can. Thanks!
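To make the two constraints concrete, here is a rough Python sketch of them as a post-filter over candidate strings. It is only an illustration under assumptions: it does not call Tesseract or change how recognition works, the function names (`is_plausible_vin`, `trailing_digit_score`, `pick_best`) are hypothetical, and the sample strings are made-up placeholders, not real VINs.

```python
# Allowed characters as described above: A-Z without I and O, plus 0-9.
ALLOWED = set("ABCDEFGHJKLMNPQRSTUVWXYZ0123456789")
VIN_LENGTH = 17


def is_plausible_vin(text):
    """Hard constraint: exactly 17 characters, all from the allowed set."""
    return len(text) == VIN_LENGTH and all(ch in ALLOWED for ch in text)


def trailing_digit_score(text):
    """Soft preference: count how many of the last 5 characters are digits."""
    return sum(ch.isdigit() for ch in text[-5:])


def pick_best(candidates):
    """Return the plausible candidate with the most trailing digits, or None."""
    plausible = [c for c in candidates if is_plausible_vin(c)]
    if not plausible:
        return None
    # max() keeps the first candidate on ties, so input order breaks ties.
    return max(plausible, key=trailing_digit_score)


if __name__ == "__main__":
    # Placeholder candidates, e.g. alternative readings of the same line.
    readings = ["ABCDEFGH1234572S5", "ABCDEFGH123467258", "6W6M"]
    print(pick_best(readings))
```

The length and character checks are treated as hard requirements, while the digit count is only used for ranking, which matches "prefer, but not require". What I am really after is a way to express the same thing inside Tesseract itself.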

