[tesseract-ocr] Train Tesseract to Only Find a Single 17 Character Word

steven Tue, 11 Nov 2014 08:54:36 -0800


I am working on getting Tesseract to recognize VINs for an application I am 
developing. I have a clean VIN image (preprocessed as a workaround so that it 
is black text on a white background). I have built traineddata using the 
fonts Courier, HelveticaNeue, LatoBold, LatoLight, OpenSans, and RobotoSlab 
as a first attempt. I've also limited the unicharset to A-Z (except I and O) 
and 0-9.


The result is not very good. It returns far more characters than are 
actually present (17). Is there a way to limit Tesseract to detecting only a 
single 17-character word on one line? I'd also like to have Tesseract 
prefer, but not require, the last 5 characters to be digits. There are a few 
other preferences that may help too, but I want to start with these. I'm not 
sure how to go about setting up those preferences.
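
To make the two preferences concrete, here is a minimal Python sketch of what 
they could look like as a post-processing step on the raw OCR string. This is 
not something Tesseract does itself, and it is only an illustration under 
assumptions: the names VIN_CHARS, score_candidate and best_vin_candidate are 
hypothetical, the alphabet is the one described above (A-Z without I and O, 
plus 0-9), and the sample string in the usage note is made up.

```python
# Alphabet as described in the post: A-Z without I and O, plus digits.
# (Assumption: nothing else is excluded.)
VIN_CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZ0123456789"
VIN_LENGTH = 17


def score_candidate(candidate):
    """Soft preference: count how many of the last five characters are digits."""
    return sum(ch.isdigit() for ch in candidate[-5:])


def best_vin_candidate(raw_text):
    """Return the 17-character window that best fits the preferences, or None."""
    # Drop everything outside the allowed alphabet (spaces, punctuation, I, O).
    cleaned = "".join(ch for ch in raw_text.upper() if ch in VIN_CHARS)
    if len(cleaned) < VIN_LENGTH:
        return None
    # Every contiguous 17-character slice of the cleaned text is a candidate.
    windows = [
        cleaned[i:i + VIN_LENGTH]
        for i in range(len(cleaned) - VIN_LENGTH + 1)
    ]
    # max() keeps the first window on ties, so earlier windows win.
    return max(windows, key=score_candidate)
```

For example, best_vin_candidate("1HGCM82633A004352AB") keeps the first 17 
characters, because that window is the only one whose last five characters 
are all digits. A window-based filter like this only trims extra characters 
at the ends; it cannot repair insertions in the middle of the word.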

Also, any suggestions beyond these for cleaning up the OCR so that it reads 
more correctly would be helpful. I can't post the full data and image here 
(they're VINs, and I'd need permission to do so), but I can say that in one 
instance WM is coming back as 6W6M, and in another the digits 67258 are 
coming back as 572S5.

Any guidance would be appreciated. I'll provide whatever information I can.

Thanks!

