On Wed, Nov 12, 2014 at 2:13 AM, <[email protected]> wrote:
> The user-patterns looks helpful, but I can't find any documentation on
> formatting or how it works. Is there documentation on this somewhere?

Did you see the man page? I also sent a link to a related discussion in the past. Search the archives for other tips.

https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html says:

"if you pass the word *bazaar* as a trailing command line parameter to Tesseract, Tesseract will not bother loading the system dictionary nor the dictionary of frequent words and will load and use the eng.user-words and eng.user-patterns files you provided. The former is a simple word list, one per line. The format of the latter is documented in dict/trie.h on read_pattern_list()."

See lines 199-232 of https://code.google.com/p/tesseract-ocr/source/browse/dict/trie.h

> On Tuesday, November 11, 2014 10:50:57 AM UTC-6, [email protected] wrote:
>>
>> I am working on getting Tesseract to recognize VINs for an application I
>> am developing. I have a clean VIN image (preprocessed so that it is black
>> text on a white background). I have traineddata using the fonts Courier,
>> HelveticaNeue, LatoBold, LatoLight, OpenSans, and RobotoSlab as a first
>> attempt. I've also limited the unicharset to A-Z (except I and O) and 0-9.
>>
>> The result is not very good. It returns far more characters than the 17
>> that are present. Is there a way to limit Tesseract to detecting only a
>> 17-character word on one line? I'd also like to have Tesseract prefer,
>> but not require, the last 5 characters to be digits. There are a few
>> other preferences that may help too, but I want to start with these. I'm
>> not sure how to go about setting up those preferences.
>>
>> Also, any suggestions beyond these for cleaning up the OCR so that it
>> reads more accurately would be helpful. I can't post full data and images
>> here (they're VINs, so I'd need permission to do so), but I can say that
>> in one instance WM is coming back as 6W6M, and in another the digits
>> 67258 are coming back as 572S5.
>>
>> Any guidance would be appreciated. I'll provide whatever information I
>> can.
>>
>> Thanks!

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWoMKQg7enZUxOBfe35fCthkMOLvA6MmnwtqnuiFjacEw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
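
For what it's worth, the two preferences in the quoted question (exactly 17 characters, last 5 preferably digits) can also be applied after OCR, independent of user-patterns. Below is a rough Python sketch of that idea. It is not a Tesseract feature or API; the names `score_candidate` and `pick_best` are made up for illustration, the alphabet is assumed from the description above (A-Z without I and O, plus 0-9), and the sample strings are invented, not real data.

```python
import re

# Assumed alphabet: A-Z without I and O, plus digits (taken from the question).
VIN_ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ0123456789"
VIN_LENGTH = 17
VIN_RE = re.compile(f"[{VIN_ALPHABET}]{{{VIN_LENGTH}}}")


def normalize(text):
    """Drop all whitespace and upper-case the OCR output."""
    return "".join(text.split()).upper()


def score_candidate(text):
    """Return None if text cannot be a VIN under the assumed rules,
    otherwise the number of digits among the last five characters (0-5)."""
    cleaned = normalize(text)
    if not VIN_RE.fullmatch(cleaned):
        return None  # wrong length or a character outside the alphabet
    return sum(ch.isdigit() for ch in cleaned[-5:])


def pick_best(candidates):
    """Among candidates that pass the hard checks, prefer (but do not
    require) the one whose last five characters contain the most digits."""
    valid = []
    for cand in candidates:
        score = score_candidate(cand)
        if score is not None:
            valid.append((score, normalize(cand)))
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    # Invented OCR alternatives for one image, e.g. from different
    # preprocessing or page segmentation settings.
    print(pick_best(["1HGCM82633A0043S2", "1hgcm82633a004352", "6W6M"]))
```

The length and alphabet checks are hard filters, while the digit count is only a ranking, so a candidate with a letter in the last five positions is still accepted when nothing better is available.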

