--user-words does not currently work in Tesseract 4.

On Wed, Jul 10, 2019 at 7:59 PM David Novak <[email protected]> wrote:
>
> Hello,
>
> I have a custom list of words that I'd like to add to (or practically
> substitute for) the default word list in my language. Some of these words
> combine letters & digits & punctuation e.g.
> 0.5KG
> 0.5L
> 1.1L
> 1.25KG
> 108G
> 4DOG
>
> I'm using tesseract 4.0. My approach so far:
> - unpack lang.traineddata
> - create cus.lstm-word-dawg (either just from my wordlist or as
>   combination of standard language list + my list)
> - create new .traineddata from cus.lstm cus.lstm-recoder
>   cus.lstm-unicharset cus.lstm-word-dawg cus.traineddata
>
> It has practically no effect... Often, a word that actually is in the list
> is recognized wrongly as some string that is not in the list.
>
> I have tried to add these words using --user-words <mylist.txt>: no
> effect, or the same as my approach
> I have tried -c language_model_penalty_non_dict_word=1.0 (I thought it
> would limit the output to words in cus.lstm-word-dawg): no effect
>
> I'm out of ideas after two weeks of trying. Any tips, please?
>
> Thanks
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/5b015d58-9958-4c1f-a330-abdb001f7957%40googlegroups.com
> For more options, visit https://groups.google.com/d/optout.
>

--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
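For anyone following the thread, the steps described in the quoted message can be sketched roughly as the command lines below. This is only an illustration under assumptions: the helper names `rebuild_commands` and `ocr_command`, the base language `eng`, and the file names `mylist.txt` and `page.png` are hypothetical, and the script only builds and prints the commands rather than running them. The tools and options themselves (`combine_tessdata -u`, `wordlist2dawg`, `--user-words`, `-c`) are the standard Tesseract ones mentioned or implied in the message.

```python
import shlex
from typing import List


def rebuild_commands(lang: str, custom: str, wordlist: str) -> List[List[str]]:
    """Build (but do not run) the commands for the approach in the quoted message.

    Assumption: components are unpacked directly under the custom prefix
    ("cus."), so the repacked file is named <custom>.traineddata.
    """
    prefix = f"{custom}."
    return [
        # 1. Unpack lang.traineddata into cus.lstm, cus.lstm-unicharset, ...
        ["combine_tessdata", "-u", f"{lang}.traineddata", prefix],
        # 2. Compile the plain-text word list into the LSTM word DAWG,
        #    using the unicharset that was just unpacked.
        ["wordlist2dawg", wordlist, f"{prefix}lstm-word-dawg",
         f"{prefix}lstm-unicharset"],
        # 3. Repack every component with the "cus." prefix into cus.traineddata.
        ["combine_tessdata", prefix],
    ]


def ocr_command(image: str, custom: str, wordlist: str) -> List[str]:
    """Build the OCR call with the two options the poster tried.

    Note: per the reply in this thread, --user-words currently has no
    effect in Tesseract 4, so this only reproduces what was attempted.
    """
    return [
        "tesseract", image, "stdout",
        "-l", custom,
        "--user-words", wordlist,
        "-c", "language_model_penalty_non_dict_word=1.0",
    ]


if __name__ == "__main__":
    for cmd in rebuild_commands("eng", "cus", "mylist.txt"):
        print(shlex.join(cmd))
    print(shlex.join(ocr_command("page.png", "cus", "mylist.txt")))
```

Printing the commands instead of executing them makes it easy to check each step by hand before touching any traineddata files.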

