The --user-words option does not currently work in Tesseract 4.

On Wed, Jul 10, 2019 at 7:59 PM David Novak <[email protected]> wrote:

>
> Hello,
>
> I have a custom list of words that I'd like to add to (or practically
> substitute for) the default word list in my language. Some of these words
> combine letters & digits & punctuation e.g.
> 0.5KG
> 0.5L
> 1.1L
> 1.25KG
> 108G
> 4DOG
>
> I'm using tesseract 4.0. My approach so far:
>  - unpack lang.traineddata
>  - create cus.lstm-word-dawg  (either just from my wordlist or as
> combination of standard language list + my list)
>  - create new .traineddata from cus.lstm cus.lstm-recoder
> cus.lstm-unicharset cus.lstm-word-dawg cus.traineddata
>
> It has practically no effect... Often, a word that actually is in the list
> is recognized wrongly as some string that is not in the list.
>
> I have tried to add these words using --user-words <mylist.txt>: no
> effect, or the same as my approach
> I have tried -c language_model_penalty_non_dict_word=1.0  (I thought it
> would limit the output to words in cus.lstm-word-dawg): no effect
>
> I'm out of ideas after two weeks of trying. Any tips, please?
>
> Thanks
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/5b015d58-9958-4c1f-a330-abdb001f7957%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/5b015d58-9958-4c1f-a330-abdb001f7957%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
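
For reference, the unpack / build dawg / recombine steps described in the
quoted message can be sketched roughly as below. This is only an
illustration, not a confirmed fix: it assumes the Tesseract training tools
(combine_tessdata, wordlist2dawg) are installed, and the names "work",
"cus", "lang.traineddata" and "mylist.txt" are placeholders. The script
only builds and prints the command lines; it does not run them.

```python
import shlex


def dawg_rebuild_commands(src_traineddata, wordlist, workdir="work", lang="cus"):
    """Return argv lists for: unpack model -> build word dawg -> recombine.

    All paths are hypothetical placeholders; nothing is executed here.
    """
    prefix = f"{workdir}/{lang}."  # e.g. "work/cus."
    return [
        # 1. Unpack every component of the source model to work/cus.*
        ["combine_tessdata", "-u", src_traineddata, prefix],
        # 2. Compile the word list into a dawg against the LSTM unicharset
        ["wordlist2dawg", wordlist,
         prefix + "lstm-word-dawg", prefix + "lstm-unicharset"],
        # 3. Recombine all work/cus.* components into work/cus.traineddata
        ["combine_tessdata", prefix],
    ]


if __name__ == "__main__":
    for argv in dawg_rebuild_commands("lang.traineddata", "mylist.txt"):
        print(shlex.join(argv))
```

Building the word dawg against the lstm-unicharset (rather than the legacy
unicharset) is the assumption here, since the goal is to replace the
lstm-word-dawg component. As the quoted message reports, this rebuild may
still have little visible effect on recognition.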


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

