I'm not sure whether user words and/or character whitelists are supported by the LSTM engine (versions >= 4.00). Last I heard, they were only supported by the legacy engine (v3.x), selected with the --oem 0 option. Maybe someone can correct me if I'm wrong?
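If you want to try the legacy engine, here is a rough sketch of how the command could be put together. It only builds and prints the command line and does not run Tesseract. The helper name `build_legacy_command` is made up for illustration, and the file paths are the ones from your message. I'm also assuming that the installed eng.traineddata actually contains the legacy model, since as far as I know --oem 0 fails without it.

```python
import shlex


def build_legacy_command(image, user_words, lang="eng", tesseract="tesseract"):
    """Return the argument list for a legacy-engine run with a user-words file.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    return [
        tesseract, image, "stdout",
        "-l", lang,
        "--oem", "0",                 # 0 selects the legacy (non-LSTM) engine
        "--user-words", user_words,   # same option used in the original command
    ]


cmd = build_legacy_command("test.jpg", "path/to/eng.user-words")
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(cmd))
```

On Windows, pass the full path to tesseract.exe as the `tesseract` argument. Whether the legacy engine then prefers "B1adeb1ab1a" is something I can't promise; this only shows which options I would try.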
On Monday, March 23, 2020 at 11:38:46 AM UTC+1, Natalia Zgirovskaya wrote:
>
> Hi all,
>
> I have an issue with providing list of user word to tesseract. I use Windows 10.
> Installed tesseract version:
>
> >tesseract.exe -v
> tesseract v5.0.0-alpha.20191030
> leptonica-1.78.0
> libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
> Found AVX2
> Found AVX
> Found FMA
> Found SSE
> Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
>
> My test image:
>
> [image: test.jpg]
>
> I have "eng.user-words" file in the directory with traindata files that contains:
> B1adeb1ab1a
>
> Config file "bazaar" as follow:
> load_system_dawg F
> load_freq_dawg F
> user_words_file path/to/eng.user-words
> user_words_suffix user-words
> language_model_penalty_non_freq_dict_word 1
> language_model_penalty_non_dict_word 1
>
> Running this command
> "C:\Program Files\Tesseract-OCR\tesseract.exe" test.jpg stdout -l eng bazaar
> gives "Bladeblabla" instead of "B1adeb1ab1a"
>
> As well as this command
> "C:\Program Files\Tesseract-OCR\tesseract.exe" test.jpg stdout -l eng --user-words path/to/eng.user-words
> gives "Bladeblabla" instead of "B1adeb1ab1a"
>
> Where am I wrong?

