I'm not sure whether user-words and/or character whitelists are supported 
by the LSTM engine (versions >= 4.00). Last I heard, they were only 
supported by the legacy engine (v3.x), selected with the --oem 0 option. 
Maybe someone can correct me if I'm wrong?
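
One way to check this would be to rerun your second command with the legacy 
engine selected. This is only a sketch, not a confirmed fix: it assumes your 
eng.traineddata actually contains the legacy model (the tessdata_fast and 
tessdata_best files are LSTM-only, so --oem 0 fails with them), and the 
paths are the placeholders from your message.

```shell
# Same invocation as in the original post, plus --oem 0 (legacy engine only).
# Assumption: eng.traineddata includes the legacy model, not just the LSTM one.
tesseract test.jpg stdout -l eng --oem 0 --user-words path/to/eng.user-words
```

If the output changes with --oem 0, that would suggest the word list is being 
ignored by the LSTM engine rather than misconfigured.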

On Monday, March 23, 2020 at 11:38:46 AM UTC+1, Natalia Zgirovskaya wrote:
>
> Hi all,
>
> I have an issue with providing a list of user words to tesseract. I use 
> Windows 10.
> Installed tesseract version:
>
> >tesseract.exe -v
> tesseract v5.0.0-alpha.20191030
>  leptonica-1.78.0
>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : 
> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>  Found AVX2
>  Found AVX
>  Found FMA
>  Found SSE
>  Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
>
> My test image:
>
> [image: test.jpg]
> I have an "eng.user-words" file in the directory with the traineddata 
> files that contains:
> B1adeb1ab1a
>
>
> Config file "bazaar" as follows:
> load_system_dawg     F 
> load_freq_dawg       F 
> user_words_file  path/to/eng.user-words 
> user_words_suffix user-words 
> language_model_penalty_non_freq_dict_word 1 
> language_model_penalty_non_dict_word 1
>
> Running this command
> "C:\Program Files\Tesseract-OCR\tesseract.exe" test.jpg stdout -l eng 
> bazaar
> gives "Bladeblabla" instead of "B1adeb1ab1a"
>
> As well as this command
> "C:\Program Files\Tesseract-OCR\tesseract.exe" test.jpg stdout -l eng 
> --user-words path/to/eng.user-words
> gives "Bladeblabla" instead of "B1adeb1ab1a"
>
>
>
> Where am I wrong?
>
