This means that the official traineddata was not trained with some of the
characters that appear in your training text.
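
The "Failure bytes" in your log are UTF-8 bytes printed as sign-extended
hex, so every byte above 0x7f shows up as ffffffXX. Comparing them with the
"Can't encode transcription" lines in your log, each dump is the tail of the
transcription, which suggests that it starts at the first character that
could not be encoded. A minimal Python sketch for decoding such a dump
(decode_failure_bytes is a hypothetical helper name, not part of Tesseract):

```python
def decode_failure_bytes(dump):
    """Turn a sign-extended hex dump from the log back into text."""
    # Bytes >= 0x80 appear as ffffffXX in the log; masking with 0xFF
    # recovers the original byte value.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8", errors="replace")


# Start of the first dump in the quoted log.
dump = ("ffffffe6 ffffffbc ffffffa9 ffffffe6 ffffffb6 ffffffa1 20 "
        "ffffffe6 ffffffa1 ffffff80 ffffffe9 ffffffaa ffffff9c 54 58")
text = decode_failure_bytes(dump)
print(text)     # 漩涡 桀骜TX
print(text[0])  # 漩
```

Here the dump begins with 漩, so that is a good first candidate to look for
in the unicharset.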

One way to verify this is to unpack the traineddata with combine_tessdata -u
and look at the lstm-unicharset file it contains.
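
After unpacking (for example with something like
"combine_tessdata -u chi_sim.traineddata chi_sim.", which should write
chi_sim.lstm-unicharset next to the other components), a short script can
list the characters of the training text that the unicharset lacks. This is
only a sketch: it assumes the usual unicharset layout (entry count on the
first line, then one entry per line with the character as the first
space-separated field, and the space character written as NULL), and it
compares single code points, which is reasonable for Chinese but not for
every script. The function names and the file names passed on the command
line are placeholders.

```python
import sys
from pathlib import Path


def parse_unicharset(text):
    """Return the set of unichars listed in a unicharset file's text."""
    lines = text.split("\n")
    count = int(lines[0].strip())  # first line holds the number of entries
    unichars = set()
    for line in lines[1:1 + count]:
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        # Assumption: the space character is stored as the literal NULL.
        unichars.add(" " if token == "NULL" else token)
    return unichars


def missing_chars(text, unichars):
    """Return the sorted non-whitespace characters that are not covered."""
    return sorted({ch for ch in text
                   if not ch.isspace() and ch not in unichars})


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_unicharset.py <lstm-unicharset> <training_text>
    known = parse_unicharset(Path(sys.argv[1]).read_text(encoding="utf-8"))
    truth = Path(sys.argv[2]).read_text(encoding="utf-8")
    print("".join(missing_chars(truth, known)))
```

Any character it prints is one that the official model cannot encode, so
lines containing it will fail in lstmeval and in fine-tuning unless they are
dropped from the training text or the unicharset is extended.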

On Fri, 16 Aug 2019, 10:52 Jisong Xie, <[email protected]> wrote:

> This problem has upset me for a few days!
>
> I use tesstrain.sh to generate the training and eval files, e.g.:
> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim \
>   --linedata_only --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata --output_dir ~/tesstraining/chi_simtrain \
>   --fontlist 'STFangsong' 'NotoSerifCJKjp-ExtraLight'
>
> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim \
>   --linedata_only --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata --output_dir ~/tesstraining/chi_simeval \
>   --fontlist "STFangsong"
>
> I can successfully train from scratch and evaluate the output checkpoint.
> However, evaluation fails when I use the officially provided traineddata.
> I am training for Simplified Chinese.
> The command is:
>
>> training/lstmeval --model ../tessdata/chi_sim.traineddata \
>>   --eval_listfile ~/tessnew/chi_simeval/chi_sim.training_files.txt
>>
> The model was downloaded from tessdata_best, and the eval_listfile was
> generated by tesstrain.sh as above.
>
> The error is as follows:
> Truth:救市学校武汉 BLOG信息 www正式灸 姨外科楂洗浴过敏卢斐 Resources 儿需要过
> OCR  :救市学校武汉 BLOG信息 www正式灸 姨外科楂洗浴过敏卢斐 Resources 儿需要过
> Encoding of string failed! Failure bytes: ffffffe6 ffffffbc ffffffa9
> ffffffe6 ffffffb6 ffffffa1 20 ffffffe6 ffffffa1 ffffff80 ffffffe9 ffffffaa
> ffffff9c 54 58 ffffffe5 ffffff90 ffffffaf ffffffe5 ffffff8a ffffffa8 2e
> ffffffe7 ffffffa7 ffffff8d ffffffe6 ffffffa4 ffffff8d ffffffe5 ffffff9d
> ffffff91 ffffffe9 ffffff81 ffffff93 30 33 20 ffffffe5 ffffff81 ffffff9a 20
> ffffffe9 ffffff92 ffffff88 ffffffe7 ffffff81 ffffffb8 20 ffffffe9 ffffff81
> ffffffad ffffffe6 ffffffae ffffff83 20 38 31 3b ffffffe8 ffffff80 ffffff81
> ffffffe7 ffffff88 ffffffb7 ffffffe5 ffffff86 ffffff85 ffffffe5 ffffffae
> ffffffb9
> Can't encode transcription: '啊时3全新跑送防伪漩涡 桀骜TX启动.种植坑道03 做 针灸 遭殃 81;老爷内容' in
> language ''
> Encoding of string failed! Failure bytes: ffffffe8 ffffffb8 ffffff9d
> ffffffe5 ffffff8a ffffffa8 ffffffe6 ffffff80 ffffff81 20 ffffffe5 ffffff9e
> ffffff84 ffffffe6 ffffff96 ffffffad 31 ffffffe5 ffffff85 ffffffa8 ffffffe9
> ffffff9d ffffffa2 34 20 ffffffe8 ffffff84 ffffff82 ffffffe8 ffffff82
> ffffffaa 20 5b ffffffe9 ffffff80 ffffff89 ffffffe9 ffffffa1 ffffffb9 20 42
> 6f 6f 6b 6d 61 72 6b 20 3a ffffffe6 ffffffad ffffffa4 ffffffe5 ffffff9f
> ffffffba ffffffe7 ffffff9d ffffffa3 ffffffe5 ffffffbe ffffffb7 ffffffe6
> ffffffba ffffff90 ffffffe7 ffffffa0 ffffff81 55 6e 69 76 65 72 73 69 74 79
> 20 ffffffe6 ffffffb3 ffffffb0 ffffffe9 ffffff93 ffffffa2 43 2b 2b
> Can't encode transcription: '内踝动态 垄断1全面4 脂肪 [选项 Bookmark :此基督德源码University
> 泰铢C++' in language ''
> Encoding of string failed! Failure bytes: ffffffe5 ffffff97 ffffffa6 20
> ffffffe6 ffffffa0 ffffffbc ffffffe5 ffffffae ffffffbd ffffffe6 ffffff81
> ffffff95 ffffffe6 ffffffb1 ffffffa4 ffffffe7 ffffff82 ffffffb3 ffffffe6
> ffffff9d ffffff83
> Can't encode transcription: '上海 汇总顶部网站?Effect 合同do 外国共振页 (铝箔J彼此嗦 格宽恕汤炳权'
> in language ''
> Encoding of string failed! Failure bytes: ffffffe8 ffffff9e ffffffaf 29 20
> ffffffe8 ffffff85 ffffffa5 34 38 20 ffffffe8 ffffff81 ffffff94 ffffffe7
> ffffffb3 ffffffbb 43 ffffffe6 ffffffb5 ffffffb7 ffffffe9 ffffff9a ffffff90
> ffffffe7 ffffff9e ffffff92 ffffffe5 ffffff87 ffffffbb ffffffe4 ffffffb8
> ffffff80 ffffffe5 ffffff8f ffffff8d ffffffe5 ffffffba ffffff94 ffffffe8
> ffffffaf ffffffb4 ffffffe6 ffffff98 ffffff8e 20 ffffffe4 ffffffba ffffffa4
> ffffffe6 ffffff98 ffffff93 32 34 20 ffffffe7 ffffff9b ffffffaf ffffffe6
> ffffffa2 ffffffa2 ffffffe5 ffffffbb ffffff96 20 ffffffe8 ffffff81 ffffff98
> ffffffe9 ffffffb9 ffffff8a ffffffe6 ffffffa1 ffffffa5
> Can't encode transcription: '工具免责羌螯) 腥48 联系C海隐瞒击一反应说明 交易24 盯梢廖 聘鹊桥' in
> language ''
> Truth:有效期也有瘤AZ -04 09尤其火炬and 05酰或温度他乡意网友 用户 数量2 处 ICP
> OCR  :有效期也有瘤AZ -04 09尤其火炬and 05酰或温度他乡意网友 用户 数量2 处 ICE
> Encoding of string failed! Failure bytes: ffffffe6 ffffff9a ffffff84 20
> ffffffe6 ffffff96 ffffffbd ffffffe8 ffffffa1 ffffff8c ffffffef ffffffbc
> ffffff8c ffffffe5 ffffff9c ffffffa8 ffffffe7 ffffffba ffffffbf ffffffe5
> ffffff8d ffffff81 ffffffe5 ffffffad ffffff97 ffffffe6 ffffff9e ffffffb6
> ffffffe5 ffffff9f ffffff95 30 33 42 61 73 65 64 20 ffffffe7 ffffffbd
> ffffff91 ffffffe9 ffffffa1 ffffffb5 20 ffffffe5 ffffffad ffffff97 ffffffe8
> ffffffb0 ffffff9c ffffffe5 ffffffa4 ffffffaf ffffffe6 ffffff82 ffffffac
> ffffffe8 ffffff87 ffffff82 ffffffe6 ffffff89 ffffff8d ffffffe8 ffffff83
> ffffffbd ffffffe9 ffffff9c ffffff80 ffffffe9 ffffffb2 ffffff86 ffffffe8
> ffffff81 ffffff94 ffffffe8 ffffffb0 ffffff8a
> Can't encode transcription: '旗下的排名暄 施行,在线十字架埕03Based 网页 字谜夯悬臂才能需鲆联谊' in
> language ''
> Encoding of string failed! Failure bytes: ffffffe6 ffffffb6 ffffff8e
> ffffffe9 ffffff82 ffffff8b 20 ffffffe6 ffffff9c ffffffba ffffffe5 ffffff9c
> ffffffba 2e 20 ffffffe5 ffffff8d ffffff87 ffffffe7 ffffffba ffffffa7 20
> ffffffe6 ffffff8b ffffffa3 20 ffffffe9 ffffffad ffffff8d ffffffe6 ffffff9d
> ffffffa5 ffffffe5 ffffffb4 ffffff83 ffffffe5 ffffffa5 ffffff84 20 ffffffe6
> ffffff8c ffffff81 ffffffe7 ffffffbb ffffffad ffffffe5 ffffff85 ffffffb1
> ffffffe6 ffffff9c ffffff89 3e 20 ffffffe9 ffffff80 ffffff9a ffffffe6
> ffffff8a ffffffa5 2d
> Can't encode transcription: '出租机关教程峪惨绝人寰涎邋 机场. 升级 拣 魍来崃奄 持续共有> 通报-' in
> language ''
> Truth:矢量 监测vip绀贮存登记2003卑时髦局长 客服 .库原创网络情况。傻瓜12与
> OCR  :入量 监测vip绀贮存登记2003卑时紧局长 客服 .库原创网络情况。俊瓜12与
> Encoding of string failed! Failure bytes: ffffffe8 ffffff8b ffffffa3
> ffffffe4 ffffffb8 ffffffb4 ffffffe5 ffffffba ffffff8a ffffffe8 ffffffbf
> ffffff99 ffffffe5 ffffff8c ffffff88 ffffffe7 ffffff89 ffffff99 ffffffe5
> ffffff88 ffffffa9 20 ffffffe8 ffffff8f ffffff9c ffffffe7 ffffff95 ffffffa6
> ffffffe6 ffffffa0 ffffff97 ffffffe8 ffffffae ffffffb2 ffffffe8 ffffffaf
> ffffff9d 20 ffffffe5 ffffffb4 ffffffbd ffffffe5 ffffffad ffffff90 ffffffe5
> ffffff92 ffffff8c ffffffe8 ffffffb0 ffffff90 31 34 ffffffe9 ffffff80
> ffffff9a ffffffe7 ffffff89 ffffff92 33 38 ffffffe6 ffffff9c ffffff80
> ffffffe7 ffffffba ffffffba ffffffe7 ffffffbb ffffff87 ffffffe5 ffffff93
> ffffff81 ffffffe6 ffffff8e ffffffa5 ffffffe5 ffffff8f ffffff97 ffffffe6
> ffffff88 ffffff91 ffffffe4 ffffffbb ffffffac 20 ffffffe6 ffffff96 ffffffb9
> ffffffe4 ffffffbe ffffffbf
>
> I have googled a lot and learned that the problem might be with the
> unicharset, but I don't know how to solve it.
>
> In addition, when I fine-tune this traineddata (from tessdata_best) with
> training files that were also generated by tesstrain.sh, it reports the
> same error.
>
> Can anyone help me? Thanks!

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUrxgpor%2B8gEib6R3cqcUt9Wwqy9jbtZPgsYa0v3AHgqw%40mail.gmail.com.
