This means that the official traineddata was not trained on some of the characters that appear in your training text.
One way to verify this is to run the combine_tessdata command with -u to unpack the traineddata file and then look at the resulting lstm-unicharset.

On Fri, 16 Aug 2019, 10:52 Jisong Xie, <[email protected]> wrote:

> This problem has upset me for a few days!
>
> I use tesstrain.sh to generate training files and eval files. e.g.
> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim
> --linedata_only \
> --noextract_font_properties --langdata_dir ../langdata \
> --tessdata_dir ./tessdata --output_dir ~/tesstraining/chi_simtrain
> --fontlist 'STFangsong' 'NotoSerifCJKjp-ExtraLight'
>
> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim
> --linedata_only \
> --noextract_font_properties --langdata_dir ../langdata \
> --tessdata_dir ./tessdata --output_dir ~/tesstraining/chi_simeval
> --fontlist "STFangsong"
>
> I can successfully training from scratch and eval the output checkpoint.
> However, when I eval with the official provided traineddata, it fails. I am
> training with chinese simplified language.
> The command is:
>
>> training/lstmeval --model ../tessdata/chi_sim.traineddata
>> --eval_listfile ~/tessnew/chi_simeval/chi_sim.training_files.txt
>>
> the model is downloaded from tessdata_best, the eval_listfile is generated
> from tesstrain.sh as above.
>
> The error is as follow:
> Truth:救市学校武汉 BLOG信息 www正式灸 姨外科楂洗浴过敏卢斐 Resources 儿需要过
> OCR :救市学校武汉 BLOG信息 www正式灸 姨外科楂洗浴过敏卢斐 Resources 儿需要过
> Encoding of string failed!
> Failure bytes: ffffffe6 ffffffbc ffffffa9
> ffffffe6 ffffffb6 ffffffa1 20 ffffffe6 ffffffa1 ffffff80 ffffffe9 ffffffaa
> ffffff9c 54 58 ffffffe5 ffffff90 ffffffaf ffffffe5 ffffff8a ffffffa8 2e
> ffffffe7 ffffffa7 ffffff8d ffffffe6 ffffffa4 ffffff8d ffffffe5 ffffff9d
> ffffff91 ffffffe9 ffffff81 ffffff93 30 33 20 ffffffe5 ffffff81 ffffff9a 20
> ffffffe9 ffffff92 ffffff88 ffffffe7 ffffff81 ffffffb8 20 ffffffe9 ffffff81
> ffffffad ffffffe6 ffffffae ffffff83 20 38 31 3b ffffffe8 ffffff80 ffffff81
> ffffffe7 ffffff88 ffffffb7 ffffffe5 ffffff86 ffffff85 ffffffe5 ffffffae
> ffffffb9
> Can't encode transcription: '啊时3全新跑送防伪漩涡 桀骜TX启动.种植坑道03 做 针灸 遭殃 81;老爷内容' in
> language ''
> Encoding of string failed! Failure bytes: ffffffe8 ffffffb8 ffffff9d
> ffffffe5 ffffff8a ffffffa8 ffffffe6 ffffff80 ffffff81 20 ffffffe5 ffffff9e
> ffffff84 ffffffe6 ffffff96 ffffffad 31 ffffffe5 ffffff85 ffffffa8 ffffffe9
> ffffff9d ffffffa2 34 20 ffffffe8 ffffff84 ffffff82 ffffffe8 ffffff82
> ffffffaa 20 5b ffffffe9 ffffff80 ffffff89 ffffffe9 ffffffa1 ffffffb9 20 42
> 6f 6f 6b 6d 61 72 6b 20 3a ffffffe6 ffffffad ffffffa4 ffffffe5 ffffff9f
> ffffffba ffffffe7 ffffff9d ffffffa3 ffffffe5 ffffffbe ffffffb7 ffffffe6
> ffffffba ffffff90 ffffffe7 ffffffa0 ffffff81 55 6e 69 76 65 72 73 69 74 79
> 20 ffffffe6 ffffffb3 ffffffb0 ffffffe9 ffffff93 ffffffa2 43 2b 2b
> Can't encode transcription: '内踝动态 垄断1全面4 脂肪 [选项 Bookmark :此基督德源码University
> 泰铢C++' in language ''
> Encoding of string failed! Failure bytes: ffffffe5 ffffff97 ffffffa6 20
> ffffffe6 ffffffa0 ffffffbc ffffffe5 ffffffae ffffffbd ffffffe6 ffffff81
> ffffff95 ffffffe6 ffffffb1 ffffffa4 ffffffe7 ffffff82 ffffffb3 ffffffe6
> ffffff9d ffffff83
> Can't encode transcription: '上海 汇总顶部网站?Effect 合同do 外国共振页 (铝箔J彼此嗦 格宽恕汤炳权'
> in language ''
> Encoding of string failed!
> Failure bytes: ffffffe8 ffffff9e ffffffaf 29 20
> ffffffe8 ffffff85 ffffffa5 34 38 20 ffffffe8 ffffff81 ffffff94 ffffffe7
> ffffffb3 ffffffbb 43 ffffffe6 ffffffb5 ffffffb7 ffffffe9 ffffff9a ffffff90
> ffffffe7 ffffff9e ffffff92 ffffffe5 ffffff87 ffffffbb ffffffe4 ffffffb8
> ffffff80 ffffffe5 ffffff8f ffffff8d ffffffe5 ffffffba ffffff94 ffffffe8
> ffffffaf ffffffb4 ffffffe6 ffffff98 ffffff8e 20 ffffffe4 ffffffba ffffffa4
> ffffffe6 ffffff98 ffffff93 32 34 20 ffffffe7 ffffff9b ffffffaf ffffffe6
> ffffffa2 ffffffa2 ffffffe5 ffffffbb ffffff96 20 ffffffe8 ffffff81 ffffff98
> ffffffe9 ffffffb9 ffffff8a ffffffe6 ffffffa1 ffffffa5
> Can't encode transcription: '工具免责羌螯) 腥48 联系C海隐瞒击一反应说明 交易24 盯梢廖 聘鹊桥' in
> language ''
> Truth:有效期也有瘤AZ -04 09尤其火炬and 05酰或温度他乡意网友 用户 数量2 处 ICP
> OCR :有效期也有瘤AZ -04 09尤其火炬and 05酰或温度他乡意网友 用户 数量2 处 ICE
> Encoding of string failed! Failure bytes: ffffffe6 ffffff9a ffffff84 20
> ffffffe6 ffffff96 ffffffbd ffffffe8 ffffffa1 ffffff8c ffffffef ffffffbc
> ffffff8c ffffffe5 ffffff9c ffffffa8 ffffffe7 ffffffba ffffffbf ffffffe5
> ffffff8d ffffff81 ffffffe5 ffffffad ffffff97 ffffffe6 ffffff9e ffffffb6
> ffffffe5 ffffff9f ffffff95 30 33 42 61 73 65 64 20 ffffffe7 ffffffbd
> ffffff91 ffffffe9 ffffffa1 ffffffb5 20 ffffffe5 ffffffad ffffff97 ffffffe8
> ffffffb0 ffffff9c ffffffe5 ffffffa4 ffffffaf ffffffe6 ffffff82 ffffffac
> ffffffe8 ffffff87 ffffff82 ffffffe6 ffffff89 ffffff8d ffffffe8 ffffff83
> ffffffbd ffffffe9 ffffff9c ffffff80 ffffffe9 ffffffb2 ffffff86 ffffffe8
> ffffff81 ffffff94 ffffffe8 ffffffb0 ffffff8a
> Can't encode transcription: '旗下的排名暄 施行,在线十字架埕03Based 网页 字谜夯悬臂才能需鲆联谊' in
> language ''
> Encoding of string failed!
> Failure bytes: ffffffe6 ffffffb6 ffffff8e
> ffffffe9 ffffff82 ffffff8b 20 ffffffe6 ffffff9c ffffffba ffffffe5 ffffff9c
> ffffffba 2e 20 ffffffe5 ffffff8d ffffff87 ffffffe7 ffffffba ffffffa7 20
> ffffffe6 ffffff8b ffffffa3 20 ffffffe9 ffffffad ffffff8d ffffffe6 ffffff9d
> ffffffa5 ffffffe5 ffffffb4 ffffff83 ffffffe5 ffffffa5 ffffff84 20 ffffffe6
> ffffff8c ffffff81 ffffffe7 ffffffbb ffffffad ffffffe5 ffffff85 ffffffb1
> ffffffe6 ffffff9c ffffff89 3e 20 ffffffe9 ffffff80 ffffff9a ffffffe6
> ffffff8a ffffffa5 2d
> Can't encode transcription: '出租机关教程峪惨绝人寰涎邋 机场. 升级 拣 魍来崃奄 持续共有> 通报-' in
> language ''
> Truth:矢量 监测vip绀贮存登记2003卑时髦局长 客服 .库原创网络情况。傻瓜12与
> OCR :入量 监测vip绀贮存登记2003卑时紧局长 客服 .库原创网络情况。俊瓜12与
> Encoding of string failed! Failure bytes: ffffffe8 ffffff8b ffffffa3
> ffffffe4 ffffffb8 ffffffb4 ffffffe5 ffffffba ffffff8a ffffffe8 ffffffbf
> ffffff99 ffffffe5 ffffff8c ffffff88 ffffffe7 ffffff89 ffffff99 ffffffe5
> ffffff88 ffffffa9 20 ffffffe8 ffffff8f ffffff9c ffffffe7 ffffff95 ffffffa6
> ffffffe6 ffffffa0 ffffff97 ffffffe8 ffffffae ffffffb2 ffffffe8 ffffffaf
> ffffff9d 20 ffffffe5 ffffffb4 ffffffbd ffffffe5 ffffffad ffffff90 ffffffe5
> ffffff92 ffffff8c ffffffe8 ffffffb0 ffffff90 31 34 ffffffe9 ffffff80
> ffffff9a ffffffe7 ffffff89 ffffff92 33 38 ffffffe6 ffffff9c ffffff80
> ffffffe7 ffffffba ffffffba ffffffe7 ffffffbb ffffff87 ffffffe5 ffffff93
> ffffff81 ffffffe6 ffffff8e ffffffa5 ffffffe5 ffffff8f ffffff97 ffffffe6
> ffffff88 ffffff91 ffffffe4 ffffffbb ffffffac 20 ffffffe6 ffffff96 ffffffb9
> ffffffe4 ffffffbe ffffffbf
>
> I have google a lot, knowing that it might be the problem with unicharset,
> but I don't know how to solve it.
>
> In addition, when I finetune this traineddata(from tessdata_best), with
> training files also generated with tesstrain.sh, it also report this bug.
>
> Anyone who can help me? thanks!
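P.S. Once combine_tessdata -u has unpacked the lstm-unicharset, comparing it with the training text by eye is tedious for a character set this large. Below is a minimal Python sketch of an automated comparison. It rests on assumptions rather than on anything in the Tesseract sources: that the first line of the unicharset file holds the entry count, that every following line starts with the unichar followed by a space, and that the entry NULL stands for the space character. The function names are made up for illustration. The sketch compares single code points only, so it ignores multi-character unicharset entries and whatever text normalization Tesseract applies before encoding; treat its output as a hint, not as an exact list.

```python
from pathlib import Path


def load_unicharset(path):
    """Return the set of unichars listed in an unpacked unicharset file.

    Assumed layout: line 1 is the entry count; each later line begins with
    the unichar, then a space, then its properties. "NULL" is taken to mean
    the space character.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    count = int(lines[0].split()[0])
    chars = set()
    for line in lines[1:count + 1]:
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        chars.add(" " if token == "NULL" else token)
    return chars


def missing_chars(text, unicharset):
    """Return the non-whitespace characters of text absent from unicharset."""
    return {ch for ch in text if not ch.isspace() and ch not in unicharset}
```

For example, missing_chars(Path("chi_sim.training_text").read_text(encoding="utf-8"), load_unicharset("chi_sim.lstm-unicharset")) would return the set of characters to look at first; both file names are placeholders for whatever your own setup produces.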

