This problem has upset me for a few days! 

I use tesstrain.sh to generate training files and eval files. e.g.
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim 
--linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstraining/chi_simtrain 
--fontlist 'STFangsong' 'NotoSerifCJKjp-ExtraLight'

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim 
--linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstraining/chi_simeval 
--fontlist "STFangsong"

I can successfully training from scratch and eval the output checkpoint. 
However, when I eval with the official provided traineddata, it fails. I am 
training with chinese simplified language.
The command is:

> training/lstmeval --model ../tessdata/chi_sim.traineddata   
> --eval_listfile ~/tessnew/chi_simeval/chi_sim.training_files.txt
>
the model is downloaded from tessdata_best, the eval_listfile is generated 
from tesstrain.sh as above.

The error is as follow:
Truth:救市学校武汉 BLOG信息 www正式灸 姨外科楂洗浴过敏卢斐 Resources 儿需要过
OCR  :救市学校武汉 BLOG信息 www正式灸 姨外科楂洗浴过敏卢斐 Resources 儿需要过
Encoding of string failed! Failure bytes: ffffffe6 ffffffbc ffffffa9 
ffffffe6 ffffffb6 ffffffa1 20 ffffffe6 ffffffa1 ffffff80 ffffffe9 ffffffaa 
ffffff9c 54 58 ffffffe5 ffffff90 ffffffaf ffffffe5 ffffff8a ffffffa8 2e 
ffffffe7 ffffffa7 ffffff8d ffffffe6 ffffffa4 ffffff8d ffffffe5 ffffff9d 
ffffff91 ffffffe9 ffffff81 ffffff93 30 33 20 ffffffe5 ffffff81 ffffff9a 20 
ffffffe9 ffffff92 ffffff88 ffffffe7 ffffff81 ffffffb8 20 ffffffe9 ffffff81 
ffffffad ffffffe6 ffffffae ffffff83 20 38 31 3b ffffffe8 ffffff80 ffffff81 
ffffffe7 ffffff88 ffffffb7 ffffffe5 ffffff86 ffffff85 ffffffe5 ffffffae 
ffffffb9
Can't encode transcription: '啊时3全新跑送防伪漩涡 桀骜TX启动.种植坑道03 做 针灸 遭殃 81;老爷内容' in 
language ''
Encoding of string failed! Failure bytes: ffffffe8 ffffffb8 ffffff9d 
ffffffe5 ffffff8a ffffffa8 ffffffe6 ffffff80 ffffff81 20 ffffffe5 ffffff9e 
ffffff84 ffffffe6 ffffff96 ffffffad 31 ffffffe5 ffffff85 ffffffa8 ffffffe9 
ffffff9d ffffffa2 34 20 ffffffe8 ffffff84 ffffff82 ffffffe8 ffffff82 
ffffffaa 20 5b ffffffe9 ffffff80 ffffff89 ffffffe9 ffffffa1 ffffffb9 20 42 
6f 6f 6b 6d 61 72 6b 20 3a ffffffe6 ffffffad ffffffa4 ffffffe5 ffffff9f 
ffffffba ffffffe7 ffffff9d ffffffa3 ffffffe5 ffffffbe ffffffb7 ffffffe6 
ffffffba ffffff90 ffffffe7 ffffffa0 ffffff81 55 6e 69 76 65 72 73 69 74 79 
20 ffffffe6 ffffffb3 ffffffb0 ffffffe9 ffffff93 ffffffa2 43 2b 2b
Can't encode transcription: '内踝动态 垄断1全面4 脂肪 [选项 Bookmark :此基督德源码University 
泰铢C++' in language ''
Encoding of string failed! Failure bytes: ffffffe5 ffffff97 ffffffa6 20 
ffffffe6 ffffffa0 ffffffbc ffffffe5 ffffffae ffffffbd ffffffe6 ffffff81 
ffffff95 ffffffe6 ffffffb1 ffffffa4 ffffffe7 ffffff82 ffffffb3 ffffffe6 
ffffff9d ffffff83
Can't encode transcription: '上海 汇总顶部网站?Effect 合同do 外国共振页 (铝箔J彼此嗦 格宽恕汤炳权' in 
language ''
Encoding of string failed! Failure bytes: ffffffe8 ffffff9e ffffffaf 29 20 
ffffffe8 ffffff85 ffffffa5 34 38 20 ffffffe8 ffffff81 ffffff94 ffffffe7 
ffffffb3 ffffffbb 43 ffffffe6 ffffffb5 ffffffb7 ffffffe9 ffffff9a ffffff90 
ffffffe7 ffffff9e ffffff92 ffffffe5 ffffff87 ffffffbb ffffffe4 ffffffb8 
ffffff80 ffffffe5 ffffff8f ffffff8d ffffffe5 ffffffba ffffff94 ffffffe8 
ffffffaf ffffffb4 ffffffe6 ffffff98 ffffff8e 20 ffffffe4 ffffffba ffffffa4 
ffffffe6 ffffff98 ffffff93 32 34 20 ffffffe7 ffffff9b ffffffaf ffffffe6 
ffffffa2 ffffffa2 ffffffe5 ffffffbb ffffff96 20 ffffffe8 ffffff81 ffffff98 
ffffffe9 ffffffb9 ffffff8a ffffffe6 ffffffa1 ffffffa5
Can't encode transcription: '工具免责羌螯) 腥48 联系C海隐瞒击一反应说明 交易24 盯梢廖 聘鹊桥' in 
language ''
Truth:有效期也有瘤AZ -04 09尤其火炬and 05酰或温度他乡意网友 用户 数量2 处 ICP
OCR  :有效期也有瘤AZ -04 09尤其火炬and 05酰或温度他乡意网友 用户 数量2 处 ICE
Encoding of string failed! Failure bytes: ffffffe6 ffffff9a ffffff84 20 
ffffffe6 ffffff96 ffffffbd ffffffe8 ffffffa1 ffffff8c ffffffef ffffffbc 
ffffff8c ffffffe5 ffffff9c ffffffa8 ffffffe7 ffffffba ffffffbf ffffffe5 
ffffff8d ffffff81 ffffffe5 ffffffad ffffff97 ffffffe6 ffffff9e ffffffb6 
ffffffe5 ffffff9f ffffff95 30 33 42 61 73 65 64 20 ffffffe7 ffffffbd 
ffffff91 ffffffe9 ffffffa1 ffffffb5 20 ffffffe5 ffffffad ffffff97 ffffffe8 
ffffffb0 ffffff9c ffffffe5 ffffffa4 ffffffaf ffffffe6 ffffff82 ffffffac 
ffffffe8 ffffff87 ffffff82 ffffffe6 ffffff89 ffffff8d ffffffe8 ffffff83 
ffffffbd ffffffe9 ffffff9c ffffff80 ffffffe9 ffffffb2 ffffff86 ffffffe8 
ffffff81 ffffff94 ffffffe8 ffffffb0 ffffff8a
Can't encode transcription: '旗下的排名暄 施行,在线十字架埕03Based 网页 字谜夯悬臂才能需鲆联谊' in 
language ''
Encoding of string failed! Failure bytes: ffffffe6 ffffffb6 ffffff8e 
ffffffe9 ffffff82 ffffff8b 20 ffffffe6 ffffff9c ffffffba ffffffe5 ffffff9c 
ffffffba 2e 20 ffffffe5 ffffff8d ffffff87 ffffffe7 ffffffba ffffffa7 20 
ffffffe6 ffffff8b ffffffa3 20 ffffffe9 ffffffad ffffff8d ffffffe6 ffffff9d 
ffffffa5 ffffffe5 ffffffb4 ffffff83 ffffffe5 ffffffa5 ffffff84 20 ffffffe6 
ffffff8c ffffff81 ffffffe7 ffffffbb ffffffad ffffffe5 ffffff85 ffffffb1 
ffffffe6 ffffff9c ffffff89 3e 20 ffffffe9 ffffff80 ffffff9a ffffffe6 
ffffff8a ffffffa5 2d
Can't encode transcription: '出租机关教程峪惨绝人寰涎邋 机场. 升级 拣 魍来崃奄 持续共有> 通报-' in 
language ''
Truth:矢量 监测vip绀贮存登记2003卑时髦局长 客服 .库原创网络情况。傻瓜12与
OCR  :入量 监测vip绀贮存登记2003卑时紧局长 客服 .库原创网络情况。俊瓜12与
Encoding of string failed! Failure bytes: ffffffe8 ffffff8b ffffffa3 
ffffffe4 ffffffb8 ffffffb4 ffffffe5 ffffffba ffffff8a ffffffe8 ffffffbf 
ffffff99 ffffffe5 ffffff8c ffffff88 ffffffe7 ffffff89 ffffff99 ffffffe5 
ffffff88 ffffffa9 20 ffffffe8 ffffff8f ffffff9c ffffffe7 ffffff95 ffffffa6 
ffffffe6 ffffffa0 ffffff97 ffffffe8 ffffffae ffffffb2 ffffffe8 ffffffaf 
ffffff9d 20 ffffffe5 ffffffb4 ffffffbd ffffffe5 ffffffad ffffff90 ffffffe5 
ffffff92 ffffff8c ffffffe8 ffffffb0 ffffff90 31 34 ffffffe9 ffffff80 
ffffff9a ffffffe7 ffffff89 ffffff92 33 38 ffffffe6 ffffff9c ffffff80 
ffffffe7 ffffffba ffffffba ffffffe7 ffffffbb ffffff87 ffffffe5 ffffff93 
ffffff81 ffffffe6 ffffff8e ffffffa5 ffffffe5 ffffff8f ffffff97 ffffffe6 
ffffff88 ffffff91 ffffffe4 ffffffbb ffffffac 20 ffffffe6 ffffff96 ffffffb9 
ffffffe4 ffffffbe ffffffbf

I have google a lot, knowing that it might be the problem with unicharset, 
but I don't know how to solve it.

In addition, when I finetune this traineddata(from tessdata_best), with 
training files also generated with tesstrain.sh, it also report this bug.

Anyone who can help me? thanks!

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/927deb67-4d7d-45d9-b5ab-091702c012db%40googlegroups.com.

Reply via email to