Re: [tesseract-ocr] Change unicharset

Fanatico Thu, 12 Apr 2018 22:03:03 -0700

I already did it, but I keep getting this error on "training/tesstrain.sh":
No block overlapping textline: 가능한 튤립 첫 칼럼 절차 주 - 하기 말썽쟁이 같다 ㆍ 상품권 팁 |
No block overlapping textline: 겪은 덕숀 수 대 라이브 넥 ' 토론 게시판 １０ 헵번 등 관련 담뿍
No block overlapping textline: 자리 유통 댈 월 피부 쥬얼리 에 뿌찢 타겟 그룹 안팎일 똑똑한
No block overlapping textline: 카페 빅뱅 같이 에 전문 악화 심사 오픈 범죄 모르는 추측 협회 선택

Warning: properties incomplete for index 71 = ｌ
Warning: properties incomplete for index 153 = ，
Warning: properties incomplete for index 182 = ？
Warning: properties incomplete for index 313 = １
Warning: properties incomplete for index 314 = ０
Warning: properties incomplete for index 368 = ５
Warning: properties incomplete for index 579 = ］
Warning: properties incomplete for index 720 = －
Warning: properties incomplete for index 918 = ２
Warning: properties incomplete for index 941 = ￥
Warning: properties incomplete for index 969 = ＆

Other case Ｌ of ｌ is not in unicharset
Mirror 〔 of 〕 is not in unicharset
Mirror 】 of 【 is not in unicharset
Mirror ［ of ］ is not in unicharset
Mirror 「 of 」 is not in unicharset

And this error on "training/lstmeval":
Can't encode transcription: '泰1.5 愚4 鎢共和15 欹地 鯪 閒聊 叫價23 350.00 庴設備 經理. 
的學習56次華僑' in language ''
Encoding of string failed! Failure bytes: ffffffe5 ffffff8f ffffffaf 
ffffffe8 ffffff83 ffffffbd ffffffe6 ffffffac ffffffa1 ffffffe6 ffffff95 
ffffffb8 20 ffffffe9 ffffffb5 ffffff91 20 ffffffe7 ffffff80 ffffff86 20 3a 
2e 20 ffffffe6 ffffff88 ffffff91 32 35 4a 61 76 61 20 ffffffe6 ffffff8a 
ffffff8a 3e 20 ffffffe8 ffffffa9 ffffff95 ffffffe5 ffffff83 ffffffb9 20 
ffffffe5 ffffff86 ffffffa4 ffffffe5 ffffffae ffffffb6 5b 20 ffffffe6 
ffffff94 ffffff9d ffffffe5 ffffffbd ffffffb1 20 ffffffe9 ffffffa0 ffffff81 
ffffffe9 ffffff9d ffffffa2 20 ffffffe5 ffffff88 ffffff86 ffffffe9 ffffff9b 
ffffffa2 28 32 37 20 32 39 20 ffffffe7 ffffffac ffffffac 20 ffffffe7 
ffffff86 ffffffbe ffffffe7 ffffff86 ffffffb1
Can't encode transcription: ') 可能次數 鵑 瀆 :. 我25Java 把> 評價 冤家[ 攝影 頁面 分離(27 29 
第 熾熱' in language ''
Encoding of string failed! Failure bytes: ffffffe5 ffffff9f ffffffba 
ffffffe9 ffffff87 ffffff91 58 44 20 ffffffe5 ffffff85 ffffffa8 ffffffe6 
ffffff96 ffffff87 20 ffffffe7 ffffffb1 ffffff93 20 ffffffe9 ffffff97 
ffffff9c ffffffe9 ffffff96 ffffff89 ffffffe8 ffffffb9 ffffffbc ffffffe9 
ffffff91 ffffffb0 ffffffe5 ffffff8c ffffff99 ffffffe8 ffffff81 ffffffb7 29 
20 30 32 ffffffe7 ffffff82 ffffff96 ffffffe9 ffffffbe ffffff8d ffffffe6 
ffffff8d ffffffb2 ffffffe9 ffffffa2 ffffffa8 ffffffe5 ffffff85 ffffff83 
ffffffe9 ffffff89 ffffff86 ffffffe5 ffffffba ffffff8a ffffffe7 ffffffb7 
ffffff9a ffffffe4 ffffffb8 ffffff8a ffffffe9 ffffffb8 ffffff9e 20 ffffffe5 
ffffffaa ffffffbe ffffffe5 ffffff92 ffffff8c 53 75 7a 75 6b 69 20 ffffffe6 
ffffff89 ffffff80 ffffffef ffffffb9 ffffff9c 20 31 32

Here are the steps I followed. (I repeated these steps after each of the methods I 
reported in my first post.)

# Fine tuning kor - tesseract 4.0

Reference: 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

Remember that ```kor.traineddata``` has the line ``` ```

### My folder organization

```
~/projects
├── ocr
│   └── training
│       └── kortrain
│           ├── eval
│           ├── new_train
│           └── kor_from_full
├── tesseract
│   └── tessdata
├── tessdata_best
└── langdata
```

#### Korean fonts that come with macOS
AppleGothic weight=237
AppleGothic weight=237 Italic
AppleMyungjo
AppleMyungjo Italic

### fonts working
  "Arial Unicode MS" \
  "HCR Batang" \
  "Source Han Serif" \
  "Source Han Serif K" \
  "Source Han Serif SC" \

### fonts not working
  "210 Byulbitcha" \
  "210 Byulddongbyul" \
  "210 HaneuljungwonOTF" \
  "210 Misslee" \
  "210 Sangsangongjakso" \
  "210 Sunflower" \
  "Baekmuk Batang" \
  "Baekmuk Gulim" \
  "Baekmuk Dotum" \
  "DX아기사랑B" \
  "HanS" \
  "HL" \
  "NanumGothic Eco" \
  "NanumMyeongjo Eco" \
  "SangSangAnt" \
  "Typo_JeongJo" \
  "Typo_SSiMyungJo" \
  "UhBee Joker" \

### Steps

1 - Add these environment variables to your ~/.bash_profile
```
export PANGOCAIRO_BACKEND=fc
export TESSDATADIR=~/projects/tesseract/tessdata
export SCROLLVIEW_PATH=~/projects/tesseract/java
```

1 - Create the folders:
```
mkdir -p ~/projects/ocr/training/kortrain/new_train
mkdir -p ~/projects/ocr/training/kortrain/kor_from_full
mkdir -p ~/projects/ocr/training/kortrain/eval
```

2 - Create a new ```kor.training_text``` and include the lines of text that will be 
rendered into the training images. (The more, the better.)

Obs.: Since Korean can use some Chinese characters (hanja) I'm merging the 
```kor.training_text``` with the ```chi_tra.training_text```

Reference:
https://en.wikipedia.org/wiki/Hanja
https://www.howtostudykorean.com/hanja-unit-1-lessons-1-20/hanja-lesson-1/
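
The merge can be sketched as a small shell function. This is only an illustration: `merge_training_text` is a hypothetical helper name, the langdata paths in the example call are assumptions based on the folder layout above, and dropping blank and repeated lines is my own choice rather than something the Tesseract tools require.

```shell
# Hypothetical helper: concatenate two training texts into one file,
# skipping blank lines and lines that were already seen.
# The output file must not be one of the two inputs.
# Usage: merge_training_text FIRST SECOND OUTPUT
merge_training_text() {
  awk 'NF && !seen[$0]++' "$1" "$2" > "$3"
}

# Example call (paths are assumptions, adjust to your checkout):
# merge_training_text \
#   ~/projects/langdata/kor/kor.training_text \
#   ~/projects/langdata/chi_tra/chi_tra.training_text \
#   ~/projects/ocr/training/kortrain/new_train/kor.training_text
```

Keeping the first file's lines ahead of the second keeps the Korean text first in the merged file.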

3 - Save this new ```kor.training_text``` to 
```~/projects/ocr/training/kortrain/new_train```

4 - Download and install this font
https://www.archlinux.org/packages/extra/any/ttf-baekmuk/download/
For more fonts download here:
http://software.naver.com/software/fontList.nhn?categoryId=I0000000

Arial Unicode MS      - https://www.wfonts.com/font/arial-unicode-ms
HCR Batang            - http://www.fontpalace.com/font-details/Batang/
Source Han Serif      -
Source Han Serif K    - https://typekit.com/fonts/source-han-serif-korean
Source Han Serif SC   - https://typekit.com/fonts/source-han-serif-simplified-chinese

5 - Check that your fonts have been installed

```
# $HOME is used because bash does not expand ~ after "=" inside an option
text2image --list_available_fonts --fonts_dir="$HOME/Library/Fonts"
```

Obs.: Font dir on macOS can be: ~/Library/Fonts
                                /Library/Fonts/
                                /Network/Library/Fonts/
                                /System/Library/Fonts/
                                /System Folder/Fonts/
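
To decide whether a font belongs in the "working" list, the font listing can be saved to a file and searched for an exact name. A minimal sketch, assuming each listing line looks like `  0: Font Name`; `font_listed` is a hypothetical helper name, and the listing may be written to stderr, so the example call captures both streams.

```shell
# Hypothetical helper: succeed if FONT appears in a saved font listing.
# Assumes each listing line looks like "  0: Font Name".
# Usage: font_listed "Font Name" LISTING_FILE
font_listed() {
  awk -v name="$1" '
    { sub(/^ *[0-9]+: /, "") }        # drop the leading index, if present
    $0 == name { found = 1 }          # exact match on the remaining name
    END { exit found ? 0 : 1 }
  ' "$2"
}

# Example call:
# text2image --list_available_fonts --fonts_dir="$HOME/Library/Fonts" > fonts.txt 2>&1
# font_listed "HCR Batang" fonts.txt && echo "HCR Batang is available"
```

The exact match matters here because "Source Han Serif" is a prefix of "Source Han Serif K" and "Source Han Serif SC".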

6 - Create the new training file

```
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir ~/Library/Fonts \
  --lang kor \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/projects/langdata \
  --wordlist ~/projects/ocr/training/kortrain/kor.wordlist \
  --training_text ~/projects/ocr/training/kortrain/new_train/kor.training_text \
  --fontlist  "Arial Unicode MS" \
              "HCR Batang" \
              "Source Han Serif" \
              "Source Han Serif K" \
              "Source Han Serif SC" \
  --output_dir ~/projects/ocr/training/kortrain/new_train
```

Obs.: I'm already getting some errors:

```
No block overlapping textline: 가능한 튤립 첫 칼럼 절차 주 - 하기 말썽쟁이 같다 ㆍ 상품권 팁 |
No block overlapping textline: 겪은 덕숀 수 대 라이브 넥 ' 토론 게시판 １０ 헵번 등 관련 담뿍
No block overlapping textline: 자리 유통 댈 월 피부 쥬얼리 에 뿌찢 타겟 그룹 안팎일 똑똑한
No block overlapping textline: 카페 빅뱅 같이 에 전문 악화 심사 오픈 범죄 모르는 추측 협회 선택

Warning: properties incomplete for index 71 = ｌ
Warning: properties incomplete for index 153 = ，
Warning: properties incomplete for index 182 = ？
Warning: properties incomplete for index 313 = １
Warning: properties incomplete for index 314 = ０
Warning: properties incomplete for index 368 = ５
Warning: properties incomplete for index 579 = ］
Warning: properties incomplete for index 720 = －
Warning: properties incomplete for index 918 = ２
Warning: properties incomplete for index 941 = ￥
Warning: properties incomplete for index 969 = ＆

Other case Ｌ of ｌ is not in unicharset
Mirror 〔 of 〕 is not in unicharset
Mirror 】 of 【 is not in unicharset
Mirror ［ of ］ is not in unicharset
Mirror 「 of 」 is not in unicharset
```

7 - Create another ```kor.training_text``` and include some text. This one 
is going to be used to evaluate the training that we are going to do.

Obs.: Since Korean can use some Chinese characters (hanja) I'm merging the 
```chi_tra.training_text``` with the ```kor.training_text```

Reference:
https://en.wikipedia.org/wiki/Hanja
https://www.howtostudykorean.com/hanja-unit-1-lessons-1-20/hanja-lesson-1/

8 - Save this new ```kor.training_text``` to 
```~/projects/ocr/training/kortrain/eval```

9 - Create the new eval file

```
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir ~/Library/Fonts \
  --lang kor \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/projects/langdata \
  --training_text ~/projects/ocr/training/kortrain/eval/kor.training_text \
  --fontlist  "Arial Unicode MS" \
              "HCR Batang" \
              "Source Han Serif" \
              "Source Han Serif K" \
              "Source Han Serif SC" \
  --output_dir ~/projects/ocr/training/kortrain/eval
```

Obs.: I'm already getting some errors:

```
Mirror 」 of 「 is not in unicharset
Mirror 】 of 【 is not in unicharset
Other case q of Q is not in unicharset
Other case Z of z is not in unicharset
Mirror ﹛ of ﹜ is not in unicharset
Mirror < of > is not in unicharset
Setting script properties
Warning: properties incomplete for index 6 = ，
```
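
Before running the evaluation it may be worth confirming that the list file written by `tesstrain.sh` points at files that actually exist. A minimal sketch, assuming the list file contains one `.lstmf` path per line; `check_listfile` is a hypothetical helper name.

```shell
# Hypothetical helper: print every path named in a list file that does not
# exist, and return non-zero if any is missing or the list file is absent.
# Usage: check_listfile LISTFILE
check_listfile() {
  [ -f "$1" ] || { echo "missing list file: $1" >&2; return 1; }
  any_missing=0
  while IFS= read -r entry; do
    [ -n "$entry" ] || continue        # ignore empty lines
    if [ ! -f "$entry" ]; then
      echo "missing: $entry"
      any_missing=1
    fi
  done < "$1"
  return "$any_missing"
}

# Example call:
# check_listfile ~/projects/ocr/training/kortrain/eval/kor.training_files.txt
```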

10 - Test the current accuracy for these fonts

```
~/projects/tesseract/training/lstmeval \
  --model ~/projects/tessdata_best/kor.traineddata \
  --eval_listfile ~/projects/ocr/training/kortrain/eval/kor.training_files.txt
```

Obs.: most of the lines are returning these errors:

```
Can't encode transcription: '泰1.5 愚4 鎢共和15 欹地 鯪 閒聊 叫價23 350.00 庴設備 經理. 
的學習56次華僑' in language ''
Encoding of string failed! Failure bytes: ffffffe5 ffffff8f ffffffaf 
ffffffe8 ffffff83 ffffffbd ffffffe6 ffffffac ffffffa1 ffffffe6 ffffff95 
ffffffb8 20 ffffffe9 ffffffb5 ffffff91 20 ffffffe7 ffffff80 ffffff86 20 3a 
2e 20 ffffffe6 ffffff88 ffffff91 32 35 4a 61 76 61 20 ffffffe6 ffffff8a 
ffffff8a 3e 20 ffffffe8 ffffffa9 ffffff95 ffffffe5 ffffff83 ffffffb9 20 
ffffffe5 ffffff86 ffffffa4 ffffffe5 ffffffae ffffffb6 5b 20 ffffffe6 
ffffff94 ffffff9d ffffffe5 ffffffbd ffffffb1 20 ffffffe9 ffffffa0 ffffff81 
ffffffe9 ffffff9d ffffffa2 20 ffffffe5 ffffff88 ffffff86 ffffffe9 ffffff9b 
ffffffa2 28 32 37 20 32 39 20 ffffffe7 ffffffac ffffffac 20 ffffffe7 
ffffff86 ffffffbe ffffffe7 ffffff86 ffffffb1
Can't encode transcription: ') 可能次數 鵑 瀆 :. 我25Java 把> 評價 冤家[ 攝影 頁面 分離(27 29 
第 熾熱' in language ''
Encoding of string failed! Failure bytes: ffffffe5 ffffff9f ffffffba 
ffffffe9 ffffff87 ffffff91 58 44 20 ffffffe5 ffffff85 ffffffa8 ffffffe6 
ffffff96 ffffff87 20 ffffffe7 ffffffb1 ffffff93 20 ffffffe9 ffffff97 
ffffff9c ffffffe9 ffffff96 ffffff89 ffffffe8 ffffffb9 ffffffbc ffffffe9 
ffffff91 ffffffb0 ffffffe5 ffffff8c ffffff99 ffffffe8 ffffff81 ffffffb7 29 
20 30 32 ffffffe7 ffffff82 ffffff96 ffffffe9 ffffffbe ffffff8d ffffffe6 
ffffff8d ffffffb2 ffffffe9 ffffffa2 ffffffa8 ffffffe5 ffffff85 ffffff83 
ffffffe9 ffffff89 ffffff86 ffffffe5 ffffffba ffffff8a ffffffe7 ffffffb7 
ffffff9a ffffffe4 ffffffb8 ffffff8a ffffffe9 ffffffb8 ffffff9e 20 ffffffe5 
ffffffaa ffffffbe ffffffe5 ffffff92 ffffff8c 53 75 7a 75 6b 69 20 ffffffe6 
ffffff89 ffffff80 ffffffef ffffffb9 ffffff9c 20 31 32
Can't encode transcription: '基金XD 全文 籓 關閉蹼鑰匙職) 02炖龍捲風元鉆床線上鸞 媾和Suzuki 所﹜ 12' 
in language ''
Encoding of string failed! Failure bytes: ffffffe5 ffffffa5 ffffffa7 
ffffffe9 ffffff81 ffffff8b 20 68 74 74 70 30 35 ffffffe9 ffffff9b ffffffbb 
ffffffe5 ffffffad ffffff90 ffffffe9 ffffff8c ffffff84 ffffffe5 ffffffbd 
ffffffb1 20 ffffffe8 ffffffbe ffffffaf ffffffe8 ffffffab ffffff96 20 
ffffffe5 ffffff8d ffffff80 3e 20 ffffffe5 ffffff85 ffffffae 20 ffffffe6 
ffffff97 ffffffa9 ffffffe6 ffffff9c ffffff9f ffffffe5 ffffffbd ffffffa9 
ffffffe8 ffffff99 ffffffb9 ffffffe5 ffffffab ffffffb5 ffffffe5 ffffffaa 
ffffff9a 20 ffffffe8 ffffff9c ffffff98 ffffffe8 ffffff9b ffffff9b 20 
ffffffe5 ffffffa4 ffffff9a ffffffe5 ffffffaa ffffff92 ffffffe9 ffffffab 
ffffff94 ffffffe3 ffffff80 ffffff81 ffffffe9 ffffffac ffffffa9 52 65 73 65 
72 76 65 64 20 ffffffe5 ffffff98 ffffff89 31
Can't encode transcription: '奧運 http05電子錄影 辯論 區> 兮 早期彩虹嫵媚 蜘蛛 多媒體、鬩Reserved 
嘉1' in language ''
Encoding of string failed! Failure bytes: ffffffe5 ffffffad ffffffa3 
ffffffe6 ffffff95 ffffffb8 20 52 69 67 68 74 73 20 39 36 2e 20 ffffffe5 
ffffff8d ffffff80 ffffffe6 ffffff97 ffffff85 ffffffe7 ffffffa8 ffffff8b 20 
ffffffe7 ffffffb0 ffffff87 ffffffe7 ffffff94 ffffff9f ffffffe7 ffffffad 
ffffffb4 20 ffffffe5 ffffffb0 ffffff88 ffffffe6 ffffffab ffffff83 ffffffe5 
ffffffa8 ffffff81 ffffffe8 ffffff84 ffffff85 20 ffffffe7 ffffff8f ffffff8a 
ffffffe7 ffffff91 ffffff9a ffffffe9 ffffff85 ffffffa2 ffffffe9 ffffff96 
ffffff8b ffffffe5 ffffff95 ffffff9f 20 ffffffe7 ffffffb4 ffffffb0 ffffffe8 
ffffff8f ffffff8c ffffffe7 ffffffaf ffffff80 ffffffe8 ffffff82 ffffffa2 
ffffffe5 ffffff83 ffffff8f 32 31 20 32 37 ffffffe5 ffffff8d ffffff80
```
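
Each "Can't encode transcription" line seems to mean that the text contains a character the model's unicharset cannot represent. One way to see which characters are affected is to compare the text against the unicharset unpacked from the model with `combine_tessdata -u`. The sketch below rests on several assumptions: `missing_chars` is a hypothetical helper name, it expects a UTF-8 locale so that `grep -o` splits on characters rather than bytes, and it assumes the unicharset file has a count on its first line followed by one entry per line whose first field is the character.

```shell
# Hypothetical helper: list characters used in TEXT that have no entry in
# UNICHARSET. Assumes a UTF-8 locale and the unicharset layout described above.
# Usage: missing_chars TEXT UNICHARSET
missing_chars() {
  known=$(mktemp)
  used=$(mktemp)
  # First field of every line after the count line: characters the model knows.
  awk 'NR > 1 { print $1 }' "$2" | LC_ALL=C sort -u > "$known"
  # Every distinct non-space character that occurs in the text.
  grep -o '[^[:space:]]' "$1" | LC_ALL=C sort -u > "$used"
  # Keep lines that appear only in the second list: used but not known.
  LC_ALL=C comm -13 "$known" "$used"
  rm -f "$known" "$used"
}

# Example call (the unicharset path is an assumption):
# missing_chars \
#   ~/projects/ocr/training/kortrain/eval/kor.training_text \
#   ~/projects/ocr/tmp/kor.lstm-unicharset
```

Sorting and comparing in the C locale keeps distinct CJK characters from being collapsed by locale-specific collation.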

11 - Test the current accuracy for our training file

```
~/projects/tesseract/training/lstmeval \
  --model ~/projects/tesseract/tessdata/kor.traineddata \
  --eval_listfile ~/projects/ocr/training/kortrain/eval/kor.training_files.txt
```

Obs.: most of the lines return the same "Can't encode transcription" errors shown in step 10.

12 - Extract the LSTM model file from the main traineddata

```
~/projects/tesseract/training/combine_tessdata \
  -e ~/projects/tesseract/tessdata/kor.traineddata \
  ~/projects/ocr/training/kortrain/kor_from_full/kor.lstm
```

13 - Start the training removing the last layer

```
~/projects/tesseract/training/lstmtraining \
  --debug_interval -1 \
  --continue_from ~/projects/ocr/training/kortrain/kor_from_full/kor.lstm \
  --traineddata ~/projects/ocr/training/kortrain/new_train/kor/kor.traineddata \
  --append_index 5 \
  --net_spec '[Lfx256 O1c111]' \
  --model_output ~/projects/ocr/training/kortrain/kor_from_full/base \
  --train_listfile ~/projects/ocr/training/kortrain/new_train/kor.training_files.txt \
  --eval_listfile ~/projects/ocr/training/kortrain/eval/kor.training_files.txt \
  --max_iterations 3000 \
  &> ~/projects/ocr/training/kortrain/kor_from_full/basetrain.log
```

14 - Monitor the log on another console

```
 tail -f ~/projects/ocr/training/kortrain/kor_from_full/basetrain.log
```

15 - Validate the result

```
~/projects/tesseract/training/lstmeval \
  --model ~/projects/ocr/training/kortrain/kor_from_full/base_checkpoint \
  --traineddata ~/projects/ocr/training/kortrain/new_train/kor/kor.traineddata \
  --eval_listfile ~/projects/ocr/training/kortrain/eval/kor.training_files.txt
```

16 - Train some more

```
~/projects/tesseract/training/lstmtraining \
  --debug_interval 100 \
  --continue_from ~/projects/ocr/training/kortrain/kor_from_full/kor.lstm \
  --traineddata ~/projects/ocr/training/kortrain/new_train/kor/kor.traineddata \
  --model_output ~/projects/ocr/training/kortrain/kor_from_full/base \
  --train_listfile ~/projects/ocr/training/kortrain/new_train/kor.training_files.txt \
  --eval_listfile ~/projects/ocr/training/kortrain/eval/kor.training_files.txt \
  --max_iterations 60000 \
  &> ~/projects/ocr/training/kortrain/kor_from_full/basetrain.log
```

Obs.: It took a few hours to complete

What do these values mean?

```
Mean rms=0.406%, delta=0.328%, train=0.817%(2.416%), skip ratio=0%
```
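
To follow these numbers over the course of a run, the `train=` value and the value in parentheses can be pulled out of each log line. A minimal sketch; `train_errors` is a hypothetical helper name, and the pattern only matches lines shaped like the one above.

```shell
# Hypothetical helper: print "<train value> <value in parentheses>" for every
# log line of the form "... train=0.817%(2.416%), ...".
# Usage: train_errors LOGFILE
train_errors() {
  sed -n 's/.*train=\([0-9.]*\)%(\([0-9.]*\)%).*/\1 \2/p' "$1"
}

# Example call:
# train_errors ~/projects/ocr/training/kortrain/kor_from_full/basetrain.log
```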

Obs.: Setting ```--debug_interval``` to a value greater than 0 generates this error:

```
ScrollView: Waiting for server...
Connection error. Quitting ScrollView Server...
sh: line 0: kill: %1: no such job
ERROR: Could not parse int32_t from --continue_from
```
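
One thing that may be worth checking for the ScrollView part of this failure is whether `SCROLLVIEW_PATH` points at a directory that contains `ScrollView.jar`. This is a guess at a possible cause, not a confirmed diagnosis; `scrollview_jar_present` is a hypothetical helper name, and the jar file name is an assumption about how the viewer was built.

```shell
# Hypothetical helper: succeed only if DIR contains ScrollView.jar.
# Usage: scrollview_jar_present DIR
scrollview_jar_present() {
  [ -f "$1/ScrollView.jar" ]
}

# Example call with the path exported in ~/.bash_profile:
# scrollview_jar_present "$SCROLLVIEW_PATH" || echo "ScrollView.jar not found"
```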

On Friday, 13 April 2018 00:58:27 UTC-3, shree wrote:
>
> You cannot just overwrite the lstm.unicharset in a traineddata file; the 
> unicharset has to be in sync with the other files in it, i.e. lstm, dawgs, 
> recoder etc.
>
> >  I'm merging the ```kor.training_text``` with the 
> ```chi_tra.training_text``` for tests 
>
> You need to go through the complete training process after this. Only then 
> will both sets of characters be reflected in it. 
>
> You can try the add-a-layer training, with tessdata_best/kor.traineddata to 
> continue from.
>
> ShreeDevi
>
> On Fri, Apr 13, 2018 at 7:51 AM, Fanatico <fanati...@gmail.com> wrote:
>
>> I'm trying to add Chinese to my Korean charset, but I'm not able to do it.
>>
>> Obs.: Since Korean can use some Chinese characters (hanja) I'm merging 
>> the ```kor.training_text``` with the ```chi_tra.training_text``` for tests
>>
>> Reference:
>> https://en.wikipedia.org/wiki/Hanja
>> https://www.howtostudykorean.com/hanja-unit-1-lessons-1-20/hanja-lesson-1/
>>
>> I tried to use:
>> combine_tessdata -u ~/projects/tessdata_best/kor.traineddata 
>> ~/projects/ocr/tmp/kor.
>> combine_tessdata -o ~/projects/tesseract/tessdata/kor.traineddata 
>> ~/projects/ocr/tmp/kor.lstm-unicharset
>>
>> I tried to use this line on "training/tesstrain.sh":
>> --wordlist ~/projects/ocr/training/kortrain/kor.wordlist \
>>
>> and I tried to use this line in the "kor.config" file
>> tessedit_load_sublangs chi_tra
>>
>>
>> But all these failed, if I run "training/tesstrain.sh" and go to the 
>> "kor/kor.unicharset" file, it only contains the Korean charset and I get 
>> errors like these:
>> Other case Ｌ of ｌ is not in unicharset
>> Mirror 〔 of 〕 is not in unicharset
>> Mirror 】 of 【 is not in unicharset
>> Mirror ［ of ］ is not in unicharset
>> Mirror 「 of 」 is not in unicharset
>> Setting script properties
>> Warning: properties incomplete for index 71 = ｌ
>> Warning: properties incomplete for index 153 = ，
>> Warning: properties incomplete for index 182 = ？
>> Warning: properties incomplete for index 313 = １
>> Warning: properties incomplete for index 314 = ０
>> Warning: properties incomplete for index 368 = ５
>> Warning: properties incomplete for index 579 = ］
>> Warning: properties incomplete for index 720 = －
>> Warning: properties incomplete for index 918 = ２
>> Warning: properties incomplete for index 941 = ￥
>> Warning: properties incomplete for index 969 = ＆
>> Config file is optional, continuing...
>> Null char=2
>>
>> If I run a test with "training/lstmeval" on data that has Chinese and Korean 
>> characters:
>> ~/projects/tesseract/training/lstmeval \
>>   --model ~/projects/tesseract/tessdata/kor.traineddata \
>>   --eval_listfile 
>> ~/projects/ocr/training/kortrain/eval/kor.training_files.txt
>>
>> I get a lot of these errors:
>> Can't encode transcription: '文章輯旭攝影會員肥功能 桐獎功能 時可以麂榻榻米(瘋狂using 辛亥道具' in 
>> language ''
>> Encoding of string failed! Failure bytes: ffffffe6 ffffffa0 ffffffb4 
>> ffffffe6 ffffffaa ffffff80 ffffffe6 ffffffbd ffffff98 ffffffe7 ffffff9f 
>> ffffffb3 ffffffe5 ffffffb1 ffffffb9 ffffffe5 ffffffaf ffffffba ffffffe5 
>> ffffffbb ffffff9f ffffffe5 ffffffb3 ffffffbb 20 ffffffe7 ffffffa7 ffffff92 
>> ffffffe4 ffffffb8 ffffff89 ffffffe8 ffffff89 ffffffb2 ffffffe8 ffffff8f 
>> ffffffab 20 ffffffe6 ffffff98 ffffff9f ffffffe6 ffffff9c ffffff9f ffffffe4 
>> ffffffba ffffff94 ffffffe5 ffffff98 ffffffa7 43 44 ffffffe4 ffffffbd 
>> ffffffbf ffffffe7 ffffff94 ffffffa8 ffffffe6 ffffffb4 ffffffaa ffffffe7 
>> ffffff91 ffffff9e ffffffe9 ffffff9c ffffff99 ffffffe6 ffffff85 ffffffb3 
>> ffffffe5 ffffff8d ffffff94 ffffffe8 ffffffad ffffffb0 20 ffffffe6 ffffff84 
>> ffffff9f ffffffe5 ffffff98 ffffff86 32 37 ffffffe6 ffffff92 ffffffb3 20 
>> ffffffe6 ffffffb1 ffffff95 ffffffe5 ffffffb0 ffffffbe
>> Can't encode transcription: '栴檀潘石屹寺廟峻 秒三色菫 星期五嘧CD使用洪瑞霙慳協議 感嘆27撳 汕尾' in 
>> language ''
>> Encoding of string failed! Failure bytes: ffffffe5 ffffffad ffffffa2 
>> ffffffe5 ffffffad ffffff90 4c 56 20 ffffffe6 ffffffb7 ffffffb1 ffffffe5 
>> ffffff9c ffffffb3 20 ffffffe5 ffffff92 ffffff96 ffffffe5 ffffff95 ffffffa1 
>> 20 ffffffe4 ffffffb8 ffffff8a ffffffe7 ffffffb7 ffffff9a 20 ffffffe6 
>> ffffffa6 ffffffab 20 ffffffe9 ffffff83 ffffffad ffffffe6 ffffffb3 ffffff93 
>> ffffffe5 ffffffbf ffffff97 ffffffe6 ffffff92 ffffffac 20 28 ffffffe6 
>> ffffffb0 ffffff91 ffffffe5 ffffff9c ffffff8b ffffffe6 ffffff9b ffffff86 20 
>> ffffffe6 ffffffb7 ffffffa4 ffffffe7 ffffffa9 ffffff8d 47 55 43 43 49 30 38 
>> ffffffe5 ffffff87 ffffffba ffffffe6 ffffff88 ffffff96 ffffffe8 ffffff80 
>> ffffff85 ffffffe6 ffffff94 ffffffbf 7c 68 61 73
>> Can't encode transcription: '孢子LV 深圳 咖啡 上線 榫 郭泓志撬 (民國曆 淤積GUCCI08出或者政|has' 
>> in language ''
>> Encoding of string failed! Failure bytes: ffffffe5 ffffff88 ffffff97 
>> ffffffe8 ffffffa1 ffffffa8 ffffffe7 ffffff9a ffffff84 ffffffe3 ffffff80 
>> ffffff8f ffffffe9 ffffff86 ffffff8d ffffffe9 ffffff86 ffffff90 20 2d 
>> ffffffe4 ffffffb8 ffffff80 ffffffe5 ffffff85 ffffffb6 ffffffe9 ffffffa4 
>> ffffff98 ffffffe6 ffffffb3 ffffff95 ffffffe5 ffffff8b ffffff99 37 36 38 
>> ffffffe4 ffffffb9 ffffff9e ffffffe4 ffffffb8 ffffff90 ffffffe7 ffffff9e 
>> ffffffb3 ffffffe5 ffffffad ffffff94 ffffffe8 ffffffa9 ffffff95 ffffffe5 
>> ffffff88 ffffff86 20 ffffffe7 ffffff8b ffffffb8 20 4d 6f 6f 6e 4b 4f 52 45 
>> 41 20 ffffffe5 ffffff9d ffffff87 ffffffe7 ffffffa2 ffffff91 ffffffe5 
>> ffffff9c ffffff98 ffffffe9 ffffff9a ffffff8a 20 31 39 39 35 ffffffe8 
>> ffffffb6 ffffffba 20 ffffffe5 ffffff91 ffffff82
>> Can't encode transcription: '列表的』醍醐 -一其餘法務768乞丐瞳孔評分 狸 MoonKOREA 均碑團隊 
>> 1995趺 呂' in language ''
>> Encoding of string failed! Failure bytes: 73 20 ffffffe6 ffffffaf 
>> ffffff94 ffffffe5 ffffff96 ffffffbb ffffffe6 ffffff8f ffffffae ffffffe9 
>> ffffff9c ffffff8d ffffffe6 ffffff9a ffffff90 20 3a 20 ffffffe9 ffffff82 
>> ffffffa3 ffffffe6 ffffffa2 ffffff9d ffffffe6 ffffffac ffffffbe 20 46 72 69 
>> 65 6e 64 20 ffffffe8 ffffff9b ffffff94 ffffffe8 ffffff9f ffffffb2 ffffffe6 
>> ffffff89 ffffff80 ffffffe5 ffffff9c ffffffa8 ffffffe5 ffffff9c ffffffb0 
>> ffffffe9 ffffff96 ffffff8b ffffffe3 ffffff80 ffffff82 ffffffe7 ffffff84 
>> ffffffa1 ffffffe8 ffffff86 ffffffa8 ffffffe8 ffffff84 ffffffb9 ffffffe7 
>> ffffff95 ffffffbf 20 ffffffe6 ffffffad ffffffb9 ffffffe4 ffffffba ffffffba 
>> ffffffe7 ffffff97 ffffff8a ffffffe7 ffffff99 ffffff92 20 ffffffe9 ffffff86 
>> ffffffaf 20 45 54 46 ffffffe9 ffffff96 ffffff91
>> Can't encode transcription: '| s 比喻揮霍暐 : 那條款 Friend 蛔蟲所在地開。無膨脹畿 歹人痊癒 醯 
>> ETF閑' in language ''
>> Encoding of string failed! Failure bytes: ffffffe5 ffffff90 ffffff8c 
>> ffffffe6 ffffff99 ffffff82 ffffffe6 ffffff9c ffffff8d ffffffe5 ffffff8b 
>> ffffff99 20 44 56 44 20 ffffffe6 ffffff9c ffffff89 ffffffe6 ffffffa1 
>> ffffff83 ffffffe5 ffffff9c ffffff92 ffffffe9 ffffff9b ffffff99 31 33 20 
>> ffffffe3 ffffff80 ffffff82 20 ffffffe8 ffffff96 ffffff8f ffffffe8 ffffff8b 
>> ffffffa1 ffffffe6 ffffff9c ffffff89 ffffffe9 ffffff84 ffffff89 ffffffe9 
>> ffffff96 ffffff93 ffffffe7 ffffff87 ffffffa7 ffffffe6 ffffff9c ^CCan't 
>> encode transcription: '同時服務 DVD 有桃園雙13 。 薏苡有鄉間燧月【 分歧和窗簾按鈕 。 偌 欲' in 
>> language ''
>>
>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To post to this group, send email to tesser...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/d4e5bf80-6feb-47fb-b28e-8f5f5e58cf34%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
