Re: [tesseract-ocr] Change unicharset

2018-04-12 Thread Fanatico
I already did it, but I keep getting these errors from "training/tesstrain.sh":
No block overlapping textline: 가능한 튤립 첫 칼럼 절차 주 - 하기 말썽쟁이 같다 ㆍ 상품권 팁 |
No block overlapping textline: 겪은 덕숀 수 대 라이브 넥 ' 토론 게시판 10 헵번 등 관련 담뿍
No block overlapping textline: 자리 유통 댈 월 피부 쥬얼리 에 뿌찢 타겟 그룹 안팎일 똑똑한
No block overlapping textline: 카페 빅뱅 같이 에 전문 악화 심사 오픈 범죄 모르는 추측 협회 선택

Warning: properties incomplete for index 71 = l
Warning: properties incomplete for index 153 = ,
Warning: properties incomplete for index 182 = ?
Warning: properties incomplete for index 313 = 1
Warning: properties incomplete for index 314 = 0
Warning: properties incomplete for index 368 = 5
Warning: properties incomplete for index 579 = ]
Warning: properties incomplete for index 720 = -
Warning: properties incomplete for index 918 = 2
Warning: properties incomplete for index 941 = ¥
Warning: properties incomplete for index 969 = &

Other case L of l is not in unicharset
Mirror 〔 of 〕 is not in unicharset
Mirror 】 of 【 is not in unicharset
Mirror [ of ] is not in unicharset
Mirror 「 of 」 is not in unicharset

And these errors from "training/lstmeval":
Can't encode transcription: '泰1.5 愚4 鎢共和15 欹地 鯪 閒聊 叫價23 350.00 庴設備 經理. 
的學習56次華僑' in language ''
Encoding of string failed! Failure bytes: ffe5 ff8f ffaf 
ffe8 ff83 ffbd ffe6 ffac ffa1 ffe6 ff95 
ffb8 20 ffe9 ffb5 ff91 20 ffe7 ff80 ff86 20 3a 
2e 20 ffe6 ff88 ff91 32 35 4a 61 76 61 20 ffe6 ff8a 
ff8a 3e 20 ffe8 ffa9 ff95 ffe5 ff83 ffb9 20 
ffe5 ff86 ffa4 ffe5 ffae ffb6 5b 20 ffe6 
ff94 ff9d ffe5 ffbd ffb1 20 ffe9 ffa0 ff81 
ffe9 ff9d ffa2 20 ffe5 ff88 ff86 ffe9 ff9b 
ffa2 28 32 37 20 32 39 20 ffe7 ffac ffac 20 ffe7 
ff86 ffbe ffe7 ff86 ffb1
Can't encode transcription: ') 可能次數 鵑 瀆 :. 我25Java 把> 評價 冤家[ 攝影 頁面 分離(27 29 
第 熾熱' in language ''
Encoding of string failed! Failure bytes: ffe5 ff9f ffba 
ffe9 ff87 ff91 58 44 20 ffe5 ff85 ffa8 ffe6 
ff96 ff87 20 ffe7 ffb1 ff93 20 ffe9 ff97 
ff9c ffe9 ff96 ff89 ffe8 ffb9 ffbc ffe9 
ff91 ffb0 ffe5 ff8c ff99 ffe8 ff81 ffb7 29 
20 30 32 ffe7 ff82 ff96 ffe9 ffbe ff8d ffe6 
ff8d ffb2 ffe9 ffa2 ffa8 ffe5 ff85 ff83 
ffe9 ff89 ff86 ffe5 ffba ff8a ffe7 ffb7 
ff9a ffe4 ffb8 ff8a ffe9 ffb8 ff9e 20 ffe5 
ffaa ffbe ffe5 ff92 ff8c 53 75 7a 75 6b 69 20 ffe6 
ff89 ff80 ffef ffb9 ff9c 20 31 32
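
The `ffXX` tokens in the failure bytes look like sign-extended UTF-8 bytes (this is an assumption based on the dumps above). Under that assumption, dropping the `ff` prefix recovers the text: the run `ffe6 ff88 ff91 32 35 4a 61 76 61` from the first dump becomes `我25Java`, which appears in the second transcription. A small sketch, where `decode_failure_bytes` is a hypothetical helper name:

```shell
# Decode an lstmeval "Failure bytes" dump back into text.
# Assumption: every 4-digit "ffXX" token is a sign-extended byte XX.
decode_failure_bytes() {
  for tok in $1; do
    case "$tok" in
      ff??) tok=${tok#ff} ;;   # drop the sign-extension prefix
    esac
    # Emit the byte through an octal escape, which POSIX printf understands.
    printf "\\$(printf '%03o' "0x$tok")"
  done
  printf '\n'
}

decode_failure_bytes "ffe6 ff88 ff91 32 35 4a 61 76 61"   # prints: 我25Java
```

Decoding a dump this way makes it easier to see which characters the unicharset could not encode.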



Here are the steps I followed (I repeated them after each of the methods I 
reported in the first post):

# Fine tuning kor - tesseract 4.0

Reference: 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

Remember that ```kor.traineddata``` has the line ``` ```

### My folder organization

```
~/projects
├── ocr
│   └── training
│       └── kortrain
│           ├── eval
│           ├── new_train
│           └── kor_from_full
│
├── tesseract
│   └── tessdata
│
├── tessdata_best
└── langdata
```

Korean fonts that come with the Mac:
AppleGothic weight=237
AppleGothic weight=237 Italic
AppleMyungjo
AppleMyungjo Italic


### fonts working
  "Arial Unicode MS" \
  "HCR Batang" \
  "Source Han Serif" \
  "Source Han Serif K" \
  "Source Han Serif SC" \

### fonts not working
  "210 Byulbitcha" \
  "210 Byulddongbyul" \
  "210 HaneuljungwonOTF" \
  "210 Misslee" \
  "210 Sangsangongjakso" \
  "210 Sunflower" \
  "Baekmuk Batang" \
  "Baekmuk Gulim" \
  "Baekmuk Dotum" \
  "DX아기사랑B" \
  "HanS" \
  "HL" \
  "NanumGothic Eco" \
  "NanumMyeongjo Eco" \
  "SangSangAnt" \
  "Typo_JeongJo" \
  "Typo_SSiMyungJo" \
  "UhBee Joker" \

### Steps

1 - Add these paths in your ~/.bash_profile
```
export PANGOCAIRO_BACKEND=fc
export TESSDATADIR=~/projects/tesseract/tessdata
export SCROLLVIEW_PATH=~/projects/tesseract/java
```

1 - Create the folders:
```
mkdir ~/projects
mkdir ~/projects/ocr
mkdir ~/projects/ocr/training
mkdir ~/projects/ocr/training/kortrain/
mkdir ~/projects/ocr/training/kortrain/new_train
mkdir ~/projects/ocr/training/kortrain/kor_from_full
mkdir ~/projects/ocr/training/kortrain/eval
```

2 - Create a new ```kor.training_text``` and include the lines to generate 
the images. (The more the better)

Note: Since Korean can use some Chinese characters (hanja), I'm merging the 
```kor.training_text``` with the ```chi_tra.training_text```

Reference:
https://en.wikipedia.org/wiki/Hanja
https://www.howtostudykorean.com/hanja-unit-1-lessons-1-20/hanja-lesson-1/

3 - Save this new ```kor.training_text``` to 
```~/projects/ocr/training/kortrain/new_train```


4 - Download and install this font
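
Steps 2 and 3 above (merging the two training texts and saving the result) could be sketched roughly as follows. The `langdata` sub-paths (`kor/kor.training_text`, `chi_tra/chi_tra.training_text`) and the helper name `merge_training_text` are assumptions for illustration, not something confirmed in this thread:

```shell
# Merge two training texts into one file (sketch only).
merge_training_text() {
  # $1, $2: input training texts; $3: merged output file.
  # "awk 1" prints every line with a trailing newline, so the last line
  # of the first file is never glued to the first line of the second.
  awk 1 "$1" "$2" > "$3"
}

LANGDATA=~/projects/langdata                        # assumed checkout location
OUT=~/projects/ocr/training/kortrain/new_train/kor.training_text

# Only run when both inputs exist, so the snippet is safe to paste.
if [ -f "$LANGDATA/kor/kor.training_text" ] &&
   [ -f "$LANGDATA/chi_tra/chi_tra.training_text" ]; then
  mkdir -p "$(dirname "$OUT")"
  merge_training_text "$LANGDATA/kor/kor.training_text" \
                      "$LANGDATA/chi_tra/chi_tra.training_text" "$OUT"
fi
```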

Re: [tesseract-ocr] Change unicharset

2018-04-12 Thread ShreeDevi Kumar
You cannot just overwrite the lstm.unicharset in a traineddata file; the
unicharset has to be in sync with the other files in it, i.e. lstm, dawgs,
recoder etc.

>  I'm merging the ```kor.training_text``` with the
```chi_tra.training_text``` for tests

You need to go through the complete training process after this. Only then
will both sets of characters be reflected in it.

You can try "add a layer" training, with tessdata_best/kor.traineddata as the
model to continue from.
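
As a rough illustration of continuing from tessdata_best/kor.traineddata, here is a dry-run sketch modeled on the fine-tuning commands in the Tesseract 4.0 training wiki (extract the LSTM model, then continue training against a traineddata that carries the new unicharset). It is not necessarily the exact layer-training variant, and everything under `TRAIN_DIR`, the `run` and `continue_from_best` helpers, and the iteration count are assumptions:

```shell
# Sketch only. DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}
BEST=~/projects/tessdata_best/kor.traineddata
TRAIN_DIR=~/projects/ocr/training/kortrain      # hypothetical working layout

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

continue_from_best() {
  # 1. Extract the LSTM model from the existing traineddata.
  run combine_tessdata -e "$BEST" "$TRAIN_DIR/kor_from_full/kor.lstm"
  # 2. Continue training; --traineddata is assumed to be the file that
  #    tesstrain.sh produced from the merged training text.
  run lstmtraining \
    --model_output "$TRAIN_DIR/kor_from_full/kor" \
    --continue_from "$TRAIN_DIR/kor_from_full/kor.lstm" \
    --traineddata "$TRAIN_DIR/new_train/kor/kor.traineddata" \
    --old_traineddata "$BEST" \
    --train_listfile "$TRAIN_DIR/new_train/kor.training_files.txt" \
    --max_iterations 3600   # arbitrary value for illustration
}

continue_from_best
```

Running with `DRY_RUN=0` would execute the two commands, provided `combine_tessdata` and `lstmtraining` are on the `PATH` and the assumed files exist.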

ShreeDevi



[tesseract-ocr] Change unicharset

2018-04-12 Thread Fanatico
I'm trying to add Chinese to my Korean charset, but I'm not able to do it.

Note: Since Korean can use some Chinese characters (hanja), I'm merging the 
```kor.training_text``` with the ```chi_tra.training_text``` for tests

Reference:
https://en.wikipedia.org/wiki/Hanja
https://www.howtostudykorean.com/hanja-unit-1-lessons-1-20/hanja-lesson-1/

I tried to use:
combine_tessdata -u ~/projects/tessdata_best/kor.traineddata ~/projects/ocr/tmp/kor.
combine_tessdata -o ~/projects/tesseract/tessdata/kor.traineddata ~/projects/ocr/tmp/kor.lstm-unicharset

I tried to use this line in "training/tesstrain.sh":
--wordlist ~/projects/ocr/training/kortrain/kor.wordlist \

and I tried to use this line in the "kor.config" file
tessedit_load_sublangs chi_tra


But all of these failed. If I run "training/tesstrain.sh" and open the 
"kor/kor.unicharset" file, it contains only the Korean charset, and I get 
errors like these:
Other case L of l is not in unicharset
Mirror 〔 of 〕 is not in unicharset
Mirror 】 of 【 is not in unicharset
Mirror [ of ] is not in unicharset
Mirror 「 of 」 is not in unicharset
Setting script properties
Warning: properties incomplete for index 71 = l
Warning: properties incomplete for index 153 = ,
Warning: properties incomplete for index 182 = ?
Warning: properties incomplete for index 313 = 1
Warning: properties incomplete for index 314 = 0
Warning: properties incomplete for index 368 = 5
Warning: properties incomplete for index 579 = ]
Warning: properties incomplete for index 720 = -
Warning: properties incomplete for index 918 = 2
Warning: properties incomplete for index 941 = ¥
Warning: properties incomplete for index 969 = &
Config file is optional, continuing...
Null char=2

If I run a test with "training/lstmeval" on data that has Chinese and Korean 
characters:
~/projects/tesseract/training/lstmeval \
  --model ~/projects/tesseract/tessdata/kor.traineddata \
  --eval_listfile ~/projects/ocr/training/kortrain/eval/kor.training_files.txt

I get a lot of these errors:
Can't encode transcription: '文章輯旭攝影會員肥功能 桐獎功能 時可以麂榻榻米(瘋狂using 辛亥道具' in 
language ''
Encoding of string failed! Failure bytes: ffe6 ffa0 ffb4 
ffe6 ffaa ff80 ffe6 ffbd ff98 ffe7 ff9f 
ffb3 ffe5 ffb1 ffb9 ffe5 ffaf ffba ffe5 
ffbb ff9f ffe5 ffb3 ffbb 20 ffe7 ffa7 ff92 
ffe4 ffb8 ff89 ffe8 ff89 ffb2 ffe8 ff8f 
ffab 20 ffe6 ff98 ff9f ffe6 ff9c ff9f ffe4 
ffba ff94 ffe5 ff98 ffa7 43 44 ffe4 ffbd 
ffbf ffe7 ff94 ffa8 ffe6 ffb4 ffaa ffe7 
ff91 ff9e ffe9 ff9c ff99 ffe6 ff85 ffb3 
ffe5 ff8d ff94 ffe8 ffad ffb0 20 ffe6 ff84 
ff9f ffe5 ff98 ff86 32 37 ffe6 ff92 ffb3 20 
ffe6 ffb1 ff95 ffe5 ffb0 ffbe
Can't encode transcription: '栴檀潘石屹寺廟峻 秒三色菫 星期五嘧CD使用洪瑞霙慳協議 感嘆27撳 汕尾' in 
language ''
Encoding of string failed! Failure bytes: ffe5 ffad ffa2 
ffe5 ffad ff90 4c 56 20 ffe6 ffb7 ffb1 ffe5 
ff9c ffb3 20 ffe5 ff92 ff96 ffe5 ff95 ffa1 
20 ffe4 ffb8 ff8a ffe7 ffb7 ff9a 20 ffe6 
ffa6 ffab 20 ffe9 ff83 ffad ffe6 ffb3 ff93 
ffe5 ffbf ff97 ffe6 ff92 ffac 20 28 ffe6 
ffb0 ff91 ffe5 ff9c ff8b ffe6 ff9b ff86 20 
ffe6 ffb7 ffa4 ffe7 ffa9 ff8d 47 55 43 43 49 30 38 
ffe5 ff87 ffba ffe6 ff88 ff96 ffe8 ff80 
ff85 ffe6 ff94 ffbf 7c 68 61 73
Can't encode transcription: '孢子LV 深圳 咖啡 上線 榫 郭泓志撬 (民國曆 淤積GUCCI08出或者政|has' 
in language ''
Encoding of string failed! Failure bytes: ffe5 ff88 ff97 
ffe8 ffa1 ffa8 ffe7 ff9a ff84 ffe3 ff80 
ff8f ffe9 ff86 ff8d ffe9 ff86 ff90 20 2d 
ffe4 ffb8 ff80 ffe5 ff85 ffb6 ffe9 ffa4 
ff98 ffe6 ffb3 ff95 ffe5 ff8b ff99 37 36 38 
ffe4 ffb9 ff9e ffe4 ffb8 ff90 ffe7 ff9e 
ffb3 ffe5 ffad ff94 ffe8 ffa9 ff95 ffe5 
ff88 ff86 20 ffe7 ff8b ffb8 20 4d 6f 6f 6e 4b 4f 52 45 
41 20 ffe5 ff9d ff87 ffe7 ffa2 ff91 ffe5 
ff9c ff98 ffe9 ff9a ff8a 20 31 39 39 35 ffe8 
ffb6 ffba 20 ffe5 ff91 ff82
Can't encode transcription: '列表的』醍醐 -一其餘法務768乞丐瞳孔評分 狸 MoonKOREA 均碑團隊 1995趺 
呂' in language ''
Encoding of string failed! Failure bytes: 73 20 ffe6 ffaf ff94 
ffe5 ff96 ffbb ffe6 ff8f ffae ffe9 ff9c 
ff8d ffe6 ff9a ff90 20 3a 20 ffe9 ff82 ffa3 
ffe6 ffa2 ff9d ffe6 ffac ffbe 20 46 72 69 65 6e 64 
20