Re: [tesseract-ocr] Change unicharset
I already did it, but I keep getting these errors from "training/tesstrain.sh":

No block overlapping textline: 가능한 튤립 첫 칼럼 절차 주 - 하기 말썽쟁이 같다 ㆍ 상품권 팁 |
No block overlapping textline: 겪은 덕숀 수 대 라이브 넥 ' 토론 게시판 10 헵번 등 관련 담뿍
No block overlapping textline: 자리 유통 댈 월 피부 쥬얼리 에 뿌찢 타겟 그룹 안팎일 똑똑한
No block overlapping textline: 카페 빅뱅 같이 에 전문 악화 심사 오픈 범죄 모르는 추측 협회 선택
Warning: properties incomplete for index 71 = l
Warning: properties incomplete for index 153 = ,
Warning: properties incomplete for index 182 = ?
Warning: properties incomplete for index 313 = 1
Warning: properties incomplete for index 314 = 0
Warning: properties incomplete for index 368 = 5
Warning: properties incomplete for index 579 = ]
Warning: properties incomplete for index 720 = -
Warning: properties incomplete for index 918 = 2
Warning: properties incomplete for index 941 = ¥
Warning: properties incomplete for index 969 = &
Other case L of l is not in unicharset
Mirror 〔 of 〕 is not in unicharset
Mirror 】 of 【 is not in unicharset
Mirror [ of ] is not in unicharset
Mirror 「 of 」 is not in unicharset

And these errors from "training/lstmeval":

Can't encode transcription: '泰1.5 愚4 鎢共和15 欹地 鯪 閒聊 叫價23 350.00 庴設備 經理. 的學習56次華僑' in language ''
Encoding of string failed! Failure bytes: ffe5 ff8f ffaf ffe8 ff83 ffbd ffe6 ffac ffa1 ffe6 ff95 ffb8 20 ffe9 ffb5 ff91 20 ffe7 ff80 ff86 20 3a 2e 20 ffe6 ff88 ff91 32 35 4a 61 76 61 20 ffe6 ff8a ff8a 3e 20 ffe8 ffa9 ff95 ffe5 ff83 ffb9 20 ffe5 ff86 ffa4 ffe5 ffae ffb6 5b 20 ffe6 ff94 ff9d ffe5 ffbd ffb1 20 ffe9 ffa0 ff81 ffe9 ff9d ffa2 20 ffe5 ff88 ff86 ffe9 ff9b ffa2 28 32 37 20 32 39 20 ffe7 ffac ffac 20 ffe7 ff86 ffbe ffe7 ff86 ffb1
Can't encode transcription: ') 可能次數 鵑 瀆 :. 我25Java 把> 評價 冤家[ 攝影 頁面 分離(27 29 第 熾熱' in language ''
Encoding of string failed! Failure bytes: ffe5 ff9f ffba ffe9 ff87 ff91 58 44 20 ffe5 ff85 ffa8 ffe6 ff96 ff87 20 ffe7 ffb1 ff93 20 ffe9 ff97 ff9c ffe9 ff96 ff89 ffe8 ffb9 ffbc ffe9 ff91 ffb0 ffe5 ff8c ff99 ffe8 ff81 ffb7 29 20 30 32 ffe7 ff82 ff96 ffe9 ffbe ff8d ffe6 ff8d ffb2 ffe9 ffa2 ffa8 ffe5 ff85 ff83 ffe9 ff89 ff86 ffe5 ffba ff8a ffe7 ffb7 ff9a ffe4 ffb8 ff8a ffe9 ffb8 ff9e 20 ffe5 ffaa ffbe ffe5 ff92 ff8c 53 75 7a 75 6b 69 20 ffe6 ff89 ff80 ffef ffb9 ff9c 20 31 32

Here are the steps I followed (I repeated them after each of the methods I reported in the first post):

# Fine tuning kor - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

Remember that ```kor.traineddata``` has the line
```
```

### My folder organization
```
~/projects
├── ocr
|   └── training
|       └── kortrain
|           ├── eval
|           ├── new_train
|           └── kor_from_full
|
├── tesseract
|   └── tessdata
|
├── tessdata_best
└── langdata
```

Korean fonts that come with the Mac:
AppleGothic weight=237
AppleGothic weight=237 Italic
AppleMyungjo
AppleMyungjo Italic

### Fonts working
"Arial Unicode MS" \
"HCR Batang" \
"Source Han Serif" \
"Source Han Serif K" \
"Source Han Serif SC" \

### Fonts not working
"210 Byulbitcha" \
"210 Byulddongbyul" \
"210 HaneuljungwonOTF" \
"210 Misslee" \
"210 Sangsangongjakso" \
"210 Sunflower" \
"Baekmuk Batang" \
"Baekmuk Gulim" \
"Baekmuk Dotum" \
"DX아기사랑B" \
"HanS" \
"HL" \
"NanumGothic Eco" \
"NanumMyeongjo Eco" \
"SangSangAnt" \
"Typo_JeongJo" \
"Typo_SSiMyungJo" \
"UhBee Joker" \

### Steps

1 - Add these paths to your ~/.bash_profile:
```
export PANGOCAIRO_BACKEND=fc
export TESSDATADIR=~/projects/tesseract/tessdata
export SCROLLVIEW_PATH=~/projects/tesseract/java
```

1 - Create the folders:
```
mkdir ~/projects
mkdir ~/projects/ocr
mkdir ~/projects/ocr/training
mkdir ~/projects/ocr/training/kortrain/
mkdir ~/projects/ocr/training/kortrain/new_train
mkdir ~/projects/ocr/training/kortrain/kor_from_full
mkdir ~/projects/ocr/training/kortrain/eval
```

2 - Create a new ```kor.training_text``` and include the lines used to generate the images (the more the better).

Note: Since Korean can use some Chinese characters (hanja), I'm merging the ```kor.training_text``` with the ```chi_tra.training_text```.

Reference:
https://en.wikipedia.org/wiki/Hanja
https://www.howtostudykorean.com/hanja-unit-1-lessons-1-20/hanja-lesson-1/

3 - Save this new ```kor.training_text``` to ```~/projects/ocr/training/kortrain/new_train```

4 - Download and install this font
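
For steps 2 and 3, one possible way to build the merged training text is sketched below. The function name `merge_training_text` is made up for illustration, and dropping blank and duplicate lines is my own choice, not something the steps above require. The input paths in the usage comment assume the usual langdata layout (`langdata/kor/kor.training_text`, `langdata/chi_tra/chi_tra.training_text`), so adjust them to wherever your copies live.

```bash
# Hypothetical helper: concatenate several training texts into one file,
# skipping blank lines and lines that were already seen.
# Usage: merge_training_text OUTPUT INPUT...
# OUTPUT must not be one of the INPUT files (it is truncated first).
merge_training_text() {
  local out="$1"
  shift
  cat "$@" | awk 'NF && !seen[$0]++' > "$out"
}

# Example call (paths are assumptions based on the folder layout above):
#   merge_training_text \
#     ~/projects/ocr/training/kortrain/new_train/kor.training_text \
#     ~/projects/langdata/kor/kor.training_text \
#     ~/projects/langdata/chi_tra/chi_tra.training_text
```

Merging the text only changes what gets rendered into training images. The new characters still have to reach the model through a full training or fine-tuning run.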
Re: [tesseract-ocr] Change unicharset
You cannot just overwrite the lstm-unicharset in a traineddata file. The unicharset has to be in sync with the other components in it, i.e. lstm, dawgs, recoder, etc.

> I'm merging the ```kor.training_text``` with the ```chi_tra.training_text``` for tests

You need to go through the complete training process after this. Only then will both sets of characters be reflected in it.

You can try "add a layer" training, with tessdata_best/kor.traineddata to continue from.

ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 13, 2018 at 7:51 AM, Fanatico wrote:
> I'm trying to add Chinese to my Korean charset, but I'm not able to do it.
> [...]
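
As a rough illustration of "continue from tessdata_best/kor.traineddata", here is a sketch based on the wiki's "fine tuning for ± a few characters" recipe. Note that this is not the "add a layer" variant mentioned above, which additionally needs `--append_index` and `--net_spec` values chosen for the kor network. The helper names `run` and `finetune_kor`, the output name `kor_plus_hanja`, and the iteration count are placeholders of mine. It also assumes `combine_tessdata` and `lstmtraining` are on the PATH, and that the new traineddata (the one holding the merged unicharset) was already produced by tesstrain.sh.

```bash
# Print each command; actually run it only when DRY_RUN is unset or empty.
run() {
  printf '+ %s\n' "$*"
  if [ -z "${DRY_RUN:-}" ]; then "$@"; fi
}

# Hypothetical wrapper.
# Usage: finetune_kor BEST_TRAINEDDATA NEW_TRAINEDDATA WORK_DIR TRAIN_LISTFILE
finetune_kor() {
  local best="$1" new="$2" work="$3" listfile="$4"

  # 1. Extract the LSTM model from the existing traineddata.
  run combine_tessdata -e "$best" "$work/kor.lstm" || return 1

  # 2. Continue training from that model. --traineddata carries the new
  #    unicharset and --old_traineddata the original one, so the trainer
  #    can map the old character set onto the new one.
  #    3600 iterations is only a placeholder value.
  run lstmtraining \
    --model_output "$work/kor_plus_hanja" \
    --continue_from "$work/kor.lstm" \
    --traineddata "$new" \
    --old_traineddata "$best" \
    --train_listfile "$listfile" \
    --max_iterations 3600
}

# Example call (paths are assumptions based on the thread's folder layout):
#   DRY_RUN=1 finetune_kor ~/projects/tessdata_best/kor.traineddata \
#     /path/to/new/kor.traineddata \
#     ~/projects/ocr/training/kortrain/kor_from_full \
#     /path/to/kor.training_files.txt
```

With `DRY_RUN=1` the function only prints the two commands, which makes it easy to check the paths before starting a long training run.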
[tesseract-ocr] Change unicharset
I'm trying to add Chinese to my Korean charset, but I'm not able to do it.

Note: Since Korean can use some Chinese characters (hanja), I'm merging the ```kor.training_text``` with the ```chi_tra.training_text``` for tests.

Reference:
https://en.wikipedia.org/wiki/Hanja
https://www.howtostudykorean.com/hanja-unit-1-lessons-1-20/hanja-lesson-1/

I tried to use:
combine_tessdata -u ~/projects/tessdata_best/kor.traineddata ~/projects/ocr/tmp/kor.
combine_tessdata -o ~/projects/tesseract/tessdata/kor.traineddata ~/projects/ocr/tmp/kor.lstm-unicharset

I tried to use this line with "training/tesstrain.sh":
--wordlist ~/projects/ocr/training/kortrain/kor.wordlist \

and I tried to use this line in the "kor.config" file:
tessedit_load_sublangs chi_tra

But all of these failed. If I run "training/tesstrain.sh" and open the "kor/kor.unicharset" file, it only contains the Korean charset, and I get errors like these:
Other case L of l is not in unicharset
Mirror 〔 of 〕 is not in unicharset
Mirror 】 of 【 is not in unicharset
Mirror [ of ] is not in unicharset
Mirror 「 of 」 is not in unicharset
Setting script properties
Warning: properties incomplete for index 71 = l
Warning: properties incomplete for index 153 = ,
Warning: properties incomplete for index 182 = ?
Warning: properties incomplete for index 313 = 1
Warning: properties incomplete for index 314 = 0
Warning: properties incomplete for index 368 = 5
Warning: properties incomplete for index 579 = ]
Warning: properties incomplete for index 720 = -
Warning: properties incomplete for index 918 = 2
Warning: properties incomplete for index 941 = ¥
Warning: properties incomplete for index 969 = &
Config file is optional, continuing...
Null char=2

If I run a test with "training/lstmeval" on data that has Chinese and Korean characters:
~/projects/tesseract/training/lstmeval \
  --model ~/projects/tesseract/tessdata/kor.traineddata \
  --eval_listfile ~/projects/ocr/training/kortrain/eval/kor.training_files.txt

I get a lot of these errors:
Can't encode transcription: '文章輯旭攝影會員肥功能 桐獎功能 時可以麂榻榻米(瘋狂using 辛亥道具' in language ''
Encoding of string failed! Failure bytes: ffe6 ffa0 ffb4 ffe6 ffaa ff80 ffe6 ffbd ff98 ffe7 ff9f ffb3 ffe5 ffb1 ffb9 ffe5 ffaf ffba ffe5 ffbb ff9f ffe5 ffb3 ffbb 20 ffe7 ffa7 ff92 ffe4 ffb8 ff89 ffe8 ff89 ffb2 ffe8 ff8f ffab 20 ffe6 ff98 ff9f ffe6 ff9c ff9f ffe4 ffba ff94 ffe5 ff98 ffa7 43 44 ffe4 ffbd ffbf ffe7 ff94 ffa8 ffe6 ffb4 ffaa ffe7 ff91 ff9e ffe9 ff9c ff99 ffe6 ff85 ffb3 ffe5 ff8d ff94 ffe8 ffad ffb0 20 ffe6 ff84 ff9f ffe5 ff98 ff86 32 37 ffe6 ff92 ffb3 20 ffe6 ffb1 ff95 ffe5 ffb0 ffbe
Can't encode transcription: '栴檀潘石屹寺廟峻 秒三色菫 星期五嘧CD使用洪瑞霙慳協議 感嘆27撳 汕尾' in language ''
Encoding of string failed! Failure bytes: ffe5 ffad ffa2 ffe5 ffad ff90 4c 56 20 ffe6 ffb7 ffb1 ffe5 ff9c ffb3 20 ffe5 ff92 ff96 ffe5 ff95 ffa1 20 ffe4 ffb8 ff8a ffe7 ffb7 ff9a 20 ffe6 ffa6 ffab 20 ffe9 ff83 ffad ffe6 ffb3 ff93 ffe5 ffbf ff97 ffe6 ff92 ffac 20 28 ffe6 ffb0 ff91 ffe5 ff9c ff8b ffe6 ff9b ff86 20 ffe6 ffb7 ffa4 ffe7 ffa9 ff8d 47 55 43 43 49 30 38 ffe5 ff87 ffba ffe6 ff88 ff96 ffe8 ff80 ff85 ffe6 ff94 ffbf 7c 68 61 73
Can't encode transcription: '孢子LV 深圳 咖啡 上線 榫 郭泓志撬 (民國曆 淤積GUCCI08出或者政|has' in language ''
Encoding of string failed! Failure bytes: ffe5 ff88 ff97 ffe8 ffa1 ffa8 ffe7 ff9a ff84 ffe3 ff80 ff8f ffe9 ff86 ff8d ffe9 ff86 ff90 20 2d ffe4 ffb8 ff80 ffe5 ff85 ffb6 ffe9 ffa4 ff98 ffe6 ffb3 ff95 ffe5 ff8b ff99 37 36 38 ffe4 ffb9 ff9e ffe4 ffb8 ff90 ffe7 ff9e ffb3 ffe5 ffad ff94 ffe8 ffa9 ff95 ffe5 ff88 ff86 20 ffe7 ff8b ffb8 20 4d 6f 6f 6e 4b 4f 52 45 41 20 ffe5 ff9d ff87 ffe7 ffa2 ff91 ffe5 ff9c ff98 ffe9 ff9a ff8a 20 31 39 39 35 ffe8 ffb6 ffba 20 ffe5 ff91 ff82
Can't encode transcription: '列表的』醍醐 -一其餘法務768乞丐瞳孔評分 狸 MoonKOREA 均碑團隊 1995趺 呂' in language ''
Encoding of string failed! Failure bytes: 73 20 ffe6 ffaf ff94 ffe5 ff96 ffbb ffe6 ff8f ffae ffe9 ff9c ff8d ffe6 ff9a ff90 20 3a 20 ffe9 ff82 ffa3 ffe6 ffa2 ff9d ffe6 ffac ffbe 20 46 72 69 65 6e 64 20
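
To see which text a "Failure bytes" dump refers to, the hex can be turned back into UTF-8. Judging from the log above, every byte of 0x80 or more appears to be printed sign-extended (`ffe5` standing for `e5`), while ASCII bytes are printed plainly (`4c` is `L`). The small helper below is a sketch under that assumption. The name `decode_failure_bytes` is made up and it is not part of the Tesseract tools.

```bash
# Hypothetical helper: decode the hex tokens of a "Failure bytes:" line.
# Usage: decode_failure_bytes ffe5 ffad ffa2 4c 56 ...
decode_failure_bytes() {
  local tok
  for tok in "$@"; do
    case "$tok" in
      ff??) tok="${tok#ff}" ;;   # drop the sign-extension prefix: ffe5 -> e5
    esac
    # Convert the hex byte to a 3-digit octal escape and print that byte.
    printf "\\$(printf '%03o' "0x$tok")"
  done
}

# Example: the start of the second dump in the log above.
#   decode_failure_bytes ffe5 ffad ffa2 ffe5 ffad ff90 4c 56
```

Decoding a few of the dumps this way suggests that each "Failure bytes" list belongs to the "Can't encode transcription" line printed right after it, not the one before it. In every case the text contains hanja, which would be consistent with the unicharset still holding only the Korean characters.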