My system info:
- OS: Ubuntu Desktop 18.04 LTS (4.15.0-55-generic)
Hi. I am a beginner and am trying to train Tesseract on Korean character images for Korean recognition. To understand how to train with Tesseract 4.0 LSTM, I followed Tesstrain. I went through the lines of the Tesstrain Makefile step by step, and most of the steps seemed to work fine until I created the traineddata.

*In detail:*

1. I made box files and the unicharset by following these lines: <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L128-L138>.

2. I made lstmf files by following these lines: <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L140-L145>.

3. I split the file list into two lists, one for training and one for evaluation, by following these lines: <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L115-L116>.

4. Before combining the language model, I downloaded radical-stroke.txt by following this line <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L191>, and 3 langdata files (kor.punc, kor.numbers, and kor.wordlist) from this link: <https://github.com/tesseract-ocr/langdata_lstm/tree/master/kor>. I did not download the kor.config file because it causes an error saying that chi_tra.traineddata is needed.

5. I combined the language model by following these lines: <https://github.com/tesseract-ocr/tesstrain/blob/cf7854cbf2a07013fc3df2bbaddebf719534b27b/Makefile#L255-L264>.

6. Then I started LSTM training by following these lines: <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L173-L180>.

7. I tested the checkpoint with `lstmeval`. The results look like this:

    lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmeval --traineddata data/kor/kor.traineddata --model data/kor/checkpoints/kor_checkpoint --eval_listfile data/kor/list.eval
    data/kor/checkpoints/kor_checkpoint is not a recognition model, trying training checkpoint...
    Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp249.lstmf
    Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp228.lstmf
    Truth:먹
    OCR :이
    Truth:독
    OCR :이
    Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp197.lstmf
    Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp41.lstmf
    Truth:파
    OCR :이
    Truth:신
    OCR :열
    ... (skipped)
    At iteration 0, stage 0, Eval Char error rate=133.33333, Word error rate=96.875

   This step seems to run without problems.

8. I created the traineddata output file:

    lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmtraining --stop_training \
        --continue_from data/kor/checkpoints/kor_checkpoint \
        --traineddata data/kor/kor.traineddata \
        --model_output usr/share/tessdata/kor.traineddata

9. Then I ran `tesseract` on kor.malgun.exp197.tif. In step 7 (testing with `lstmeval`), this TIF file was recognized as *'이'*, so I expected the same result.

    lim@ubuntu:~/tools/tesstrain$ usr/bin/tesseract data/ground-truth/kor.malgun.exp197.tif stdout -l kor --psm 6 > result

   But the actual result was a total mess. This is the result:

   [image: res.JPG]

Why are the results of `lstmeval` and `tesseract` different?

Thank you...
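For reference, steps 8 and 9 can be collected into one small script so that the directory step 8 writes to and the directory `tesseract` reads from are stated in one place. This is only a sketch under assumptions: the `DRY_RUN` switch, the `run` helper, and the variable names are made up for illustration, and `--tessdata-dir` is a standard `tesseract` option that was not part of my original command. I have not verified whether it changes the result.

```shell
#!/bin/sh
# Sketch of steps 8 and 9 with the tessdata directory made explicit.
# DRY_RUN, run(), and all variable names are illustrative assumptions.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = also execute them
LANG_CODE="kor"
CHECKPOINT="data/kor/checkpoints/kor_checkpoint"
START_MODEL="data/kor/kor.traineddata"
TESSDATA_DIR="usr/share/tessdata"
IMAGE="data/ground-truth/kor.malgun.exp197.tif"

run() {
    # Print the command line; execute it only when DRY_RUN is 0.
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Step 8: convert the training checkpoint into a traineddata file.
finalize_model() {
    run usr/bin/lstmtraining --stop_training \
        --continue_from "$CHECKPOINT" \
        --traineddata "$START_MODEL" \
        --model_output "$TESSDATA_DIR/$LANG_CODE.traineddata"
}

# Step 9: recognize one image, pointing tesseract at the same directory
# that step 8 wrote to (assumption: this is the file that should be loaded).
recognize() {
    run usr/bin/tesseract "$IMAGE" stdout \
        --tessdata-dir "$TESSDATA_DIR" -l "$LANG_CODE" --psm 6
}

finalize_model
recognize
```

With the default `DRY_RUN=1` the script only prints the two command lines, which makes it easy to check the paths before running anything; setting `DRY_RUN=0` executes them from the tesstrain directory.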

