I suggest that you open an issue in the tesstrain repo. The Makefile trains from scratch. Is that what you wanted? Is your training text large enough? How many lines does it have? How many iterations did you train for?
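To answer the "how many lines" question quickly, something like the following could count the samples in your list files. This is only a minimal sketch. It assumes the lists are plain text with one .lstmf path per line. The path data/kor/list.train is my assumption, based on the data/kor/list.eval path in your lstmeval command. count_samples is a hypothetical helper, not part of tesstrain.

```python
from pathlib import Path


def count_samples(list_file):
    """Count non-empty lines in a list file (assumed: one .lstmf path per line)."""
    path = Path(list_file)
    if not path.exists():
        # A missing list file is reported as zero samples.
        return 0
    with path.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


if __name__ == "__main__":
    # Assumed locations; adjust them to your own layout.
    for name in ("data/kor/list.train", "data/kor/list.eval"):
        print(name, count_samples(name))
```

Please include the two counts when you open the issue.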
Eval Char error rate=133.33333, Word error rate=96.875

That is a very high error rate. You need to get it down close to 0%.

On Thu, Oct 10, 2019 at 11:26 AM J L <[email protected]> wrote:

> My system info:
> - OS: Ubuntu Desktop 18.04 LTS (4.15.0-55-generic)
>
>
> Hi.
>
> I am a beginner and am trying to train on some Korean character images for
> Korean recognition.
>
> To understand how to train with Tesseract 4.0 LSTM, I followed Tesstrain.
>
> I followed the lines of the Makefile in Tesstrain step by step, and most
> of the steps seemed to work fine until creating the traineddata.
>
>
> *In detail:*
>
> 1. I made box files and a unicharset by following these lines
> <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L128-L138>.
>
> 2. I made lstmf files by following these lines
> <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L140-L145>.
>
> 3. I made two split file lists for training and evaluation by following
> these lines
> <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L115-L116>.
>
> 4. Before combining the lang model, I downloaded radical-stroke.txt by
> following this line
> <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L191>,
> and 3 langdata files (kor.punc, kor.numbers, and kor.wordlist) from this
> link <https://github.com/tesseract-ocr/langdata_lstm/tree/master/kor>.
>
> I didn't download the kor.config file because it causes an error saying
> that chi_tra.traineddata is needed.
>
> 5. I combined the lang model by following these lines
> <https://github.com/tesseract-ocr/tesstrain/blob/cf7854cbf2a07013fc3df2bbaddebf719534b27b/Makefile#L255-L264>.
>
> 6. Then I started LSTM training by following these lines
> <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L173-L180>.
>
> 7. I tested them.
> The results are:
> lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmeval \
>   --traineddata data/kor/kor.traineddata \
>   --model data/kor/checkpoints/kor_checkpoint \
>   --eval_listfile data/kor/list.eval
> data/kor/checkpoints/kor_checkpoint is not a recognition model, trying training checkpoint...
> Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp249.lstmf
> Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp228.lstmf
> Truth:먹
> OCR :이
> Truth:독
> OCR :이
> Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp197.lstmf
> Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp41.lstmf
> Truth:파
> OCR :이
> Truth:신
> OCR :열
> ... (skip)
> At iteration 0, stage 0, Eval Char error rate=133.33333, Word error rate=96.875
>
> There seems to be no problem with the results.
>
> 8. I made the traineddata output file.
> lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmtraining --stop_training \
>   --continue_from data/kor/checkpoints/kor_checkpoint \
>   --traineddata data/kor/kor.traineddata \
>   --model_output usr/share/tessdata/kor.traineddata
>
> 9. Then I used tesseract with kor.malgun.exp197.tif. The TIF file was
> recognized as *'이'* in step 7 (testing with lstmeval), so I expected
> the same result.
> lim@ubuntu:~/tools/tesstrain$ usr/bin/tesseract \
>   data/ground-truth/kor.malgun.exp197.tif stdout -l kor --psm 6 > result
>
> But the actual result was a total mess. Here is the result:
>
> [image: res.JPG]
>
>
>
> Why are the results of `lstmeval` and `tesseract` different?
>
> Thank you...
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/074b17ee-cb7c-49a2-a653-1180f6190254%40googlegroups.com
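
A note on how a character error rate can exceed 100%, as the 133.33333 in your log does. The sketch below uses a common definition: total edit distance divided by total ground-truth length. This is an assumption for illustration; lstmeval's own calculation may differ. edit_distance and char_error_rate are hypothetical helper names. The first four pairs are the Truth/OCR lines from your log, and the longer OCR string in the second call is made up to show the effect.

```python
def edit_distance(truth, ocr):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]
        for j, o in enumerate(ocr, start=1):
            cost = 0 if t == o else 1
            curr.append(min(prev[j] + 1,          # delete t
                            curr[j - 1] + 1,      # insert o
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(pairs):
    """Percent error over (truth, ocr) pairs: total edits / total truth length."""
    truth_len = sum(len(t) for t, _ in pairs)
    if truth_len == 0:
        raise ValueError("ground truth is empty")
    errors = sum(edit_distance(t, o) for t, o in pairs)
    return 100.0 * errors / truth_len


# Pairs shown in the lstmeval log: every character is wrong.
print(char_error_rate([("먹", "이"), ("독", "이"), ("파", "이"), ("신", "열")]))
# Made-up pair: the OCR output is longer than the truth, so the rate passes 100%.
print(char_error_rate([("가", "나다")]))
```

Under this definition the four pairs from the log alone give 100%, and any extra characters in the output push the rate higher. A rate in that range is what I would expect from a model that has barely been trained.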

