Re: [tesseract-ocr] Why are the results of lstmeval and tesseract different?

J L Thu, 10 Oct 2019 06:09:19 -0700

Okay, I will do as you suggested.

Thank you for answering my question.



On Thursday, October 10, 2019 at 8:23:09 PM UTC+9, shree wrote:
>
> I suggest that you open an issue in the tesstrain repo.
>
> The makefile does training from scratch. Is that what you wanted? Do you 
> have a large enough training text - how many lines? How many iterations for 
> training?
>
>  Eval Char error rate=133.33333, Word error rate=96.875
>
> That is a very high error rate. You need to get it down to 0%.
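
The question about training set size can be answered by counting the lines in the list files. A minimal sketch, assuming the lists sit next to the `data/kor/list.eval` file named in the original message; `list.train` and the helper `count_lines` are assumed names, not taken from the thread:

```shell
#!/bin/sh
# count_lines is a hypothetical helper: print the number of lines in a list
# file, or 0 if the file does not exist.
count_lines() {
    if [ -f "$1" ]; then
        wc -l < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# data/kor/list.eval appears in the thread; list.train is an assumed name.
for list in data/kor/list.train data/kor/list.eval; do
    echo "$list: $(count_lines "$list") lines"
done
```

Each line in these lists points to one .lstmf file, so the counts give the number of training and evaluation samples.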
>
> On Thu, Oct 10, 2019 at 11:26 AM J L <[email protected]> wrote:
>
>> My system info:
>> - OS: Ubuntu Desktop 18.04 LTS (4.15.0-55-generic)
>>
>>
>> Hi.
>>
>> I am a beginner and am trying to train on some Korean character images for 
>> Korean recognition.
>>
>> To understand how to train with Tesseract 4.0 LSTM, I followed Tesstrain.
>>
>> I followed the lines of the Makefile in Tesstrain step by step, and most of 
>> the steps seemed to work fine until creating the traineddata.
>>
>>
>> *In detail:*
>>
>> 1. I made box files and unicharset by following these lines 
>> <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L128-L138>
>> .
>>
>> 2. I made lstmf files by following these lines 
>> <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L140-L145>
>> .
>>
>> 3. I made two split file lists for training and evaluation by following these 
>> lines 
>> <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L115-L116>
>> .
>>
>> 4. Before combining the language model, I downloaded radical-stroke.txt by 
>> following this line 
>> <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L191>, 
>> and 3 langdata files (kor.punc, kor.numbers, and kor.wordlist) from this 
>> link <https://github.com/tesseract-ocr/langdata_lstm/tree/master/kor>.
>>
>>     I didn't download the kor.config file because it causes an error saying that 
>> chi_tra.traineddata is needed.
>>
>> 5. I combined the language model by following these lines 
>> <https://github.com/tesseract-ocr/tesstrain/blob/cf7854cbf2a07013fc3df2bbaddebf719534b27b/Makefile#L255-L264>
>> .
>>
>> 6. Then I started LSTM training by following these lines 
>> <https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L173-L180>
>> .
>>
>> 7. I tested the checkpoint with lstmeval. The results look like this:
>> lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmeval \
>> --traineddata data/kor/kor.traineddata \
>> --model data/kor/checkpoints/kor_checkpoint \
>> --eval_listfile data/kor/list.eval
>> data/kor/checkpoints/kor_checkpoint is not a recognition model, trying 
>> training checkpoint...
>> Loaded 1/1 lines (1-1) of document 
>> data/ground-truth/kor.malgun.exp249.lstmf
>> Loaded 1/1 lines (1-1) of document 
>> data/ground-truth/kor.malgun.exp228.lstmf
>> Truth:먹
>> OCR  :이
>> Truth:독
>> OCR  :이
>> Loaded 1/1 lines (1-1) of document 
>> data/ground-truth/kor.malgun.exp197.lstmf
>> Loaded 1/1 lines (1-1) of document 
>> data/ground-truth/kor.malgun.exp41.lstmf
>> Truth:파
>> OCR  :이
>> Truth:신
>> OCR  :열
>> ... (skip)
>> At iteration 0, stage 0, Eval Char error rate=133.33333, Word error 
>> rate=96.875
>>
>> There seems to be no problem with the results.
>>
>> 8. I made the traineddata output file.
>> lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmtraining --stop_training \
>> --continue_from data/kor/checkpoints/kor_checkpoint \
>> --traineddata data/kor/kor.traineddata \
>> --model_output usr/share/tessdata/kor.traineddata
>>
>> 9. Then I used tesseract with kor.malgun.exp197.tif. The TIF file was 
>> recognized as *'이'* when I followed step 7 (testing with lstmeval), so I 
>> expected the same result.
>> lim@ubuntu:~/tools/tesstrain$ usr/bin/tesseract \
>> data/ground-truth/kor.malgun.exp197.tif stdout -l kor --psm 6 > result
>>
>> But the actual result was a total mess. Here is the result:
>>
>> [image: res.JPG]
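
The split into training and evaluation lists from step 3 can be sketched as follows. This is an illustration under stated assumptions, not the Makefile's own recipe: the 90/10 split, the `all-lstmf` file name, and the 20 generated entries are all made up for the example, and everything is written to a temporary directory.

```shell
#!/bin/sh
# Sketch only: split a list of .lstmf paths into training and evaluation lists.
set -eu

workdir=$(mktemp -d)
train_percent=90   # assumed split, not taken from the thread

# Stand-in for the real list of .lstmf files (20 made-up entries).
i=1
while [ "$i" -le 20 ]; do
    echo "data/ground-truth/kor.malgun.exp$i.lstmf"
    i=$((i + 1))
done > "$workdir/all-lstmf"

total=$(wc -l < "$workdir/all-lstmf" | tr -d ' ')
train_count=$((total * train_percent / 100))
eval_count=$((total - train_count))

# The first part goes to training, the remainder to evaluation.
head -n "$train_count" "$workdir/all-lstmf" > "$workdir/list.train"
tail -n "$eval_count" "$workdir/all-lstmf" > "$workdir/list.eval"

echo "train: $train_count lines, eval: $eval_count lines"
```

With real data it is usually worth shuffling the full list first (for example with `shuf`), so that the evaluation set is not just the last few files in directory order.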
>>
>>
>>
>> Why are the results of `lstmeval` and `tesseract` different?
>>
>> Thank you...
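
One difference between the two tools that may be worth checking: lstmeval reads the prepared .lstmf files, while the tesseract command in step 9 runs page segmentation on the TIF first (`--psm 6`, a single uniform block of text). For an image holding a single character, other segmentation modes may behave differently. The sketch below only prints the step 9 command for a few modes, and runs it when tesseract and the image are available. It assumes `tesseract` is on the PATH rather than at `usr/bin/tesseract`, and `psm_command` is a hypothetical helper. It does not address the high error rate reported by lstmeval.

```shell
#!/bin/sh
# Sketch only: print (and, if possible, run) the step 9 command with several
# page segmentation modes. psm_command is a hypothetical helper.
psm_command() {
    # $1 = image path, $2 = language, $3 = page segmentation mode
    echo "tesseract $1 stdout -l $2 --psm $3"
}

image=data/ground-truth/kor.malgun.exp197.tif   # path from step 9
# 6 = single uniform block, 7 = single text line, 10 = single character
for psm in 6 7 10; do
    echo "# $(psm_command "$image" kor "$psm")"
    # Run only when tesseract and the image are actually available.
    if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
        tesseract "$image" stdout -l kor --psm "$psm"
    fi
done
```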
>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/074b17ee-cb7c-49a2-a653-1180f6190254%40googlegroups.com.
>>
>
>
> -- 
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>

