[tesseract-ocr] Why are the results of lstmeval and tesseract different?

J L Wed, 09 Oct 2019 22:56:25 -0700

My system info:
- OS: Ubuntu Desktop 18.04 LTS (4.15.0-55-generic)

Hi.

I am a beginner and am trying to train Tesseract on some Korean character
images for Korean recognition.

To understand how to train with Tesseract 4.0 LSTM, I followed Tesstrain.

I followed the lines of the Makefile in Tesstrain step by step, and most of
the steps seemed to work fine until creating the traineddata.

*In detail:*

1. I made the box files and unicharset by following these lines
<https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L128-L138>.

2. I made the lstmf files by following these lines
<https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L140-L145>.

3. I made two split file lists for training and evaluation by following these
lines
<https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L115-L116>.
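
Roughly, the split in step 3 amounts to something like the sketch below. This
is not the exact Makefile recipe: the function name `split_lstmf_lists`, the
intermediate file name `all-lstmf`, and the 90% ratio are assumptions made
for illustration, and `shuf` assumes GNU coreutils.

```shell
#!/bin/sh
# Sketch: split the .lstmf files into a training list and an evaluation list.
# Assumption: the function name, "all-lstmf", and the default ratio are
# illustrative only and are not taken from the Tesstrain Makefile.
split_lstmf_lists() {
    gt_dir=$1       # directory containing the .lstmf files
    out_dir=$2      # directory where list.train and list.eval are written
    ratio=${3:-90}  # percentage of lines that go into list.train

    mkdir -p "$out_dir"

    # Collect every .lstmf path in random order.
    find "$gt_dir" -name '*.lstmf' | shuf > "$out_dir/all-lstmf"

    total=$(wc -l < "$out_dir/all-lstmf")
    n_train=$(( total * ratio / 100 ))

    # The first n_train lines are for training, the remainder for evaluation.
    head -n "$n_train" "$out_dir/all-lstmf" > "$out_dir/list.train"
    tail -n "+$(( n_train + 1 ))" "$out_dir/all-lstmf" > "$out_dir/list.eval"
}
```

With my directory layout this would be called as
`split_lstmf_lists data/ground-truth data/kor 90`.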

4. Before combining the lang model, I downloaded radical-stroke.txt by
following this line
<https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L191>, and
3 langdata files (kor.punc, kor.numbers, and kor.wordlist) from this link
<https://github.com/tesseract-ocr/langdata_lstm/tree/master/kor>.

I didn't download the kor.config file because it causes an error saying that
chi_tra.traineddata is needed.

5. I combined the lang model by following these lines
<https://github.com/tesseract-ocr/tesstrain/blob/cf7854cbf2a07013fc3df2bbaddebf719534b27b/Makefile#L255-L264>.

6. Then I started LSTM training by following these lines
<https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L173-L180>.

7. I tested the checkpoint with lstmeval. The results look like this:
lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmeval --traineddata
data/kor/kor.traineddata --model data/kor/checkpoints/kor_checkpoint
--eval_listfile data/kor/list.eval
data/kor/checkpoints/kor_checkpoint is not a recognition model, trying
training checkpoint...
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp249.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp228.lstmf
Truth:먹
OCR :이
Truth:독
OCR :이
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp197.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp41.lstmf
Truth:파
OCR :이
Truth:신
OCR :열
... (skip)
At iteration 0, stage 0, Eval Char error rate=133.33333, Word error
rate=96.875

There seems to be no problem with the results.

8. I made the traineddata output file.
lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmtraining --stop_training \
--continue_from data/kor/checkpoints/kor_checkpoint \
--traineddata data/kor/kor.traineddata \
--model_output usr/share/tessdata/kor.traineddata

9. Then I ran tesseract on kor.malgun.exp197.tif. This TIF file was recognized
as *'이'* in step 7 (testing with lstmeval), so I expected the same result
here.
lim@ubuntu:~/tools/tesstrain$ usr/bin/tesseract
data/ground-truth/kor.malgun.exp197.tif stdout -l kor --psm 6 > result

But the actual result was a total mess. Here is the result:

[image: res.JPG]

Why are the results of `lstmeval` and `tesseract` different?

Thank you...
