Re: [tesseract-ocr] Help for training Akkadian language for Tesseract 4 needed

Wincent Balin Sat, 22 Feb 2020 01:22:24 -0800

Hello Shree,

I tried that. The command was


lstmtraining \
  --traineddata data/akk/akk.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/akk-1m.traineddata \
  --continue_from data/akk-1m/akk.lstm \
  --model_output data/akk/checkpoints/akk \
  --train_listfile data/akk/list.train \
  --eval_listfile data/akk/list.eval \
  --max_iterations 1000 \
  --debug_level -1

and the output started with

Loaded file data/akk/checkpoints/akk_checkpoint, unpacking...
Successfully restored trainer from data/akk/checkpoints/akk_checkpoint
Loaded 1/1 pages (1-1) of document data/akk-ground-truth/P336598.000347.CuneiformComposite.exp0.lstmf
Loaded 1/1 pages (1-1) of document data/akk-ground-truth/P238121.000012.CuneiformNAOutline_Medium.exp0.lstmf

and ended with

Loaded 1/1 pages (1-1) of document data/akk-ground-truth/Q005388.000005.Segoe_UI_Historic.exp0.lstmf
At iteration 4716762/4760600/4760600, Mean rms=1.436%, delta=8.366%, char train=105.86%, word train=86.31%, skip ratio=0%,  wrote checkpoint.

Finished! Error rate = 88.246
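
To follow how char train and word train develop over a long run, the
progress lines can be pulled out of the log. Below is a minimal sketch in
Python whose pattern is modeled only on the progress lines quoted in this
thread. The names PROGRESS_RE and parse_progress are made up for
illustration, and the three iteration counters are returned unnamed because
their exact meaning is not stated here.

```python
import re

# Pattern modeled on the progress lines quoted in this thread; other
# Tesseract versions may print a different format (assumption).
PROGRESS_RE = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+),\s*"
    r"Mean rms=([\d.]+)%,\s*delta=([\d.]+)%,\s*"
    r"char\s+train=([\d.]+)%,\s*word\s+train=([\d.]+)%,\s*"
    r"skip ratio=([\d.]+)%"
)


def parse_progress(line):
    """Return the numbers from one progress line, or None if it does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    counters = tuple(int(match.group(i)) for i in (1, 2, 3))
    rms, delta, char_train, word_train, skip = (
        float(match.group(i)) for i in range(4, 9)
    )
    return {
        "iteration_counters": counters,  # the three slash-separated values
        "mean_rms": rms,
        "delta": delta,
        "char_train": char_train,
        "word_train": word_train,
        "skip_ratio": skip,
    }


if __name__ == "__main__":
    sample = (
        "At iteration 4716762/4760600/4760600, Mean rms=1.436%, "
        "delta=8.366%, char train=105.86%, word train=86.31%, "
        "skip ratio=0%,  wrote checkpoint."
    )
    print(parse_progress(sample))
```

The pattern uses \s+ between "char" and "train" so that a line wrapped by a
mail client still matches.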

Do I have to retrain completely from scratch, meaning without loading 
the previous checkpoint?

Maybe I should try another of your approaches and train with one font 
excluded, so that the LSTM converges.

Another thought: I tried training Akkadian with Tesseract 4 once before, 
but with ground truth consisting of short text files with multiple lines of 
text, not one-liners. Obviously I used PSM 6, not PSM 11. Is there anything 
wrong with this approach?
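
On the char train values above 100% in the logs in this thread: a character
error rate that is normalized by the length of the ground truth can exceed
100% whenever the recognizer outputs many more (wrong) characters than the
truth contains. The sketch below illustrates only that arithmetic. It
assumes the error is a count of mismatched characters divided by the
ground-truth length; the function char_error_rate is hypothetical and is not
Tesseract's actual code.

```python
from collections import Counter


def char_error_rate(truth, ocr):
    """Mismatched character counts divided by the ground-truth length.

    Illustration only (assumed definition): per-character count
    differences between truth and output are summed, then normalized
    by the number of ground-truth characters.
    """
    truth_counts = Counter(truth)
    ocr_counts = Counter(ocr)
    errors = sum(
        abs(truth_counts[ch] - ocr_counts[ch])
        for ch in set(truth_counts) | set(ocr_counts)
    )
    if not truth:
        return 0.0 if errors == 0 else 1.0
    return errors / len(truth)


if __name__ == "__main__":
    # Perfect output: no mismatches.
    print(char_error_rate("ab", "ab"))
    # Three surplus characters against two truth characters: 3 / 2.
    print(char_error_rate("ab", "abcde"))
```

With a two-character ground truth and three surplus output characters, the
rate is 3 / 2, i.e. 150%, so a value above 100% is arithmetically possible
under this kind of normalization.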


On Monday, 17 February 2020 08:23:38 UTC+1, shree wrote:
>
> Try lstmtraining again for 1000 iterations with --debug_level -1 
>
> On Mon, Feb 17, 2020, 01:46 Wincent Balin wrote:
>
>> Hello all,
>>
>> after preparing ground truth files for the Akkadian language, I started 
>> the training using the *tesstrain* Makefile, but over 4000000 iterations 
>> later, the output looks like the following:
>>
>> At iteration 4437804/4478900/4478900, Mean rms=1.453%, delta=9.455%, char train=121.423%, word train=87.461%, skip ratio=0%,  wrote checkpoint.
>>
>> Does char train=121% mean CER of 121%? What could be the cause for such 
>> high values even after over 10 days of training?
>>
>> Yours truly,
>>
>> Wincent
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/79acb8ca-cb51-4e23-8853-ca4b3405a718%40googlegroups.com
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/c5ccc3c8-f18f-4540-93e8-b55ffb37c3ac%40googlegroups.com.
