Re: [tesseract-ocr] Help for training Akkadian language for Tesseract 4 needed

Shree Devi Kumar Sat, 22 Feb 2020 01:57:24 -0800

Try the following, i.e. with a new --model_output name so that training starts
again from iteration 0 instead of resuming the old checkpoint. The debug output
for each iteration (one line of text) will show you whether any particular font
is not aligning or whether there are other issues.


lstmtraining \
  --traineddata data/akk/akk.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/akk-1m.traineddata \
  --continue_from data/akk-1m/akk.lstm \
  --model_output data/akk/checkpoints/akkNEW \
  --train_listfile data/akk/list.train \
  --eval_listfile data/akk/list.eval \
  --max_iterations 1000 \
  --debug_level -1
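
The reason for the new name: the log quoted below shows lstmtraining restoring
data/akk/checkpoints/akk_checkpoint, which is the --model_output value with
"_checkpoint" appended, so reusing the old name resumes the old run. A minimal
Python sketch to check this before launching. The helper name will_resume is
made up for illustration, and the "_checkpoint" suffix is inferred from that
log line, not taken from documentation:

```python
from pathlib import Path


def will_resume(model_output: str) -> bool:
    """Return True if a checkpoint file already exists for this prefix.

    Assumption: the checkpoint path is the --model_output prefix plus
    "_checkpoint", as the "Loaded file .../akk_checkpoint" log line suggests.
    """
    return Path(model_output + "_checkpoint").is_file()


if __name__ == "__main__":
    for prefix in ("data/akk/checkpoints/akk", "data/akk/checkpoints/akkNEW"):
        state = "would resume the old run" if will_resume(prefix) else "would start fresh"
        print(f"{prefix}: {state}")
```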


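On the char train values above 100% in the quoted logs: an error rate that is
normalized by the length of the ground truth can exceed 100% whenever the
recognizer outputs many more (wrong) symbols than the truth contains. The
sketch below uses the textbook definition, Levenshtein distance divided by
truth length. It is only an illustration; lstmtraining may compute its own
figure differently, and the function names are my own:

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]
        for j, o in enumerate(ocr, start=1):
            cost = 0 if t == o else 1
            curr.append(min(prev[j] + 1,          # drop a truth symbol
                            curr[j - 1] + 1,      # extra symbol in the output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer_percent(truth: str, ocr: str) -> float:
    """Character error rate in percent, normalized by the truth length."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return 100.0 * edit_distance(truth, ocr) / len(truth)


print(cer_percent("abc", "abc"))      # identical strings: 0.0
print(cer_percent("abc", "xxyyzzq"))  # 3 truth symbols, 7 wrong outputs: above 100
```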

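If the per-line debug output points at one font, that font can be left out of
the training list for a test run. A rough sketch, assuming the .lstmf names
follow the <id>.<line>.<font>.exp0.lstmf pattern seen in the quoted log and
that the list file holds one path per line. font_of and without_font are
hypothetical helper names:

```python
from collections import Counter
from pathlib import PurePosixPath


def font_of(lstmf_path: str) -> str:
    """Extract the font from a name like P336598.000347.CuneiformComposite.exp0.lstmf."""
    parts = PurePosixPath(lstmf_path.strip()).name.split(".")
    if len(parts) < 5 or parts[-1] != "lstmf":
        raise ValueError(f"unexpected file name: {lstmf_path!r}")
    return parts[-3]


def without_font(paths, font):
    """Return the paths whose font component differs from the given font."""
    return [p for p in paths if font_of(p) != font]


paths = [
    "data/akk-ground-truth/P336598.000347.CuneiformComposite.exp0.lstmf",
    "data/akk-ground-truth/P238121.000012.CuneiformNAOutline_Medium.exp0.lstmf",
    "data/akk-ground-truth/Q005388.000005.Segoe_UI_Historic.exp0.lstmf",
]
print(Counter(font_of(p) for p in paths))        # lines per font
print(without_font(paths, "Segoe_UI_Historic"))  # list with one font removed
```

The filtered list could then be written to a separate file and passed via
--train_listfile, leaving the original list.train untouched.
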
On Sat, Feb 22, 2020 at 2:52 PM Wincent Balin <[email protected]>
wrote:

> Hello Shree,
>
> I tried that. The command was
>
> lstmtraining   --traineddata data/akk/akk.traineddata   --old_traineddata
> /usr/share/tesseract-ocr/4.00/tessdata/akk-1m.traineddata   --continue_from
> data/akk-1m/akk.lstm   --model_output data/akk/checkpoints/akk
> --train_listfile data/akk/list.train   --eval_listfile data/akk/list.eval
> --max_iterations 1000   --debug_level -1
>
> and the output started with
>
> Loaded file data/akk/checkpoints/akk_checkpoint, unpacking...
> Successfully restored trainer from data/akk/checkpoints/akk_checkpoint
> Loaded 1/1 pages (1-1) of document
> data/akk-ground-truth/P336598.000347.CuneiformComposite.exp0.lstmf
> Loaded 1/1 pages (1-1) of document
> data/akk-ground-truth/P238121.000012.CuneiformNAOutline_Medium.exp0.lstmf
>
> and ended with
>
> Loaded 1/1 pages (1-1) of document
> data/akk-ground-truth/Q005388.000005.Segoe_UI_Historic.exp0.lstmf
> At iteration 4716762/4760600/4760600, Mean rms=1.436%, delta=8.366%, char
> train=105.86%, word train=86.31%, skip ratio=0%,  wrote checkpoint.
>
> Finished! Error rate = 88.246
>
> Do I have to retrain completely from scratch, meaning without loading
> the previous checkpoint?
>
> Maybe I should try a different approach from yours and train with
> one font excluded, so that the LSTM converges.
>
> Another thought: I tried training Akkadian with Tesseract 4 once before,
> but with ground truth consisting of short text files with multiple lines of
> text, not one-liners. Obviously I used PSM 6, not PSM 11. Is there anything
> wrong with this approach?
>
>
> On Monday, 17 February 2020 08:23:38 UTC+1, shree wrote:
>>
>> Try lstmtraining again for 1000 iterations with --debug_level -1
>>
>>
>>
>>
>> On Mon, Feb 17, 2020, 01:46 Wincent Balin <[email protected]> wrote:
>>
>>> Hello all,
>>>
>>> after preparing ground truth files for the Akkadian language, I started the
>>> training using the *tesstrain* Makefile, but over 4000000 iterations
>>> later, the output looks like the following:
>>>
>>> At iteration 4437804/4478900/4478900, Mean rms=1.453%, delta=9.455%,
>>> char train=121.423%, word train=87.461%, skip ratio=0%,  wrote checkpoint.
>>>
>>> Does char train=121% mean CER of 121%? What could be the cause for such
>>> high values even after over 10 days of training?
>>>
>>> Yours truly,
>>>
>>> Wincent
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/79acb8ca-cb51-4e23-8853-ca4b3405a718%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/79acb8ca-cb51-4e23-8853-ca4b3405a718%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

