Re: [tesseract-ocr] Training with a large number of LSTMF files

Lorenzo Bolzani Tue, 11 Sep 2018 08:28:38 -0700

Hi, I trained with about 50k very short samples with no problems, going up
to 50k iterations in several steps.


My suggestion is to train for a small number of iterations (say 1000), check
the accuracy on the validation set (not on the training set), then raise the
target to 2000 (so it trains 1000 more), and so on. Stop when the validation
accuracy peaks.
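
A minimal sketch of such a train/check loop in bash, under stated
assumptions: the list file names (train.training_files.txt and
eval.training_files.txt), the helper names and the DRY_RUN switch are
hypothetical, and the flags are only the basic ones for lstmtraining and
lstmeval. It raises --max_iterations in steps and evaluates the checkpoint
on a held-out list after each step.

```shell
# Print the cumulative iteration targets: step, 2*step, ... up to max.
iteration_targets() {
  local step="$1" max="$2"
  local target="$step"
  while [ "$target" -le "$max" ]; do
    echo "$target"
    target=$((target + step))
  done
}

# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# train_and_check OUT_DIR START_MODEL TRAINEDDATA STEP MAX
# Assumes OUT_DIR holds the two list files named below (illustrative names).
train_and_check() {
  local out_dir="$1" start_model="$2" traineddata="$3" step="$4" max="$5"
  local checkpoint="$out_dir/finetune_checkpoint"
  local continue_from="$start_model"
  local target
  for target in $(iteration_targets "$step" "$max"); do
    # Train up to the next cumulative target.
    run lstmtraining \
      --continue_from "$continue_from" \
      --traineddata "$traineddata" \
      --train_listfile "$out_dir/train.training_files.txt" \
      --model_output "$out_dir/finetune" \
      --max_iterations "$target"
    # Evaluate the checkpoint on files that were kept out of training.
    run lstmeval \
      --model "$checkpoint" \
      --traineddata "$traineddata" \
      --eval_listfile "$out_dir/eval.training_files.txt"
    continue_from="$checkpoint"
  done
}
```

Running it first with DRY_RUN=1 only prints the commands, which is a cheap
way to check the paths before starting a long training run.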

I suppose, but I'm not sure, that the subset of files is randomized, so it
picks a different set on each run. I hope so, or I will have to do it all
over again... Please let me know if you find out.
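
Whatever lstmtraining does internally, the validation files can at least be
chosen at random up front by shuffling the list file before splitting it. A
small sketch, assuming one lstmf path per line; the function name
split_list, its arguments and the seed are illustrative and not part of
Tesseract:

```shell
# split_list LISTFILE EVAL_COUNT TRAIN_OUT EVAL_OUT [SEED]
# Shuffle LISTFILE, write the first EVAL_COUNT lines to EVAL_OUT
# and the remaining lines to TRAIN_OUT.
split_list() {
  local list="$1" eval_count="$2" train_out="$3" eval_out="$4"
  local seed="${5:-1}"
  local shuffled
  shuffled=$(mktemp)
  # Prefix every non-empty line with a random key, sort by it, drop the key.
  awk -v seed="$seed" 'BEGIN { srand(seed) }
       NF { printf "%.9f\t%s\n", rand(), $0 }' "$list" |
    LC_ALL=C sort -n | cut -f2- > "$shuffled"
  head -n "$eval_count" "$shuffled" > "$eval_out"
  tail -n "+$((eval_count + 1))" "$shuffled" > "$train_out"
  rm -f "$shuffled"
}
```

Keeping the seed fixed makes the split reproducible, so the same files stay
in the validation set from one run to the next.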

See here for more details on the train/check loop:

https://groups.google.com/d/msg/tesseract-ocr/be4-rjvY2tQ/dlRK6t6lCgAJ

About the number of iterations: I don't think you can compute it in advance,
and visiting every sample is not that important. Each sample contains many
letters, with different frequencies, so even if you do not use all the
samples, each letter is seen many times.

If your samples are generated from a regular static font, every instance of
a character is identical, and the extra samples only add new letter
sequences and word splits; after a while even these start to repeat. This is
not much different from training on the very same samples many times, and it
leads to overfitting.

Rather than trying to guess or calculate the number of iterations, I think
it's better to just measure the result.
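
For instance, if the validation error is noted down after each step, finding
the peak is a short awk filter. The two-column log format (iterations, then
error in percent) and the function name best_iteration are assumptions for
illustration; the numbers would have to be collected by hand or by script
from the evaluation output.

```shell
# Read lines of "<iterations> <validation_error_percent>" on stdin and
# print the line with the lowest error (the first one wins on ties).
best_iteration() {
  awk 'NF >= 2 && (!seen || $2 + 0 < min) { min = $2 + 0; best = $1; seen = 1 }
       END { if (seen) print best, min }'
}

# Made-up example values, only to show the input format:
printf '1000 5.2\n2000 3.1\n3000 3.4\n' | best_iteration   # prints: 2000 3.1
```

If the error keeps falling at the last step, the peak has not been reached
yet and it is worth raising the target again.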


Bye

Lorenzo

On Tue, 11 Sep 2018 at 14:57, ProgressNotPerfection <
[email protected]> wrote:

> Hi Tesseract Group,
> I am trying to train Tesseract to recognize handwritten characters and
> have prepared several thousand lstmf files (from tif/box sets) so I can
> finetune the best trained eng.traineddata. I read elsewhere on this forum
> that a low number of iterations (say 300 - 400) is recommended when
> finetuning to avoid overfitting. In my case, though, it appears that if I
> choose a low number of iterations, only (approximately) that number of
> lstmf files get loaded by the training process. I assumed that each
> iteration would be a training pass over all the lstmf files. Below is my
> script (which assumes my lstmf files are ready in trained_output_dir). How
> should I amend this so that it loads all my lstmf files? Should the number
> of iterations be greater than the number of lstmf files? Or is there a
> maximum number of lstmf files that can be used for training at once?
>
> Any help would be much appreciated
> Thanks
>
> #! /bin/bash
> #####################################################
> # Script to finetune a language traineddata file for a set of
> # pre built lstmf files and a starter traineddata
> # for tesseract4.0.0-beta
> # Modify directory paths and filenames as required for your setup.
> #####################################################
>
> Lang=eng
> bestdata_dir=~/tesseract-ocr/tessdata_best
> tesstrain_dir=~/tesseract-ocr/src/training
> trained_output_dir=~/tesseract-ocr/src/training/eng-finetune-impact
>
> echo "###### EXTRACT BEST LSTM MODEL ######"
> combine_tessdata -e $bestdata_dir/$Lang.traineddata $bestdata_dir/$Lang.lstm
>
> echo "###### LSTM TRAINING ######"
> echo "#### running lstmtraining for finetuning from $bestdata_dir/$Lang.traineddata #####"
>
> lstmtraining \
> --continue_from  $bestdata_dir/$Lang.lstm \
> --net_spec '[1,49,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c78]' \
> --old_traineddata  $bestdata_dir/$Lang.traineddata \
> --traineddata    $trained_output_dir/$Lang/$Lang.traineddata \
> --max_iterations 400 \
> --debug_interval 0 \
> --train_listfile $trained_output_dir/$Lang.training_files.txt \
> --model_output  $trained_output_dir/finetune
>
> echo "###### BUILD FINETUNED MODEL ######"
> echo "#### Building final trained file $Lang-finetune-$Lang.traineddata ####"
> lstmtraining \
> --stop_training \
> --continue_from $trained_output_dir/finetune_checkpoint \
> --old_traineddata  $bestdata_dir/$Lang.traineddata \
> --traineddata    $trained_output_dir/$Lang/$Lang.traineddata \
> --model_output "$trained_output_dir/$Lang-finetune-$Lang.traineddata"
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/2ccbe310-2cc1-4ee9-b724-e1551d0e7daf%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/2ccbe310-2cc1-4ee9-b724-e1551d0e7daf%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

