Hi, I trained with about 50k very short samples with no problems, going up to 50k iterations in several steps.
My suggestion is to train for a small number of iterations (say 1000), check the accuracy on the validation set (not on the training set), then raise the target to 2000 (so it trains 1000 more), and so on, stopping when the validation accuracy peaks.

I suppose, but I'm not sure about this, that the subset of files is randomized, so a different set is picked on each run. I hope so, or I will have to do it all over again... Please let me know if you find out.

See here for more details on the train/check loop:
https://groups.google.com/d/msg/tesseract-ocr/be4-rjvY2tQ/dlRK6t6lCgAJ

About the number of iterations: I don't think you can compute it, and it is not so important to visit all the samples. Each sample contains many letters, with different frequencies, so even if you do not use all the samples, each letter is seen many times. If your samples are generated from a normal static font, every instance of a character looks identical, and the extra samples just add more cases of letter sequences and word splits; even these will start to repeat after a while. This is not much different from training on the very same samples many times, and it leads to overfitting.

Rather than trying to guess or calculate the number of iterations, I think it's better to just measure the result.

Bye

Lorenzo

On Tue, 11 Sep 2018 at 14:57, ProgressNotPerfection <[email protected]> wrote:

> Hi Tesseract Group
>
> I am trying to train tesseract to recognize handwritten characters and
> have prepared several thousand lstmf files (from tif/box sets) so I can
> finetune best trained eng.traineddata. I read elsewhere on this forum that
> a low number (say 300 - 400) of iterations is recommended when finetuning
> to avoid overfitting. In my case though it appears that if I choose a low
> number of iterations, only (approximately) that number of lstmf files get
> loaded by the training process. I assumed that each iteration would be a
> training pass over all the lstmf files.
> Below is my script (which assumes
> my lstmf files are ready in trained_output_dir). How should I amend this so
> that it loads all my lstmf files? Should the number of iterations be
> greater than the number of lstmf files? ... or is there a maximum number of
> lstmf files that can be used for training at once?
>
> Any help would be much appreciated
> Thanks
>
> #! /bin/bash
> #####################################################
> # Script to finetune a language traineddata file for a set of
> # pre built lstmf files and a starter traineddata
> # for tesseract4.0.0-beta
> # Modify directory paths and filenames as required for your setup.
> #####################################################
>
> Lang=eng
> bestdata_dir=~/tesseract-ocr/tessdata_best
> tesstrain_dir=~/tesseract-ocr/src/training
> trained_output_dir=~/tesseract-ocr/src/training/eng-finetune-impact
>
> echo "###### EXTRACT BEST LSTM MODEL ######"
> combine_tessdata -e $bestdata_dir/$Lang.traineddata $bestdata_dir/$Lang.lstm
>
> echo "###### LSTM TRAINING ######"
> echo "#### running lstmtraining for finetuning from $bestdata_dir/$Lang.traineddata #####"
>
> lstmtraining \
>   --continue_from $bestdata_dir/$Lang.lstm \
>   --net_spec '[1,49,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c78]' \
>   --old_traineddata $bestdata_dir/$Lang.traineddata \
>   --traineddata $trained_output_dir/$Lang/$Lang.traineddata \
>   --max_iterations 400 \
>   --debug_interval 0 \
>   --train_listfile $trained_output_dir/$Lang.training_files.txt \
>   --model_output $trained_output_dir/finetune
>
> echo "###### BUILD FINETUNED MODEL ######"
> echo "#### Building final trained file $Lang-finetune-$Lang.traineddata ####"
> lstmtraining \
>   --stop_training \
>   --continue_from $trained_output_dir/finetune_checkpoint \
>   --old_traineddata $bestdata_dir/$Lang.traineddata \
>   --traineddata $trained_output_dir/$Lang/$Lang.traineddata \
>   --model_output "$trained_output_dir/$Lang-finetune-$Lang.traineddata"
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/2ccbe310-2cc1-4ee9-b724-e1551d0e7daf%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
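
The train/check loop suggested in the reply can be sketched roughly as follows. This is only an illustration under several assumptions, not a tested recipe: the lstmtraining command is copied from the quoted script with only --max_iterations changed; eval_list is a hypothetical list of held-out lstmf files that are not in the training list; lstmeval is assumed to print a line containing "Char error rate=<number>"; and re-running lstmtraining with the same --model_output is assumed to resume from the existing finetune_checkpoint. The helper names (parse_char_error, is_lower, train_check_loop) are made up for this sketch.

```shell
#!/bin/bash
# Sketch of a train/check loop: raise --max_iterations in steps and
# evaluate each checkpoint on a held-out list with lstmeval.
# Paths are taken from the quoted script; eval_list is HYPOTHETICAL.

Lang=eng
bestdata_dir=~/tesseract-ocr/tessdata_best
trained_output_dir=~/tesseract-ocr/src/training/eng-finetune-impact
eval_list=$trained_output_dir/$Lang.eval_files.txt   # hypothetical held-out list

# Read lstmeval output on stdin and print the last character error rate.
# ASSUMPTION: the report contains "Char error rate=<number>".
parse_char_error() {
  sed -n 's/.*Char error rate=\([0-9.eE+-]*\).*/\1/p' | tail -n 1
}

# Succeed (exit status 0) when $1 is numerically smaller than $2.
is_lower() {
  awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 < b + 0) }'
}

# train_check_loop STEP MAX: train in chunks of STEP iterations up to MAX,
# stopping at the first step where the validation error does not improve.
train_check_loop() {
  step=$1; max=$2; target=$step; best=""
  while [ "$target" -le "$max" ]; do
    # Same command as the quoted script, with a rising iteration target.
    # ASSUMPTION: an existing finetune_checkpoint is picked up automatically.
    lstmtraining \
      --continue_from "$bestdata_dir/$Lang.lstm" \
      --net_spec '[1,49,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c78]' \
      --old_traineddata "$bestdata_dir/$Lang.traineddata" \
      --traineddata "$trained_output_dir/$Lang/$Lang.traineddata" \
      --max_iterations "$target" \
      --debug_interval 0 \
      --train_listfile "$trained_output_dir/$Lang.training_files.txt" \
      --model_output "$trained_output_dir/finetune" || return 1

    # Measure the error on files the trainer has never seen.
    err=$(lstmeval \
      --model "$trained_output_dir/finetune_checkpoint" \
      --traineddata "$trained_output_dir/$Lang/$Lang.traineddata" \
      --eval_listfile "$eval_list" 2>&1 | parse_char_error)
    if [ -z "$err" ]; then
      echo "could not read an error rate from lstmeval output" >&2
      return 1
    fi
    echo "iterations=$target validation char error=$err"

    # Stop once the error is no longer lower than the best seen so far.
    if [ -n "$best" ] && ! is_lower "$err" "$best"; then
      echo "validation error stopped improving at $target iterations"
      break
    fi
    best=$err
    target=$((target + step))
  done
}

# Example call (commented out so that sourcing this file has no side effects):
# train_check_loop 1000 10000
```

Stopping at the first step that fails to improve is the simplest possible rule; since the error can fluctuate from step to step, it may be safer to keep a copy of each checkpoint and pick the best one afterwards.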

