Here are the bash script files:
1. for finetune training for 'Impact' - to add a font
2. for finetune plus-minus training - to add a new character

On Thu, Jun 21, 2018 at 1:40 AM Shree Devi Kumar <[email protected]> wrote:

> Attached is a BASH script for finetune training for 'Impact' (refer to
> Ray's tutorial in the wiki for more details).
> Use this when you want to finetune a model for a single new font.
>
> You will need to change the paths for directories and filenames based on
> your system.
>
> The script assumes that you have tesseract 4.0.0-beta installed along with
> the training tools. Refer to the wiki main page for info on how to download
> the latest version of the code from PPA etc.
>
> Please read through the script first, change as needed, create the
> required training texts and then run the script.
>
> On Wed, Jun 20, 2018 at 9:14 PM Shree Devi Kumar <[email protected]>
> wrote:
>
>> https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh
>>
>> I haven't trained with tesseract 3 for a while. I will post instructions
>> for tesseract 4 later.
>>
>> On Wed, Jun 20, 2018 at 9:05 PM Navaneetha Bitla <[email protected]>
>> wrote:
>>
>>> Can you help us by explaining how to train with tesstrain.sh?
>>>
>>> It will help all of us; we are thankful to you.
>>>
>>> On Wed, Jun 20, 2018 at 8:59 PM, Shree Devi Kumar <[email protected]>
>>> wrote:
>>>
>>>> You will have better control over training if you use tesstrain.sh
>>>> provided with tesseract.
>>>>
>>>> On Wed, Jun 20, 2018 at 8:52 PM Navaneetha Bitla <[email protected]>
>>>> wrote:
>>>>
>>>>> http://www.1001fonts.com/handwritten-fonts.html
>>>>>
>>>>> The above link has 1900+ fonts. From that site I downloaded the
>>>>> ttf files of the fonts and converted them to tiff files online.
>>>>>
>>>>> Then I trained on the tiff files (fonts) using Serak trainer.
>>>>>
>>>>> If you get the accuracy, just forward the results so everyone can know
>>>>> and follow you.
>>>>>
>>>>> Thank you
>>>>>
>>>>> On Wed, Jun 20, 2018 at 3:13 PM, James Q <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I'm going to be using tesseract 4 and using the tesstrain.sh script.
>>>>>> If I come across things that improve accuracy though I will let you know.
>>>>>>
>>>>>> Where did you find 1300 handwriting fonts?
>>>>>>
>>>>>> On Tuesday, June 19, 2018 at 5:19:54 PM UTC+1, Navaneetha Bitla wrote:
>>>>>>>
>>>>>>> I am using Serak trainer, training tesseract 3.5.
>>>>>>>
>>>>>>> On Tue, Jun 19, 2018 at 9:29 PM, James Q <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Navaneetha
>>>>>>>> I am also looking to start training tesseract using handwritten
>>>>>>>> fonts and am about to start setting up my training environment. Are you
>>>>>>>> training tesseract 4 by following the guide at
>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>>>>>> ?
>>>>>>>>
>>>>>>>> If so, are you fine tuning the existing English model, retraining
>>>>>>>> just the top layer(s) or training from scratch with your additional
>>>>>>>> fonts?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Jim
>>>>>>>>
>>>>>>>> On Tuesday, June 19, 2018 at 10:30:30 AM UTC+1, Navaneetha Bitla
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi, this is Navaneetha
>>>>>>>>>
>>>>>>>>> I'm working on a handwritten character recognition project.
>>>>>>>>>
>>>>>>>>> I have trained 1300 different handwritten fonts of English and
>>>>>>>>> moved the files into the tessdata directory.
>>>>>>>>>
>>>>>>>>> I tested tesseract using the commands below:
>>>>>>>>>
>>>>>>>>> $ convert -density 300 input.png -depth 8 -strip -background white -alpha off out.tiff
>>>>>>>>>
>>>>>>>>> $ tesseract out.tiff eng
>>>>>>>>>
>>>>>>>>> The input.png is in the Alanis Hand font and I have trained this font,
>>>>>>>>> but I'm not getting even 40% accuracy.
>>>>>>>>>
>>>>>>>>> Can someone help me?
>>>>>>>>>
>>>>>>>>> Thanks in advance.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWE-H37wg-J9u9H9uRNx%2B1ttPQe_WZ3%2BLchNgr8Z%2BEXPA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

#!/bin/bash
#####################################################
# Script to finetune a language traineddata file for one new font
# for tesseract4.0.0-beta
# Modify directory paths and filenames as required for your setup.
#####################################################
# Choose which parts of script are to be run?
MakeData=yes
RunTraining=yes
RunEval=yes
#####################################################

# Language
Lang=eng

# downloaded directory with language data
langdata_dir=~/langdata

# Make about 150 lines of representative training text for finetuning
finetune_training_text=$langdata_dir/$Lang/$Lang.finetune.training_text

# Make small representative text for evaluation
eval_training_text=$langdata_dir/$Lang/$Lang.eval.training_text

# fonts directory for this system
fonts_dir=~/.fonts

# Finetune training for ONE font ONLY
# (the tutorial uses Impact; 'Alanis Hand' is the font used here)
fonts_for_training=" \
  'Alanis Hand' \
  "

# directory with the old 'best' language training set to continue from eg. ara, eng, san
bestdata_dir=~/tessdata_best

# tessdata-dir which has osd.traineddata, eng.traineddata, config and tessconfigs folder and pdf.ttf
tessdata_dir=~/tessdata

# directory with training scripts - tesstrain.sh etc.
tesstrain_dir=~/tesseract/src/training

# output directories for this run
trained_output_dir=./$Lang-finetune-impact
eval_output_dir=./$Lang-finetune-impact-eval

if [ "$MakeData" = "yes" ]; then

echo "###### MAKING EVAL DATA ######"
# recreate the eval directory (the original created the training directory here)
rm -rf $eval_output_dir
mkdir -p $eval_output_dir

echo "#### running tesstrain.sh for eval text ####"

# eval is needed so the quoted font names in $fonts_for_training stay single arguments
eval bash $tesstrain_dir/tesstrain.sh \
  --lang $Lang \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0" \
  --fonts_dir $fonts_dir \
  --fontlist $fonts_for_training \
  --langdata_dir $langdata_dir \
  --tessdata_dir $tessdata_dir \
  --training_text $eval_training_text \
  --output_dir $eval_output_dir

echo "###### MAKING TRAINING DATA ######"
rm -rf $trained_output_dir
mkdir -p $trained_output_dir

echo "#### running tesstrain.sh for training text ####"

eval bash $tesstrain_dir/tesstrain.sh \
  --lang $Lang \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0" \
  --fonts_dir $fonts_dir \
  --fontlist $fonts_for_training \
  --langdata_dir $langdata_dir \
  --tessdata_dir $tessdata_dir \
  --training_text $finetune_training_text \
  --output_dir $trained_output_dir

echo "#### running combine_tessdata to extract lstm model from 'tessdata_best' for $Lang ####"

combine_tessdata -e $bestdata_dir/$Lang.traineddata $bestdata_dir/$Lang.lstm

fi

if [ "$RunTraining" = "yes" ]; then

echo "###### LSTM TRAINING ######"

echo "#### running lstmtraining for finetuning from $bestdata_dir/$Lang.traineddata #####"

lstmtraining \
  --continue_from $bestdata_dir/$Lang.lstm \
  --traineddata $bestdata_dir/$Lang.traineddata \
  --max_iterations 1000 \
  --debug_interval 0 \
  --train_listfile $trained_output_dir/$Lang.training_files.txt \
  --model_output $trained_output_dir/finetune

echo "###### BUILD FINETUNED MODEL ######"

echo "#### Building final trained file $Lang-finetune-$Lang.traineddata ####"

lstmtraining \
  --stop_training \
  --continue_from $trained_output_dir/finetune_checkpoint \
  --traineddata $bestdata_dir/$Lang.traineddata \
  --model_output "$trained_output_dir/$Lang-finetune-$Lang.traineddata"

fi

if [ "$RunEval" = "yes" ]; then

echo "###### EVAL ORIGINAL MODEL ######"

lstmeval \
  --model $bestdata_dir/$Lang.traineddata \
  --eval_listfile $eval_output_dir/$Lang.training_files.txt \
  --verbosity 0

echo "###### EVAL FINETUNED MODEL ######"

lstmeval \
  --model $trained_output_dir/$Lang-finetune-$Lang.traineddata \
  --eval_listfile $eval_output_dir/$Lang.training_files.txt \
  --verbosity 0

fi

#!/bin/bash
#####################################################
# Script to finetune a language traineddata file for tesseract4.0.0-beta
# Finetune training for adding a couple new characters (PLUS_MINUS)
# Modify directory paths and filenames as required for your setup.
#####################################################
# Choose which parts of script are to be run?
MakeData=yes
# MergeData is tested below but was never set in the original script
MergeData=yes
RunTraining=yes
RunEval=yes
#####################################################
# frk.traineddata is for German text in Fraktur/Blackletter Print
# Choose fonts which have this style of print, 'findfonts' won't work.

# Language
Lang=frk

# Other variables for training
MaxIterations=5000
DebugInterval=-1

# downloaded directory with language data
langdata_dir=~/langdata

# About 100 lines of representative training text for finetuning
# Include about 15-20 samples of the new character to be added
# for example add Rupee symbol to English traineddata
finetune_training_text=$langdata_dir/$Lang/$Lang.plus.training_text

# Make a small representative text for evaluation
eval_training_text=$langdata_dir/$Lang/$Lang.eval.training_text

# fonts directory for this system
fonts_dir=~/.fonts

# Use multiple fonts for PLUS-MINUS finetune training
fonts_for_training=" \
  'LOB.BreitkopfFraktur' \
  'Schmale_Anzeigenschrift' \
  "

# directory with the old 'best' language training set to continue from eg. ara, eng, san
bestdata_dir=~/tessdata_best

# tessdata-dir which has osd.traineddata, eng.traineddata, config and tessconfigs folder and pdf.ttf
tessdata_dir=~/tessdata

# directory with training scripts - tesstrain.sh etc.
tesstrain_dir=~/tesseract/src/training

# output directories for this run
trained_output_dir=./$Lang-finetune-plus
eval_output_dir=./$Lang-finetune-plus-eval

if [ "$MakeData" = "yes" ]; then

echo "###### MAKING EVAL DATA ######"
# recreate the eval directory (the original created the training directory here)
rm -rf $eval_output_dir
mkdir -p $eval_output_dir

echo "#### running tesstrain.sh for eval text ####"

# eval is needed so the quoted font names in $fonts_for_training stay single arguments
eval bash $tesstrain_dir/tesstrain.sh \
  --lang $Lang \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0" \
  --fonts_dir $fonts_dir \
  --fontlist $fonts_for_training \
  --langdata_dir $langdata_dir \
  --tessdata_dir $tessdata_dir \
  --training_text $eval_training_text \
  --output_dir $eval_output_dir

echo "###### MAKING TRAINING DATA ######"
rm -rf $trained_output_dir
mkdir -p $trained_output_dir

echo "#### running tesstrain.sh for training text ####"

eval bash $tesstrain_dir/tesstrain.sh \
  --lang $Lang \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0" \
  --fonts_dir $fonts_dir \
  --fontlist $fonts_for_training \
  --langdata_dir $langdata_dir \
  --tessdata_dir $tessdata_dir \
  --training_text $finetune_training_text \
  --output_dir $trained_output_dir

fi

if [ "$MergeData" = "yes" ]; then

echo "#### running combine_tessdata to extract lstm model from 'tessdata_best' for $Lang ####"
# -u unpacks all components ($Lang.lstm, $Lang.lstm-unicharset, $Lang.version, ...)
combine_tessdata -u $bestdata_dir/$Lang.traineddata $bestdata_dir/$Lang.

echo "#### build version string ####"
Version_Str="${Lang}:PLUS$(date +%Y%m%d):from:"
sed -e "s/^/$Version_Str/" $bestdata_dir/$Lang.version > $trained_output_dir/$Lang.new.version

echo "#### merge unicharsets to ensure all existing chars are included ####"
merge_unicharsets \
  $bestdata_dir/$Lang.lstm-unicharset \
  $trained_output_dir/$Lang/$Lang.unicharset \
  $trained_output_dir/$Lang.continue.unicharset

echo "#### rebuild starter traineddata ####"
combine_lang_model \
  --input_unicharset $trained_output_dir/$Lang.continue.unicharset \
  --script_dir $langdata_dir \
  --words $langdata_dir/$Lang/$Lang.wordlist \
  --numbers $langdata_dir/$Lang/$Lang.numbers \
  --puncs $langdata_dir/$Lang/$Lang.punc \
  --output_dir $trained_output_dir \
  --lang $Lang \
  --version_str "$(cat $trained_output_dir/$Lang.new.version)"

fi

if [ "$RunTraining" = "yes" ]; then

echo "###### Running LSTM TRAINING ######"
echo "###### for PLUS-MINUS finetuning from $bestdata_dir/$Lang.traineddata #####"

lstmtraining \
  --continue_from $bestdata_dir/$Lang.lstm \
  --old_traineddata $bestdata_dir/$Lang.traineddata \
  --traineddata $trained_output_dir/$Lang/$Lang.traineddata \
  --max_iterations $MaxIterations \
  --debug_interval $DebugInterval \
  --train_listfile $trained_output_dir/$Lang.training_files.txt \
  --model_output $trained_output_dir/finetune_plus

echo "###### BUILD PLUS-MINUS FINETUNED MODEL ######"
echo "###### STOP training $trained_output_dir/$Lang-PLUS.traineddata ####"

lstmtraining \
  --stop_training \
  --continue_from $trained_output_dir/finetune_plus_checkpoint \
  --old_traineddata $bestdata_dir/$Lang.traineddata \
  --traineddata $trained_output_dir/$Lang/$Lang.traineddata \
  --model_output "$trained_output_dir/$Lang-PLUS.traineddata"

fi

if [ "$RunEval" = "yes" ]; then

echo "###### EVAL ORIGINAL MODEL ######"

lstmeval \
  --model $bestdata_dir/$Lang.traineddata \
  --eval_listfile $eval_output_dir/$Lang.training_files.txt \
  --verbosity 0

echo "###### EVAL PLUS-MINUS FINETUNED MODEL ######"

lstmeval \
  --model $trained_output_dir/$Lang-PLUS.traineddata \
  --eval_listfile $eval_output_dir/$Lang.training_files.txt \
  --verbosity 0

fi
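
The comments in the plus-minus script ask for about 15-20 samples of the new character in the finetuning text. One way to check that before starting a long training run might be the sketch below. The `count_char` helper and the demo file are hypothetical, made up for illustration; they are not part of the tesseract training tools.

```shell
#!/bin/bash
# Hypothetical helper: count occurrences of a literal string (for example
# the new character) in a text file.
#   $1 = string to look for, $2 = file to search
count_char() {
    # -F treats the pattern as a literal string, -o prints one line per match,
    # so counting the output lines counts the occurrences.
    grep -o -F -- "$1" "$2" | wc -l | tr -d ' '
}

# Demo on a throwaway file; in practice the second argument would be the
# finetune training text, e.g. $langdata_dir/$Lang/$Lang.plus.training_text
demo_text=$(mktemp)
printf 'Preis 5 ± 1\n± und ±\nohne Zeichen\n' > "$demo_text"
echo "occurrences of new character: $(count_char '±' "$demo_text")"
rm -f "$demo_text"
```

If the count comes out well below 15, it is probably worth adding more lines containing the character to the training text before running the script.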

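Both scripts assume the training tools are installed along with tesseract. A small pre-flight check along the following lines could be run first; it only looks for the commands that the scripts above call by name. The `missing_tools` function is a hypothetical helper written for this example, not something shipped with tesseract.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every command in the argument list
# that cannot be found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands called directly by the two finetune scripts.
required_tools="combine_tessdata lstmtraining lstmeval merge_unicharsets combine_lang_model"

# $required_tools is left unquoted on purpose so it splits into separate names.
missing=$(missing_tools $required_tools)
if [ -n "$missing" ]; then
    # $missing is unquoted so the names are joined onto one line.
    echo "Missing training tools:" $missing >&2
fi
```

This only reports what is missing and does not stop anything; adding an `exit 1` inside the `if` would be the natural next step when pasting it at the top of either script.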
