1. If you use tesstrain.sh, it creates the starter traineddata for you; you do NOT need to run combine_lang_data separately. If you want to change the version string, look at tesstrain_utils.sh and modify the combine_lang_data command in it.
2. If you are always getting a file of the same size, you are probably copying some old file as the traineddata as part of your script - copying from the wrong folder or some such thing. I am attaching a bash script; you can modify it for your setup and see if that helps.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jan 9, 2018 at 9:39 AM, <[email protected]> wrote:

> Yes, I did the following command in the tesseract/training directory:
>
> lstmtraining --stop_training \
>   --continue_from ../result/mylangoutput/base_checkpoint \
>   --traineddata ../result/mylangcombine/mylang/mylang.traineddata \
>   --model_output ../result/mylangoutput/mylang.traineddata
>
> On Monday, January 8, 2018 at 7:36:50 PM UTC+7, shree wrote:
>>
>> Did you use the --stop_training flag at the end?
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Jan 8, 2018 at 5:51 PM, <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I am working on a project using Tesseract v4.00, and I always get a
>>> traineddata output of the same size after training with my own data.
>>> I suppose that I did not do the steps correctly.
>>>
>>> The only data that I provided were:
>>> 1. training_text
>>> 2. puncs (I just reduced the general punc file provided in the
>>> tesseract GitHub repo)
>>> 3. numbers
>>> 4. wordlists (I made various wordlists for several training runs,
>>> ranging between 100,000 and 2,000,000 words)
>>> 5. font names (I also used various fonts for several training runs,
>>> ranging between 1 and 20 fonts)
>>>
>>> The steps that I did were:
>>> 1. Made the tiff files, unicharset and other supporting data using
>>> tesstrain.sh
>>> 2. Made the tiff files, unicharset and other supporting data using
>>> tesstrain.sh for evaluation
>>> 3. Combined unicharset, wordlists, puncs, numbers and version_str to
>>> create the starter traineddata using combine_lang_data (I am still not
>>> confident about the value of version_str though)
>>> 4. Trained using lstmtraining
>>> 5. Combined all output files using lstmtraining --continue_from ...
>>>
>>> Yet, all of my training runs ended with the same size, which is 10.5 MB.
>>> Did I do all my steps correctly?
>>>
>>> Once, I also trained after modifying WORD_DAWG_FACTOR in
>>> language-specific.sh to 0 and 1, because I want the recognized text to
>>> match 100% with my wordlists. But that result also did not satisfy me;
>>> some output words, such as "USISUSISU", are not in my wordlists.
>>> Do you know what the cause is?
>>>
>>> I would really appreciate it if anyone can help or suggest a solution.
>>> Thank you!!
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/b6ca74b2-1e50-44cb-93f6-586fcd26cec5%40googlegroups.com
>>> For more options, visit https://groups.google.com/d/optout.
>>
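A quick check worth doing before digging into the script: two training runs can legitimately produce traineddata files of identical size (the network architecture accounts for most of the file's bulk), so compare file contents rather than sizes. A minimal sketch, with hypothetical paths:

#!/bin/sh
# Two traineddata files of the same size may still differ in content;
# compare bytes instead of sizes. Paths in the usage sketch are
# hypothetical examples - adjust them to your output directories.
same_model() {
    # succeeds (exit 0) if the two files are byte-identical
    cmp -s "$1" "$2"
}

# Usage sketch:
# if same_model run1/mylang.traineddata run2/mylang.traineddata; then
#     echo "identical files - check the copy steps in your script"
# else
#     echo "the model actually changed between runs"
# fi

If the files are byte-identical across runs, the problem is in the copy steps of the script rather than in the training itself.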
#!/bin/bash
# original script by J Klein <[email protected]> - https://pastebin.com/gNLvXkiM
# based on https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

# Language
Lang=eng

# Number of iterations
MaxIterations=3000

# directory with training scripts - this is not the usual place
# because they are not installed by default
tesstrain_dir=./tesseract-training

# directory with the old 'best' training set
tessdata_dir=./tessdata_best

# downloaded directory with language data -
# IMPORTANT - ADD THE NEW CHARS TO langdata/$Lang/$Lang.training_text with
# about 15 instances per char
langdata_dir=./langdata

# fonts directory for this system
fonts_dir=/mnt/c/Windows/Fonts

# fonts to use for training - a minimal set for fast tests
fonts_for_training="'Arial' \
    'Arial Italic' \
    'Arial Unicode MS' \
    'Times New Roman,' \
    'Times New Roman, Italic'"

# fonts for computing evals of best fit model
fonts_for_eval="FreeSerif"

# output directories for this run
train_output_dir=./trained_plus_chars
eval_output_dir=./eval_plus_chars

# the output trained data file to drop into tesseract
final_trained_data_file=$train_output_dir/${Lang}_NEW.traineddata

# fatal bug workaround for pango
#export PANGOCAIRO_BACKEND=fc

################################################################
# variables to set tasks performed
MakeTraining=yes
MakeEval=yes
MakeLSTM=yes
RunTraining=yes
BuildFinalTrainedFile=yes
################################################################

if [ $MakeTraining = "yes" ]; then
    echo "###### MAKING TRAINING DATA ######"
    rm -rf $train_output_dir
    mkdir $train_output_dir
    # the EVAL handles the quotes in the font list
    eval $tesstrain_dir/tesstrain.sh \
        --fonts_dir $fonts_dir \
        --fontlist $fonts_for_training \
        --lang $Lang \
        --linedata_only \
        --noextract_font_properties \
        --exposures "0" \
        --langdata_dir $langdata_dir \
        --tessdata_dir $tessdata_dir \
        --output_dir $train_output_dir
fi
# at this point, $train_output_dir should have $Lang.FontX.exp0.lstmf
# and $Lang.training_files.txt

# eval data
if [ $MakeEval = "yes" ]; then
    echo "###### MAKING EVAL DATA ######"
    rm -rf $eval_output_dir
    mkdir $eval_output_dir
    eval $tesstrain_dir/tesstrain.sh \
        --fonts_dir $fonts_dir \
        --fontlist $fonts_for_eval \
        --lang $Lang \
        --linedata_only \
        --noextract_font_properties \
        --langdata_dir $langdata_dir \
        --tessdata_dir $tessdata_dir \
        --output_dir $eval_output_dir
fi
# at this point, $eval_output_dir should have similar files as
# $train_output_dir but for a different font set

if [ $MakeLSTM = "yes" ]; then
    echo "#### combine_tessdata to extract lstm model from previous trained set ####"
    combine_tessdata \
        -e $tessdata_dir/$Lang.traineddata \
        $train_output_dir/$Lang.lstm
fi
# at this point, we should have $train_output_dir/$Lang.lstm

if [ $RunTraining = "yes" ]; then
    echo "#### training from previous optimum #####"
    lstmtraining \
        --model_output $train_output_dir/pluschars \
        --continue_from $train_output_dir/$Lang.lstm \
        --old_traineddata $tessdata_dir/$Lang.traineddata \
        --traineddata $train_output_dir/$Lang/$Lang.traineddata \
        --max_iterations $MaxIterations \
        --debug_interval -1 \
        --eval_listfile $eval_output_dir/$Lang.training_files.txt \
        --train_listfile $train_output_dir/$Lang.training_files.txt
fi

if [ $BuildFinalTrainedFile = "yes" ]; then
    echo "#### Building final trained file $final_trained_data_file ####"
    lstmtraining \
        --stop_training \
        --continue_from $train_output_dir/pluschars_checkpoint \
        --traineddata $train_output_dir/$Lang/$Lang.traineddata \
        --model_output $final_trained_data_file
fi
# now $final_trained_data_file is substituted for the installed one

##################### added by shree for testing the new traineddata
cp $train_output_dir/${Lang}_NEW.traineddata $tessdata_dir/${Lang}_NEW.traineddata

# now run OCR on each test image and compare output from $Lang and ${Lang}_NEW
img_files=$(ls ./testimage*.png)
for img_file in ${img_files}; do
    echo "****************************" ${img_file} "**********************************"
    time tesseract --tessdata-dir $tessdata_dir ${img_file} ${img_file%.*}-$Lang --oem 1 --psm 6 -l $Lang
    time tesseract --tessdata-dir $tessdata_dir ${img_file} ${img_file%.*}-${Lang}_NEW --oem 1 --psm 6 -l ${Lang}_NEW
done
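After the test loop runs, each test image has two OCR text outputs, one per model. A small helper to diff them side by side (a sketch; the -eng / -eng_NEW file-name pattern follows the naming used in the loop above):

#!/bin/sh
# Compare the OCR output of the baseline and the fine-tuned model.
# Assumes text files named like testimage1-eng.txt and
# testimage1-eng_NEW.txt, as produced by the test loop above.
compare_ocr_outputs() {
    old_suffix="-$1.txt"
    for old in ./testimage*"$old_suffix"; do
        new="${old%"$old_suffix"}-$2.txt"
        [ -f "$old" ] && [ -f "$new" ] || continue
        echo "=== $old vs $new ==="
        diff -u "$old" "$new" || true
    done
}

# Usage sketch:
# compare_ocr_outputs eng eng_NEW

If the two outputs are identical on every test image, either the fine-tuning had no visible effect on those images or the _NEW traineddata was never actually rebuilt.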

