1. If you use tesstrain.sh, it creates the starter traineddata for you; you do NOT need to run combine_lang_data separately. If you want to change the version string, look at tesstrain_utils.sh and modify the combine_lang_data command in it.
2. If you are always getting a file of the same size, you are probably copying some old file as the traineddata as part of your script - copying from the wrong folder or some such thing. I am attaching a bash script; you can modify it for your setup and see if that helps.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jan 9, 2018 at 9:39 AM, <[email protected]> wrote:

> Yes, I did the following command in the tesseract/training directory:
>
> lstmtraining --stop_training \
>   --continue_from ../result/mylangoutput/base_checkpoint \
>   --traineddata ../result/mylangcombine/mylang/mylang.traineddata \
>   --model_output ../result/mylangoutput/mylang.traineddata
>
> On Monday, January 8, 2018 at 7:36:50 PM UTC+7, shree wrote:
>>
>> Did you use the --stop_training flag at the end?
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Jan 8, 2018 at 5:51 PM, <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I am working on a project using Tesseract v4.00, and I always get a
>>> traineddata output of the same size after training with my own data.
>>> I suppose that I did not do the steps correctly.
>>>
>>> The only data that I provided were:
>>> 1. training_text
>>> 2. puncs (I just reduced the general punc file provided in the
>>> tesseract GitHub repo)
>>> 3. numbers
>>> 4. wordlists (I made various wordlists for several training runs,
>>> ranging between 100,000 and 2,000,000 words)
>>> 5. font names (I also used various fonts for several training runs,
>>> ranging between 1 and 20 fonts)
>>>
>>> The steps that I did were:
>>> 1. Made the tiff files, unicharset and other supporting data using
>>> tesstrain.sh
>>> 2. Made the tiff files, unicharset and other supporting data using
>>> tesstrain.sh for evaluation
>>> 3. Combined unicharset, wordlists, puncs, numbers and version_str to
>>> create the starter traineddata using combine_lang_data (I am still not
>>> confident about the value of version_str though)
>>> 4. Trained using lstmtraining
>>> 5. Combined all output files using lstmtraining --continue_from ...
>>>
>>> Yet, all of my training runs ended with the same size, which is 10.5 MB.
>>> Did I do all my steps correctly?
>>>
>>> Once, I also trained after modifying WORD_DAWG_FACTOR in
>>> language-specific.sh to 0 and 1, because I want the recognized text to
>>> match 100% with my wordlists. But that result also did not satisfy me;
>>> some output words, such as "USISUSISU", are not in my wordlists.
>>> Do you know what the cause is?
>>>
>>> I would really appreciate it if anyone can help or suggest a solution.
>>> Thank you!!
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/b6ca74b2-1e50-44cb-93f6-586fcd26cec5%40googlegroups.com
>>> For more options, visit https://groups.google.com/d/optout.
>>
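A quick check worth doing before digging into the script: two training runs can legitimately produce traineddata files of identical size (the network architecture accounts for most of the file's bulk), so compare file contents rather than sizes. A minimal sketch, with hypothetical paths:

#!/bin/sh
# Two traineddata files of the same size may still differ in content;
# compare bytes instead of sizes. Paths in the usage sketch are
# hypothetical examples - adjust them to your output directories.
same_model() {
    # succeeds (exit 0) if the two files are byte-identical
    cmp -s "$1" "$2"
}

# Usage sketch:
# if same_model run1/mylang.traineddata run2/mylang.traineddata; then
#     echo "identical files - check the copy steps in your script"
# else
#     echo "the model actually changed between runs"
# fi

If the files are byte-identical across runs, the problem is in the copy steps of the script rather than in the training itself.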
#!/bin/bash
# original script by J Klein <[email protected]> - https://pastebin.com/gNLvXkiM
# based on https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

# Language
Lang=eng

# Number of iterations
MaxIterations=3000

# directory with training scripts - this is not the usual place
# because they are not installed by default
tesstrain_dir=./tesseract-training

# directory with the old 'best' training set
tessdata_dir=./tessdata_best

# downloaded directory with language data -
# IMPORTANT - ADD THE NEW CHARS TO langdata/$Lang/$Lang.training_text with
# about 15 instances per char
langdata_dir=./langdata

# fonts directory for this system
fonts_dir=/mnt/c/Windows/Fonts

# fonts to use for training - a minimal set for fast tests
fonts_for_training="'Arial' \
    'Arial Italic' \
    'Arial Unicode MS' \
    'Times New Roman,' \
    'Times New Roman, Italic'"

# fonts for computing evals of best fit model
fonts_for_eval="FreeSerif"

# output directories for this run
train_output_dir=./trained_plus_chars
eval_output_dir=./eval_plus_chars

# the output trained data file to drop into tesseract
final_trained_data_file=$train_output_dir/${Lang}_NEW.traineddata

# fatal bug workaround for pango
#export PANGOCAIRO_BACKEND=fc

################################################################
# variables to set tasks performed
MakeTraining=yes
MakeEval=yes
MakeLSTM=yes
RunTraining=yes
BuildFinalTrainedFile=yes
################################################################

if [ $MakeTraining = "yes" ]; then
    echo "###### MAKING TRAINING DATA ######"
    rm -rf $train_output_dir
    mkdir $train_output_dir
    # the EVAL handles the quotes in the font list
    eval $tesstrain_dir/tesstrain.sh \
        --fonts_dir $fonts_dir \
        --fontlist $fonts_for_training \
        --lang $Lang \
        --linedata_only \
        --noextract_font_properties \
        --exposures "0" \
        --langdata_dir $langdata_dir \
        --tessdata_dir $tessdata_dir \
        --output_dir $train_output_dir
fi
# at this point, $train_output_dir should have $Lang.FontX.exp0.lstmf
# and $Lang.training_files.txt

# eval data
if [ $MakeEval = "yes" ]; then
    echo "###### MAKING EVAL DATA ######"
    rm -rf $eval_output_dir
    mkdir $eval_output_dir
    eval $tesstrain_dir/tesstrain.sh \
        --fonts_dir $fonts_dir \
        --fontlist $fonts_for_eval \
        --lang $Lang \
        --linedata_only \
        --noextract_font_properties \
        --langdata_dir $langdata_dir \
        --tessdata_dir $tessdata_dir \
        --output_dir $eval_output_dir
fi
# at this point, $eval_output_dir should have similar files as
# $train_output_dir but for a different font set

if [ $MakeLSTM = "yes" ]; then
    echo "#### combine_tessdata to extract lstm model from previous trained set ####"
    combine_tessdata \
        -e $tessdata_dir/$Lang.traineddata \
        $train_output_dir/$Lang.lstm
fi
# at this point, we should have $train_output_dir/$Lang.lstm

if [ $RunTraining = "yes" ]; then
    echo "#### training from previous optimum #####"
    lstmtraining \
        --model_output $train_output_dir/pluschars \
        --continue_from $train_output_dir/$Lang.lstm \
        --old_traineddata $tessdata_dir/$Lang.traineddata \
        --traineddata $train_output_dir/$Lang/$Lang.traineddata \
        --max_iterations $MaxIterations \
        --debug_interval -1 \
        --eval_listfile $eval_output_dir/$Lang.training_files.txt \
        --train_listfile $train_output_dir/$Lang.training_files.txt
fi

if [ $BuildFinalTrainedFile = "yes" ]; then
    echo "#### Building final trained file $final_trained_data_file ####"
    lstmtraining \
        --stop_training \
        --continue_from $train_output_dir/pluschars_checkpoint \
        --traineddata $train_output_dir/$Lang/$Lang.traineddata \
        --model_output $final_trained_data_file
fi
# now $final_trained_data_file is substituted for the installed one

##################### added by shree for testing the new traineddata
cp $train_output_dir/${Lang}_NEW.traineddata $tessdata_dir/${Lang}_NEW.traineddata

# now run OCR on each test image and compare output from $Lang and ${Lang}_NEW
img_files=$(ls ./testimage*.png)
for img_file in ${img_files}; do
    echo "****************************" ${img_file} "**********************************"
    time tesseract --tessdata-dir $tessdata_dir ${img_file} ${img_file%.*}-$Lang --oem 1 --psm 6 -l $Lang
    time tesseract --tessdata-dir $tessdata_dir ${img_file} ${img_file%.*}-${Lang}_NEW --oem 1 --psm 6 -l ${Lang}_NEW
done
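After the test loop runs, each test image has two OCR text outputs, one per model. A small helper to diff them side by side (a sketch; the -eng / -eng_NEW file-name pattern follows the naming used in the loop above):

#!/bin/sh
# Compare the OCR output of the baseline and the fine-tuned model.
# Assumes text files named like testimage1-eng.txt and
# testimage1-eng_NEW.txt, as produced by the test loop above.
compare_ocr_outputs() {
    old_suffix="-$1.txt"
    for old in ./testimage*"$old_suffix"; do
        new="${old%"$old_suffix"}-$2.txt"
        [ -f "$old" ] && [ -f "$new" ] || continue
        echo "=== $old vs $new ==="
        diff -u "$old" "$new" || true
    done
}

# Usage sketch:
# compare_ocr_outputs eng eng_NEW

If the two outputs are identical on every test image, either the fine-tuning had no visible effect on those images or the _NEW traineddata was never actually rebuilt.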

