Hi Reza,
Attached are two scripts and one log file. You will need to change the
directories in the scripts.
finetune.sh and the finetune log file show a sample fine-tuning run for eng.
By changing the language code you can run the same script for fas; use that
run as a test.
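The language switch is just a change to the two variables at the top of the script. A minimal sketch of that edit, run on a throwaway copy so nothing real is modified (the variable names match finetune.sh):

```shell
# Demo on a scratch copy of the two language variables from finetune.sh
demo=/tmp/finetune_lang_demo.sh
printf 'Lang=eng\nContinue_from_lang=eng\n' > "$demo"

# Flip both variables from eng to fas
sed -i 's/=eng$/=fas/' "$demo"

cat "$demo"   # prints: Lang=fas / Continue_from_lang=fas
```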
plus-fas.sh does the plus-minus type of fine-tuning for fas. It merges the
existing unicharset with the unicharset extracted from the training_text.
You will need to update the training_text file in langdata/fas.
Optionally, you can also review and update the wordlist, numbers, and punc files.
The scripts should run once you supply the correct directory names.
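Since wrong or missing paths are the most common failure mode, a pre-flight check like the one below can save a run. It is illustrative only: it builds a scratch tree (with tessdata deliberately absent) so the snippet is self-contained; point the loop at your real directories instead.

```shell
# Scratch tree standing in for the real layout (tessdata is deliberately absent)
demo_root=/tmp/tess4training_demo
mkdir -p "$demo_root/tessdata_best" "$demo_root/langdata/fas"

# Report any directory the scripts expect but cannot find
for d in tessdata_best tessdata langdata; do
  [ -d "$demo_root/$d" ] || echo "missing directory: $demo_root/$d"
done
```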
ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com
On Sat, May 19, 2018 at 9:24 AM, reza <[email protected]> wrote:
> Hi ShreeDevi,
>
> Thanks.
>
> I tested the 2 models that you provided. The accuracy on samples
> without noise was about 98%, but on scanned samples or captured images it
> was about 80%.
> It still didn't work on different fonts.
> Could you send all the files needed for training the models? I want to
> fine-tune the model with more fonts and diacritics.
>
> best regards
>
>
> On Friday, May 18, 2018 at 8:49:54 PM UTC+4:30, shree wrote:
>>
>> I have posted a couple of test models for Farsi at
>> https://github.com/Shreeshrii/tessdata_shreetest
>>
>> These have not been trained on text with diacritics as the normalization
>> and training process was giving error on the combining marks.
>>
>> Please give them a try and see if they provide better recognition for
>> numbers and text without combining marks.
>>
>> FYI, I do not know the Persian language so it is difficult for me to
>> gauge if results are ok or not.
>>
>> ShreeDevi
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/fe15cedc-0a2a-41fc-ac3c-b80df458a509%40googlegroups.com.
>
> For more options, visit https://groups.google.com/d/optout.
>
ubuntu@tesseract-ocr:~/tess4training$ bash -x ./tesstrain_finetune.sh
+ MakeTraining=yes
+ MakeEval=yes
+ RunTraining=yes
+ Lang=eng
+ Continue_from_lang=eng
+ bestdata_dir=../tessdata_best
+ tessdata_dir=../tessdata
+ tesstrain_dir=../tesseract/src/training
+ langdata_dir=../langdata
+ fonts_dir=../.fonts
+ fonts_for_training=' '\''FreeSerif'\'' '
+ fonts_for_eval=' '\''Arial'\'' '
+ train_output_dir=./finetune_train_eng
+ eval_output_dir=./finetune_eval_eng
+ trained_output_dir=./finetune_trained_eng-from-eng
+ '[' yes = yes ']'
+ echo '###### MAKING TRAINING DATA ######'
###### MAKING TRAINING DATA ######
+ rm -rf ./finetune_train_eng
+ mkdir ./finetune_train_eng
+ echo '#### run tesstrain.sh ####'
#### run tesstrain.sh ####
+ eval bash ../tesseract/src/training/tesstrain.sh --lang eng --linedata_only --noextract_font_properties --exposures 0 --fonts_dir ../.fonts --fontlist ''\''FreeSerif'\''' --langdata_dir ../langdata --tessdata_dir ../tessdata --training_text ../langdata/eng/eng.training_text --output_dir ./finetune_train_eng
++ bash ../tesseract/src/training/tesstrain.sh --lang eng --linedata_only --noextract_font_properties --exposures 0 --fonts_dir ../.fonts --fontlist FreeSerif --langdata_dir ../langdata --tessdata_dir ../tessdata --training_text ../langdata/eng/eng.training_text --output_dir ./finetune_train_eng
=== Starting training for language 'eng'
[Sat May 19 04:20:00 UTC 2018] /usr/local/bin/text2image --fonts_dir=../.fonts --font=FreeSerif --outputbase=/tmp/font_tmp.rSFglUi6Dq/sample_text.txt --text=/tmp/font_tmp.rSFglUi6Dq/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.rSFglUi6Dq
Rendered page 0 to file /tmp/font_tmp.rSFglUi6Dq/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using FreeSerif
[Sat May 19 04:20:02 UTC 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.rSFglUi6Dq --fonts_dir=../.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0 --max_pages=0 --ptsize=12 --font=FreeSerif --text=../langdata/eng/eng.training_text
Rendered page 0 to file /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.tif
Rendered page 1 to file /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.tif
=== Phase UP: Generating unicharset and unichar properties files ===
[Sat May 19 04:20:04 UTC 2018] /usr/local/bin/unicharset_extractor
--output_unicharset /tmp/tmp.RsxSMQxxED/eng/eng.unicharset --norm_mode 1
/tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.box
Extracting unicharset from box file
/tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.box
Other case É of é is not in unicharset
Wrote unicharset file /tmp/tmp.RsxSMQxxED/eng/eng.unicharset
[Sat May 19 04:20:04 UTC 2018] /usr/local/bin/set_unicharset_properties -U
/tmp/tmp.RsxSMQxxED/eng/eng.unicharset -O
/tmp/tmp.RsxSMQxxED/eng/eng.unicharset -X /tmp/tmp.RsxSMQxxED/eng/eng.xheights
--script_dir=../langdata
Loaded unicharset of size 111 from file /tmp/tmp.RsxSMQxxED/eng/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 25 = ~
Writing unicharset to file /tmp/tmp.RsxSMQxxED/eng/eng.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=../tessdata
[Sat May 19 04:20:04 UTC 2018] /usr/local/bin/tesseract
/tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.tif
/tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1-232-g45a6 with Leptonica
Page 1
Page 2
Loaded 49/49 pages (1-49) of document
/tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.lstmf
=== Constructing LSTM training data ===
[Sat May 19 04:20:07 UTC 2018] /usr/local/bin/combine_lang_model
--input_unicharset /tmp/tmp.RsxSMQxxED/eng/eng.unicharset --script_dir
../langdata --words ../langdata/eng/eng.wordlist --numbers
../langdata/eng/eng.numbers --puncs ../langdata/eng/eng.punc --output_dir
./finetune_train_eng --lang eng
Loaded unicharset of size 111 from file /tmp/tmp.RsxSMQxxED/eng/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Config file is optional, continuing...
Failed to read data from: ../langdata/eng/eng.config
Null char=2
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Moving /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.box to ./finetune_train_eng
Moving /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.tif to ./finetune_train_eng
Moving /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.lstmf to ./finetune_train_eng
Created starter traineddata for language 'eng'
Run lstmtraining to do the LSTM training for language 'eng'
+ echo '#### combine_tessdata to extract lstm model from '\''tessdata_best'\'' for eng ####'
#### combine_tessdata to extract lstm model from 'tessdata_best' for eng ####
+ combine_tessdata -u ../tessdata_best/eng.traineddata ../tessdata_best/eng.
Extracting tessdata components from ../tessdata_best/eng.traineddata
Wrote ../tessdata_best/eng.lstm
Wrote ../tessdata_best/eng.lstm-punc-dawg
Wrote ../tessdata_best/eng.lstm-word-dawg
Wrote ../tessdata_best/eng.lstm-number-dawg
Wrote ../tessdata_best/eng.lstm-unicharset
Wrote ../tessdata_best/eng.lstm-recoder
Wrote ../tessdata_best/eng.version
Version
string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
+ '[' yes = yes ']'
+ echo '###### MAKING EVAL DATA ######'
###### MAKING EVAL DATA ######
+ rm -rf ./finetune_eval_eng
+ mkdir ./finetune_eval_eng
+ eval bash ../tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts
--fontlist ''\''Arial'\''' --lang eng --linedata_only
--noextract_font_properties --langdata_dir ../langdata --tessdata_dir
../tessdata --training_text ../langdata/eng/eng.training_text --output_dir
./finetune_eval_eng
++ bash ../tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts --fontlist
Arial --lang eng --linedata_only --noextract_font_properties --langdata_dir
../langdata --tessdata_dir ../tessdata --training_text
../langdata/eng/eng.training_text --output_dir ./finetune_eval_eng
=== Starting training for language 'eng'
[Sat May 19 04:20:17 UTC 2018] /usr/local/bin/text2image --fonts_dir=../.fonts
--font=Arial --outputbase=/tmp/font_tmp.2U3WwAANTl/sample_text.txt
--text=/tmp/font_tmp.2U3WwAANTl/sample_text.txt
--fontconfig_tmpdir=/tmp/font_tmp.2U3WwAANTl
Rendered page 0 to file /tmp/font_tmp.2U3WwAANTl/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using Arial
[Sat May 19 04:20:19 UTC 2018] /usr/local/bin/text2image
--fontconfig_tmpdir=/tmp/font_tmp.2U3WwAANTl --fonts_dir=../.fonts
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0
--outputbase=/tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0 --max_pages=0 --ptsize=12
--font=Arial --text=../langdata/eng/eng.training_text
Rendered page 0 to file /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.tif
Rendered page 1 to file /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.tif
=== Phase UP: Generating unicharset and unichar properties files ===
[Sat May 19 04:20:21 UTC 2018] /usr/local/bin/unicharset_extractor
--output_unicharset /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset --norm_mode 1
/tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.box
Extracting unicharset from box file /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.box
Other case É of é is not in unicharset
Wrote unicharset file /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset
[Sat May 19 04:20:21 UTC 2018] /usr/local/bin/set_unicharset_properties -U
/tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset -O
/tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset -X /tmp/tmp.nOUY5Wx7C3/eng/eng.xheights
--script_dir=../langdata
Loaded unicharset of size 111 from file /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 25 = ~
Writing unicharset to file /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=../tessdata
[Sat May 19 04:20:21 UTC 2018] /usr/local/bin/tesseract
/tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.tif
/tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1-232-g45a6 with Leptonica
Page 1
Page 2
Loaded 52/52 pages (1-52) of document
/tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.lstmf
=== Constructing LSTM training data ===
[Sat May 19 04:20:24 UTC 2018] /usr/local/bin/combine_lang_model
--input_unicharset /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset --script_dir
../langdata --words ../langdata/eng/eng.wordlist --numbers
../langdata/eng/eng.numbers --puncs ../langdata/eng/eng.punc --output_dir
./finetune_eval_eng --lang eng
Loaded unicharset of size 111 from file /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Config file is optional, continuing...
Failed to read data from: ../langdata/eng/eng.config
Null char=2
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Moving /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.box to ./finetune_eval_eng
Moving /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.tif to ./finetune_eval_eng
Moving /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.lstmf to ./finetune_eval_eng
Created starter traineddata for language 'eng'
Run lstmtraining to do the LSTM training for language 'eng'
+ '[' yes = yes ']'
+ echo '#### finetune training from ../tessdata_best/eng.traineddata #####'
#### finetune training from ../tessdata_best/eng.traineddata #####
+ rm -rf ./finetune_trained_eng-from-eng
+ mkdir -p ./finetune_trained_eng-from-eng
+ lstmtraining --continue_from ../tessdata_best/eng.lstm --traineddata
../tessdata_best/eng.traineddata --max_iterations 400 --debug_interval 0
--train_listfile ./finetune_train_eng/eng.training_files.txt --model_output
./finetune_trained_eng-from-eng/finetune
Loaded file ../tessdata_best/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from ../tessdata_best/eng.lstm
Loaded 72/72 pages (1-72) of document
./finetune_train_eng/eng.FreeSerif.exp0.lstmf
2 Percent improvement time=5, best error was 100 @ 0
At iteration 5/100/100, Mean rms=0.198%, delta=0.04%, char train=0.109%, word
train=0.211%, skip ratio=0%, New best char error = 0.109 Transitioned to stage
1 wrote best model:./finetune_trained_eng-from-eng/finetune0.109_5.checkpoint
wrote checkpoint.
2 Percent improvement time=5, best error was 100 @ 0
At iteration 5/200/200, Mean rms=0.17%, delta=0.02%, char train=0.055%, word
train=0.105%, skip ratio=0%, New best char error = 0.055 wrote best
model:./finetune_trained_eng-from-eng/finetune0.055_5.checkpoint wrote
checkpoint.
2 Percent improvement time=5, best error was 100 @ 0
At iteration 5/300/300, Mean rms=0.153%, delta=0.013%, char train=0.036%, word
train=0.07%, skip ratio=0%, New best char error = 0.036 wrote best
model:./finetune_trained_eng-from-eng/finetune0.036_5.checkpoint wrote
checkpoint.
2 Percent improvement time=5, best error was 100 @ 0
At iteration 5/400/400, Mean rms=0.142%, delta=0.01%, char train=0.027%, word
train=0.053%, skip ratio=0%, New best char error = 0.027 wrote best
model:./finetune_trained_eng-from-eng/finetune0.027_5.checkpoint wrote
checkpoint.
Finished! Error rate = 0.027
+ echo '#### Building final trained file ####'
#### Building final trained file ####
+ echo '#### stop training ####'
#### stop training ####
+ lstmtraining --stop_training --continue_from
./finetune_trained_eng-from-eng/finetune_checkpoint --traineddata
../tessdata_best/eng.traineddata --model_output
./finetune_trained_eng-from-eng/eng-finetune.traineddata
Loaded file ./finetune_trained_eng-from-eng/finetune_checkpoint, unpacking...
+ echo '#### eval files with ./finetune_train_eng/finetune.traineddata ####'
#### eval files with ./finetune_train_eng/finetune.traineddata ####
+ lstmeval --verbosity 0 --model
./finetune_trained_eng-from-eng/eng-finetune.traineddata --eval_listfile
./finetune_eval_eng/eng.training_files.txt
Loaded 72/72 pages (1-72) of document ./finetune_eval_eng/eng.Arial.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=0.26994052, Word error
rate=0.5713608
ubuntu@tesseract-ocr:~/tess4training$
#!/bin/bash
# original script by J Klein <[email protected]> - https://pastebin.com/gNLvXkiM
################################################################
# variables to set tasks performed
MakeTraining=yes
MakeEval=yes
RunTraining=yes
################################################################
# Language
Lang=eng
Continue_from_lang=eng
# directory with the old 'best' script training set to continue from, e.g. Arabic, Latin, Devanagari
#bestdata_dir=../tessdata_best/script
# directory with the old 'best' language training set to continue from, e.g. ara, eng, san
bestdata_dir=../tessdata_best
# tessdata dir which has osd.traineddata, eng.traineddata, the config and tessconfigs folders, and pdf.ttf
tessdata_dir=../tessdata
# directory with training scripts - tesstrain.sh etc.
tesstrain_dir=../tesseract/src/training
# downloaded directory with language data -
langdata_dir=../langdata
# fonts directory for this system
fonts_dir=../.fonts
# fonts to use for training - a minimal set for testing
fonts_for_training=" \
'FreeSerif' \
"
# fonts for computing evals of best fit model
fonts_for_eval=" \
'Arial' \
"
# output directories for this run
train_output_dir=./finetune_train_$Continue_from_lang
eval_output_dir=./finetune_eval_$Continue_from_lang
trained_output_dir=./finetune_trained_$Lang-from-$Continue_from_lang
# fatal bug workaround for pango
#export PANGOCAIRO_BACKEND=fc
if [ $MakeTraining = "yes" ]; then
echo "###### MAKING TRAINING DATA ######"
rm -rf $train_output_dir
mkdir $train_output_dir
echo "#### run tesstrain.sh ####"
# the eval handles the quotes in the font list
eval bash $tesstrain_dir/tesstrain.sh \
--lang $Lang \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fonts_dir $fonts_dir \
--fontlist $fonts_for_training \
--langdata_dir $langdata_dir \
--tessdata_dir $tessdata_dir \
--training_text $langdata_dir/$Lang/$Lang.training_text \
--output_dir $train_output_dir
echo "#### combine_tessdata to extract lstm model from 'tessdata_best' for $Continue_from_lang ####"
combine_tessdata -u $bestdata_dir/$Continue_from_lang.traineddata \
$bestdata_dir/$Continue_from_lang.
fi
# at this point, $train_output_dir should have $Lang.FontX.exp0.lstmf
# and $Lang.training_files.txt
# eval data
if [ $MakeEval = "yes" ]; then
echo "###### MAKING EVAL DATA ######"
rm -rf $eval_output_dir
mkdir $eval_output_dir
eval bash $tesstrain_dir/tesstrain.sh \
--fonts_dir $fonts_dir \
--fontlist $fonts_for_eval \
--lang $Lang \
--linedata_only \
--noextract_font_properties \
--langdata_dir $langdata_dir \
--tessdata_dir $tessdata_dir \
--training_text $langdata_dir/$Lang/$Lang.training_text \
--output_dir $eval_output_dir
fi
# at this point, $eval_output_dir should have similar files as
# $train_output_dir but for different font set
if [ $RunTraining = "yes" ]; then
echo "#### finetune training from $bestdata_dir/$Continue_from_lang.traineddata #####"
rm -rf $trained_output_dir
mkdir -p $trained_output_dir
lstmtraining \
--continue_from $bestdata_dir/$Continue_from_lang.lstm \
--traineddata $bestdata_dir/$Continue_from_lang.traineddata \
--max_iterations 400 \
--debug_interval 0 \
--train_listfile $train_output_dir/$Lang.training_files.txt \
--model_output $trained_output_dir/finetune
echo "#### Building final trained file $trained_output_dir/$Lang-finetune.traineddata ####"
echo "#### stop training ####"
lstmtraining \
--stop_training \
--continue_from $trained_output_dir/finetune_checkpoint \
--traineddata $bestdata_dir/$Continue_from_lang.traineddata \
--model_output $trained_output_dir/$Lang-finetune.traineddata
echo "#### eval files with $trained_output_dir/$Lang-finetune.traineddata ####"
lstmeval \
--verbosity 0 \
--model $trained_output_dir/$Lang-finetune.traineddata \
--eval_listfile $eval_output_dir/$Lang.training_files.txt
fi
# now $trained_output_dir/$Lang-finetune.traineddata can be copied over the installed traineddata
#!/bin/bash
# based on bash-script by J Klein <[email protected]> - https://pastebin.com/gNLvXkiM
################################################################
# variables to set tasks performed
MakeTraining=yes
MakeEval=yes
RunTraining=yes
################################################################
# Language
Lang=fas
Continue_from_lang=fas
# directory with the old 'best' training set
#bestdata_dir=../tessdata_best/script
bestdata_dir=../tessdata_best
# tessdata directory for config files
tessdata_dir=../tessdata
# directory with training scripts - tesstrain.sh etc.
# this is not the usual place, because the training tools are not installed by default
tesstrain_dir=../tesseract/src/training
# downloaded directory with language data -
langdata_dir=../langdata
# fonts directory for this system
fonts_dir=../.fonts
# fonts to use for training - a minimal set for fast tests
fonts_for_training=" \
'Iranian Sans' \
'Sahel' \
'IranNastaliq-Web' \
'Nesf2' \
'B Koodak Bold' \
'B Lotus' \
'B Lotus Bold' \
'B Nazanin' \
'B Nazanin Bold' \
'B Titr Bold' \
'B Yagut' \
'B Yagut Bold' \
'B Yekan' \
'B Zar' \
'B Zar Bold' \
'Arial Unicode MS' \
'Tahoma' \
"
# fonts for computing evals of best fit model
fonts_for_eval=" \
'B Nazanin' \
'B Yagut' \
'B Zar' \
"
# output directories for this run
train_output_dir=./plus_train_$Lang
eval_output_dir=./plus_eval_$Lang
trained_output_dir=./plus_trained_$Lang-from-$Continue_from_lang
# fatal bug workaround for pango
#export PANGOCAIRO_BACKEND=fc
if [ $MakeTraining = "yes" ]; then
echo "###### MAKING TRAINING DATA ######"
rm -rf $train_output_dir
mkdir $train_output_dir
echo "#### run tesstrain.sh ####"
# the eval handles the quotes in the font list
eval bash $tesstrain_dir/tesstrain.sh \
--lang $Lang \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fonts_dir $fonts_dir \
--fontlist $fonts_for_training \
--langdata_dir $langdata_dir \
--tessdata_dir $tessdata_dir \
--training_text $langdata_dir/$Lang/$Lang.training_text \
--output_dir $train_output_dir
echo "#### combine_tessdata to extract lstm model from 'tessdata_best' for $Continue_from_lang ####"
combine_tessdata -u $bestdata_dir/$Continue_from_lang.traineddata \
$bestdata_dir/$Continue_from_lang.
combine_tessdata -u $tessdata_dir/$Lang.traineddata $tessdata_dir/$Lang.
echo "#### build version string ####"
Version_Str="$Lang:plus`date +%Y%m%d`:from:"
sed -e "s/^/$Version_Str/" $bestdata_dir/$Continue_from_lang.version > $train_output_dir/$Lang.new.version
echo "#### merge unicharsets to ensure all existing chars are included ####"
merge_unicharsets \
$bestdata_dir/$Continue_from_lang.lstm-unicharset \
$train_output_dir/$Lang/$Lang.unicharset \
$train_output_dir/$Lang.merged.unicharset
fi
# at this point, $train_output_dir should have $Lang.FontX.exp0.lstmf
# and $Lang.training_files.txt
# eval data
if [ $MakeEval = "yes" ]; then
echo "###### MAKING EVAL DATA ######"
rm -rf $eval_output_dir
mkdir $eval_output_dir
eval bash $tesstrain_dir/tesstrain.sh \
--fonts_dir $fonts_dir \
--fontlist $fonts_for_eval \
--lang $Lang \
--linedata_only \
--noextract_font_properties \
--langdata_dir $langdata_dir \
--tessdata_dir $tessdata_dir \
--training_text $langdata_dir/$Lang/$Lang.training_text \
--output_dir $eval_output_dir
fi
# at this point, $eval_output_dir should have similar files as
# $train_output_dir but for different font set
if [ $RunTraining = "yes" ]; then
echo "#### rebuild starter traineddata ####"
#change these flags based on language
# --lang_is_rtl True \
# --pass_through_recoder True \
#
combine_lang_model \
--input_unicharset $train_output_dir/$Lang/$Lang.merged.unicharset \
--script_dir $langdata_dir \
--words $langdata_dir/$Lang/$Lang.wordlist \
--numbers $langdata_dir/$Lang/$Lang.numbers \
--puncs $langdata_dir/$Lang/$Lang.punc \
--output_dir $train_output_dir \
--pass_through_recoder \
--lang_is_rtl \
--lang $Lang \
--version_str "$(cat $train_output_dir/$Lang.new.version)"
echo "#### SHREE plus training from $bestdata_dir/$Continue_from_lang.traineddata #####"
rm -rf $trained_output_dir
mkdir -p $trained_output_dir
lstmtraining \
--continue_from $bestdata_dir/$Continue_from_lang.lstm \
--old_traineddata $bestdata_dir/$Continue_from_lang.traineddata \
--traineddata $train_output_dir/$Lang/$Lang.traineddata \
--max_iterations 7000 \
--debug_interval 0 \
--train_listfile $train_output_dir/$Lang.training_files.txt \
--model_output $trained_output_dir/plus
echo "#### Building final trained file $trained_output_dir/$Lang-plus-float.traineddata ####"
echo "#### stop training ####"
lstmtraining \
--stop_training \
--continue_from $trained_output_dir/plus_checkpoint \
--old_traineddata $bestdata_dir/$Continue_from_lang.traineddata \
--traineddata $train_output_dir/$Lang/$Lang.traineddata \
--model_output $trained_output_dir/$Lang-plus-float.traineddata
cp $trained_output_dir/$Lang-plus-float.traineddata ../tessdata_best/
echo -e "\n #### eval files with $trained_output_dir/$Lang-plus-float.traineddata ####"
lstmeval \
--verbosity 0 \
--model $trained_output_dir/$Lang-plus-float.traineddata \
--eval_listfile $eval_output_dir/$Lang.training_files.txt
fi
# now $trained_output_dir/$Lang-plus-float.traineddata can be copied over the installed traineddata