I have to train Tesseract on images of a few symbols such as `?` and `<`.
Following the [docs][1] for 4.0, I tested just this step:
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
which actually runs the following steps:
/usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.cGLxwSj3wP \
  --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 \
  --char_spacing=0.0 --exposure=0 \
  --outputbase=/tmp/eng-2019-01-12.dy8/eng.FreeMono.exp0 --max_pages=0 \
  --font=FreeMono --text=../langdata/eng/eng.training_text

/usr/local/bin/unicharset_extractor \
  --output_unicharset /tmp/eng-2019-01-12.dy8/eng.unicharset --norm_mode 1 \
  /tmp/eng-2019-01-12.dy8/eng.FreeMono.exp0.box

/usr/local/bin/set_unicharset_properties \
  -U /tmp/eng-2019-01-12.dy8/eng.unicharset \
  -O /tmp/eng-2019-01-12.dy8/eng.unicharset \
  -X /tmp/eng-2019-01-12.dy8/eng.xheights --script_dir=../langdata

/usr/local/bin/tesseract /tmp/eng-2019-01-12.dy8/eng.FreeMono.exp0.tif \
  /tmp/eng-2019-01-12.dy8/eng.FreeMono.exp0 --psm 6 lstm.train

/usr/local/bin/combine_lang_model \
  --input_unicharset /tmp/eng-2019-01-12.dy8/eng.unicharset \
  --script_dir ../langdata --words ../langdata/eng/eng.wordlist \
  --numbers ../langdata/eng/eng.numbers --puncs ../langdata/eng/eng.punc \
  --output_dir /home/faizan/tesstutorial/engtrain --lang eng
In my case I already have the tif images and can create the box files with
any GUI tool, so I would run these steps individually, starting from step 2.
Is that all? That is, do I only have to move the resulting `eng.traineddata`
file to the `tessdata` folder, or are there more steps to follow, like this
one?
training/lstmtraining --debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
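
To make the question about the manual steps concrete, here is a rough dry-run sketch of what I have in mind for my own images. It only prints the `unicharset_extractor` and `tesseract ... lstm.train` commands from the log above for each tif that has a box file next to it; it does not run anything. The function name `print_manual_steps`, the directory layout, and writing `eng.unicharset` into the same directory are my own placeholders, and I am assuming that the box file must share the tif's base name and that the paths contain no spaces.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the per-image commands for
# hand-made .tif/.box pairs. The flags are copied from the tesstrain.sh
# log above; the paths and file layout are assumptions.

print_manual_steps() {
    data_dir=$1
    boxes=""

    # Collect every .box file that sits next to a .tif of the same base name.
    for tif in "$data_dir"/*.tif; do
        [ -e "$tif" ] || continue          # the glob matched nothing
        base=${tif%.tif}
        if [ -e "$base.box" ]; then
            boxes="$boxes $base.box"
        else
            echo "# skipped $tif (no $base.box)"
        fi
    done
    [ -n "$boxes" ] || return 1            # nothing to train on

    # Step 2 of the log: one unicharset built from all box files.
    echo "unicharset_extractor --output_unicharset $data_dir/eng.unicharset --norm_mode 1$boxes"

    # Step 4 of the log: one lstm.train run per image.
    for box in $boxes; do
        base=${box%.box}
        echo "tesseract $base.tif $base --psm 6 lstm.train"
    done
}
```

I would call it as `print_manual_steps ~/tesstutorial/mydata` (a hypothetical directory) and check the printed commands before running them. The `set_unicharset_properties` and `combine_lang_model` steps are not per-image, so I would run them once, unchanged apart from the paths.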