> Then I tried to create a starter traineddata file using combine_lang_model script. I used the below command for that,
When you run tesstrain.sh, it creates the starter traineddata using the combine_lang_model script. See below for the messages from a small test run.

+ /home/ubuntu/tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts --lang sin --linedata_only --noextract_font_properties --langdata_dir ../langdata_lstm --tessdata_dir ../tessdata_best --fontlist FreeSerif --training_text ../langdata_lstm/sin/sin.training_text --workspace_dir /home/ubuntu/tmp/ --save_box_tiff --maxpages 1 --output_dir ../tesstutorial/sintest

=== Starting training for language 'sin'
[Tue Sep 4 03:21:08 UTC 2018] /home/ubuntu/tesseract/src/training/text2image --fonts_dir=../.fonts --font=FreeSerif --outputbase=/home/ubuntu/tmp//fc-cache/sample_text.txt --text=/home/ubuntu/tmp//fc-cache/sample_text.txt --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache
Rendered page 0 to file /home/ubuntu/tmp//fc-cache/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using FreeSerif
[Tue Sep 4 03:21:10 UTC 2018] /home/ubuntu/tesseract/src/training/text2image --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache --fonts_dir=../.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --max_pages=1 --font=FreeSerif --text=../langdata_lstm/sin/sin.training_text
Stripped 1 unrenderable words
Rendered page 0 to file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Tue Sep 4 03:21:11 UTC 2018] /home/ubuntu/tesseract/src/training/unicharset_extractor --output_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --norm_mode 2 /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
Extracting unicharset from box file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
Wrote unicharset file /tmp/sin-2018-09-04.Wa5/sin.unicharset
[Tue Sep 4 03:21:11 UTC 2018] /home/ubuntu/tesseract/src/training/set_unicharset_properties -U /tmp/sin-2018-09-04.Wa5/sin.unicharset -O /tmp/sin-2018-09-04.Wa5/sin.unicharset -X /tmp/sin-2018-09-04.Wa5/sin.xheights --script_dir=../langdata_lstm
Loaded unicharset of size 111 from file /tmp/sin-2018-09-04.Wa5/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 7 = ි
Warning: properties incomplete for index 9 = ු
Warning: properties incomplete for index 17 = ්
Warning: properties incomplete for index 19 = ී
Warning: properties incomplete for index 38 = ්ර
Warning: properties incomplete for index 66 = ₹
Warning: properties incomplete for index 73 = ූ
Warning: properties incomplete for index 79 = ්ය
Warning: properties incomplete for index 89 = ක්
Writing unicharset to file /tmp/sin-2018-09-04.Wa5/sin.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=../tessdata_best
[Tue Sep 4 03:21:12 UTC 2018] /home/ubuntu/tesseract/src/api/tesseract /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.4-93-ge4b9c with Leptonica
Page 1

=== Constructing LSTM training data ===
[Tue Sep 4 03:21:13 UTC 2018] /home/ubuntu/tesseract/src/training/combine_lang_model --input_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --script_dir ../langdata_lstm --words ../langdata_lstm/sin/sin.wordlist --numbers ../langdata_lstm/sin/sin.numbers --puncs ../langdata_lstm/sin/sin.punc --output_dir ../tesstutorial/sintest --lang sin --pass_through_recoder
Loaded unicharset of size 111 from file /tmp/sin-2018-09-04.Wa5/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 7 = ි
Warning: properties incomplete for index 9 = ු
Warning: properties incomplete for index 17 = ්
Warning: properties incomplete for index 19 = ී
Warning: properties incomplete for index 38 = ්ර
Warning: properties incomplete for index 66 = ₹
Warning: properties incomplete for index 73 = ූ
Warning: properties incomplete for index 79 = ්ය
Warning: properties incomplete for index 89 = ක්
Config file is optional, continuing...
Failed to read data from: ../langdata_lstm/sin/sin.config
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg

=== Saving box/tiff pairs for training data ===
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box to ../tesstutorial/sintest
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif to ../tesstutorial/sintest

=== Moving lstmf files for training data ===
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.lstmf to ../tesstutorial/sintest

Created starter traineddata for language 'sin'

Run lstmtraining to do the LSTM training for language 'sin'

real 0m5.238s
user 0m3.792s
sys 0m0.256s

On Tue, Sep 4, 2018 at 2:49 AM, Shandigutt <[email protected]> wrote:

> Adding more details to my query,
>
> *My tesseract version:*
> tesseract 4.0.0-beta.4-74-gd8237
> leptonica-1.77.0
> libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
> Found SSE
>
> *My OS details,*
> tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description: Ubuntu 18.04.1 LTS
> Release: 18.04
> Codename: bionic
>
> Thanks
>
> On Tuesday, September 4, 2018 at 12:11:50 AM UTC+3, Shandigutt wrote:
>>
>> Hi,
>>
>> I'm currently in the process of training Tesseract for new language. I'm
>> currently following Tesseract wiki training guidelines
>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>
>> Once I build Tesseract from source and installed, I first created my own
>> langdata set.
>>
>> Then I created training data and eval data using tesstrain.sh script.
>>
>> Then I tried to create a starter traineddata file using the
>> combine_lang_model script.
>> I used the below command for that,
>>
>> *./build/src/training/combine_lang_model --input_unicharset
>> ../training/sintrain/sin/sin.unicharset --script_dir ../langdata --words
>> ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers
>> ../langdata/sin/sin.numbers --output_dir ../training/combined_sin
>> --version_str 1.0 --lang sin*
>>
>> When executing the above command I referred the langdata I created on my
>> own for words list, punctuations and numbers. Also I referred the
>> unicharset file that was created when creating training data. But I got the
>> following error output,
>>
>> *Loaded unicharset of size 90 from file
>> ../training/sintrain/sin/sin.unicharset*
>> *Setting unichar properties*
>> *Setting script properties*
>> *Warning: properties incomplete for index 4 = ී*
>> *Warning: properties incomplete for index 6 = ි*
>> *Warning: properties incomplete for index 11 = ු*
>> *Warning: properties incomplete for index 15 = ්*
>> *Warning: properties incomplete for index 30 = ූ*
>> *Warning: properties incomplete for index 44 = ්ර*
>> *Warning: properties incomplete for index 79 = ්ය*
>> *Warning: properties incomplete for index 82 = ක්*
>> *Warning: properties incomplete for index 89 = ර්*
>> *Error writing unicharset!!*
>>
>> Can somebody assist me on this.
>>
>> Thanks
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
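
For comparison, here is a rough sketch of the combine_lang_model call from the test run's log, rewritten against the paths in the original post. Every flag is copied from the log; the binary path, unicharset, langdata directory and output directory are the ones given in the original post. Two differences from the command in the post are visible: the log's call passes --pass_through_recoder and no --version_str, and its --script_dir points at ../langdata_lstm rather than a self-made langdata directory. Whether either difference explains "Error writing unicharset!!" is not established in this thread, so treat this as something to try, not a confirmed fix. The script only prints the command so it can be checked before running.

```shell
#!/bin/sh
# Sketch only: print a combine_lang_model command line that mirrors the flags
# tesstrain.sh used in the log, but with the paths from the original post.
# Nothing is executed here; copy the printed line to run it.

BIN=./build/src/training/combine_lang_model          # path from the original post
LANG_CODE=sin
LANGDATA_DIR=../langdata                             # poster's own langdata (the log used ../langdata_lstm)
UNICHARSET=../training/sintrain/sin/sin.unicharset   # from the original post
OUTPUT_DIR=../training/combined_sin                  # from the original post

build_combine_cmd() {
    # printf reuses the '%s ' format for every argument, so the result is a
    # single space-separated command line.
    printf '%s ' "$BIN" \
        --input_unicharset "$UNICHARSET" \
        --script_dir "$LANGDATA_DIR" \
        --words "$LANGDATA_DIR/$LANG_CODE/$LANG_CODE.wordlist" \
        --numbers "$LANGDATA_DIR/$LANG_CODE/$LANG_CODE.numbers" \
        --puncs "$LANGDATA_DIR/$LANG_CODE/$LANG_CODE.punc" \
        --output_dir "$OUTPUT_DIR" \
        --lang "$LANG_CODE" \
        --pass_through_recoder
    printf '\n'
}

build_combine_cmd
```

Setting LANGDATA_DIR=../langdata_lstm instead would reproduce the --script_dir, --words, --numbers and --puncs arguments from the log.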

