Thank you very much for sorting things out, Shree. But I have one more question.

When I run tesstrain.sh, I don't pass my word list, punctuation and numbers files as input parameters, but I do keep those files in the langdata folder. When tesstrain.sh runs combine_lang_model internally, does it pass these files as arguments to combine_lang_model? And since this step is now complete, can I move straight on to running lstmtraining?

On Tuesday, September 4, 2018 at 6:25:37 AM UTC+3, shree wrote:
>
> > Then I tried to create a starter traineddata file
> > using combine_lang_model script. I used the below command for that,
>
> When you run tesstrain.sh, it creates the starter traineddata using
> combine_lang_model script.
>
> See below for messages from a small test run.
>
> + /home/ubuntu/tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts
> --lang sin --linedata_only --noextract_font_properties --langdata_dir
> ../langdata_lstm --tessdata_dir ../tessdata_best --fontlist FreeSerif
> --training_text ../langdata_lstm/sin/sin.training_text --workspace_dir
> /home/ubuntu/tmp/ --save_box_tiff --maxpages 1 --output_dir
> ../tesstutorial/sintest
>
> === Starting training for language 'sin'
> [Tue Sep 4 03:21:08 UTC 2018]
> /home/ubuntu/tesseract/src/training/text2image --fonts_dir=../.fonts
> --font=FreeSerif --outputbase=/home/ubuntu/tmp//fc-cache/sample_text.txt
> --text=/home/ubuntu/tmp//fc-cache/sample_text.txt
> --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache
> Rendered page 0 to file /home/ubuntu/tmp//fc-cache/sample_text.txt.tif
>
> === Phase I: Generating training images ===
> Rendering using FreeSerif
> [Tue Sep 4 03:21:10 UTC 2018]
> /home/ubuntu/tesseract/src/training/text2image
> --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache --fonts_dir=../.fonts
> --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0
> --outputbase=/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --max_pages=1
> --font=FreeSerif --text=../langdata_lstm/sin/sin.training_text
> Stripped 1 unrenderable words
> Rendered page 0 to file
> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif
>
> === Phase UP: Generating unicharset and unichar properties files ===
> [Tue Sep 4 03:21:11 UTC 2018]
> /home/ubuntu/tesseract/src/training/unicharset_extractor
> --output_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --norm_mode 2
> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
> Extracting unicharset from box file
> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
> Wrote unicharset file /tmp/sin-2018-09-04.Wa5/sin.unicharset
> [Tue Sep 4 03:21:11 UTC 2018]
> /home/ubuntu/tesseract/src/training/set_unicharset_properties -U
> /tmp/sin-2018-09-04.Wa5/sin.unicharset -O
> /tmp/sin-2018-09-04.Wa5/sin.unicharset -X
> /tmp/sin-2018-09-04.Wa5/sin.xheights --script_dir=../langdata_lstm
> Loaded unicharset of size 111 from file
> /tmp/sin-2018-09-04.Wa5/sin.unicharset
> Setting unichar properties
> Setting script properties
> Warning: properties incomplete for index 7 = ි
> Warning: properties incomplete for index 9 = ු
> Warning: properties incomplete for index 17 = ්
> Warning: properties incomplete for index 19 = ී
> Warning: properties incomplete for index 38 = ්ර
> Warning: properties incomplete for index 66 = ₹
> Warning: properties incomplete for index 73 = ූ
> Warning: properties incomplete for index 79 = ්ය
> Warning: properties incomplete for index 89 = ක්
> Writing unicharset to file /tmp/sin-2018-09-04.Wa5/sin.unicharset
>
> === Phase E: Generating lstmf files ===
> Using TESSDATA_PREFIX=../tessdata_best
> [Tue Sep 4 03:21:12 UTC 2018] /home/ubuntu/tesseract/src/api/tesseract
> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif
> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --psm 6 lstm.train
> Tesseract Open Source OCR Engine v4.0.0-beta.4-93-ge4b9c with Leptonica
> Page 1
>
> === Constructing LSTM training data ===
> [Tue Sep 4 03:21:13 UTC 2018]
> /home/ubuntu/tesseract/src/training/combine_lang_model --input_unicharset
> /tmp/sin-2018-09-04.Wa5/sin.unicharset --script_dir ../langdata_lstm
> --words ../langdata_lstm/sin/sin.wordlist --numbers
> ../langdata_lstm/sin/sin.numbers --puncs ../langdata_lstm/sin/sin.punc
> --output_dir ../tesstutorial/sintest --lang sin --pass_through_recoder
> Loaded unicharset of size 111 from file
> /tmp/sin-2018-09-04.Wa5/sin.unicharset
> Setting unichar properties
> Setting script properties
> Warning: properties incomplete for index 7 = ි
> Warning: properties incomplete for index 9 = ු
> Warning: properties incomplete for index 17 = ්
> Warning: properties incomplete for index 19 = ී
> Warning: properties incomplete for index 38 = ්ර
> Warning: properties incomplete for index 66 = ₹
> Warning: properties incomplete for index 73 = ූ
> Warning: properties incomplete for index 79 = ්ය
> Warning: properties incomplete for index 89 = ක්
> Config file is optional, continuing...
> Failed to read data from: ../langdata_lstm/sin/sin.config
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
>
> === Saving box/tiff pairs for training data ===
> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box to
> ../tesstutorial/sintest
> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif to
> ../tesstutorial/sintest
>
> === Moving lstmf files for training data ===
> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.lstmf to
> ../tesstutorial/sintest
>
> Created starter traineddata for language 'sin'
>
>
> Run lstmtraining to do the LSTM training for language 'sin'
>
>
> real 0m5.238s
> user 0m3.792s
> sys 0m0.256s
>
>
> On Tue, Sep 4, 2018 at 2:49 AM, Shandigutt <[email protected]> wrote:
>
>> Adding more details to my query,
>>
>> *My tesseract version:*
>> tesseract 4.0.0-beta.4-74-gd8237
>> leptonica-1.77.0
>> libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib
>> 1.2.11
>> Found SSE
>>
>> *My OS details,*
>> tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a
>> No LSB modules are available.
>> Distributor ID: Ubuntu
>> Description: Ubuntu 18.04.1 LTS
>> Release: 18.04
>> Codename: bionic
>>
>> Thanks
>>
>> On Tuesday, September 4, 2018 at 12:11:50 AM UTC+3, Shandigutt wrote:
>>>
>>> Hi,
>>>
>>> I'm currently in the process of training Tesseract for a new language. I'm
>>> currently following the Tesseract wiki training guidelines
>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>>
>>> Once I had built Tesseract from source and installed it, I first created
>>> my own langdata set.
>>>
>>> Then I created training data and eval data using the tesstrain.sh script.
>>>
>>> Then I tried to create a starter traineddata file
>>> using combine_lang_model script. I used the below command for that,
>>>
>>> *./build/src/training/combine_lang_model --input_unicharset
>>> ../training/sintrain/sin/sin.unicharset --script_dir ../langdata --words
>>> ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers
>>> ../langdata/sin/sin.numbers --output_dir ../training/combined_sin
>>> --version_str 1.0 --lang sin*
>>>
>>> When executing the above command I referred to the langdata I created on
>>> my own for the word list, punctuation and numbers. I also referred to the
>>> unicharset file that was created when creating the training data. But I
>>> got the following error output,
>>>
>>> *Loaded unicharset of size 90 from file
>>> ../training/sintrain/sin/sin.unicharset*
>>> *Setting unichar properties*
>>> *Setting script properties*
>>> *Warning: properties incomplete for index 4 = ී*
>>> *Warning: properties incomplete for index 6 = ි*
>>> *Warning: properties incomplete for index 11 = ු*
>>> *Warning: properties incomplete for index 15 = ්*
>>> *Warning: properties incomplete for index 30 = ූ*
>>> *Warning: properties incomplete for index 44 = ්ර*
>>> *Warning: properties incomplete for index 79 = ්ය*
>>> *Warning: properties incomplete for index 82 = ක්*
>>> *Warning: properties incomplete for index 89 = ර්*
>>> *Error writing unicharset!!*
>>>
>>> Can somebody assist me with this?
>>>
>>> Thanks
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>
> --
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
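On the question at the top of this message: in the test run quoted above, the combine_lang_model call issued by tesstrain.sh carries --words, --numbers and --puncs, each pointing at a file under <langdata_dir>/<lang>/. The following is only a minimal Python sketch, assuming that same layout, for checking which of those files are present before running tesstrain.sh. The helper names langdata_inputs and missing_inputs are hypothetical and are not part of Tesseract; the flag names and file suffixes are copied from the quoted log, not from the tool's documentation.

```python
from pathlib import Path


def langdata_inputs(langdata_dir, lang):
    """Map combine_lang_model flags to the paths seen in the quoted log.

    Hypothetical helper, not part of Tesseract: it only mirrors the
    <langdata_dir>/<lang>/<lang>.<suffix> layout visible in that log.
    """
    base = Path(langdata_dir) / lang
    suffixes = {"--words": "wordlist", "--numbers": "numbers", "--puncs": "punc"}
    return {flag: base / f"{lang}.{suffix}" for flag, suffix in suffixes.items()}


def missing_inputs(langdata_dir, lang):
    """Return the flags whose expected file does not exist on disk."""
    return [
        flag
        for flag, path in langdata_inputs(langdata_dir, lang).items()
        if not path.is_file()
    ]


if __name__ == "__main__":
    # Directory and language code taken from the command in the original post.
    for flag, path in langdata_inputs("../langdata", "sin").items():
        status = "found" if path.is_file() else "missing"
        print(f"{flag} {path}: {status}")
```

If a file is reported as missing, it is worth comparing the --langdata_dir value given to tesstrain.sh with the folder that actually holds the files, since the quoted run used ../langdata_lstm while the original command used ../langdata.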

