Thank you very much Shree

On Wednesday, September 5, 2018 at 7:11:51 AM UTC+3, shree wrote:
>
> Easiest way to check is to use combine_tessdata to unpack the starter traineddata file and see what is included. You can use dawg2wordlist to verify that the correct files are included.
>
> Yes, after you have created the starter traineddata, you can run lstmtraining.
>
> On Wed, Sep 5, 2018 at 3:25 AM, Shandigutt <[email protected]> wrote:
>
>> Thank you very much for sorting things out, Shree. But I have one more question.
>>
>> When I run tesstrain.sh I don't pass my word list, punctuation and numbers as input parameters, but I keep those files in the langdata folder. So when it executes combine_lang_model internally, does it pass these files as arguments to combine_lang_model?
>>
>> Now that this step is completed, can I move straight to running lstmtraining?
>>
>> On Tuesday, September 4, 2018 at 6:25:37 AM UTC+3, shree wrote:
>>>
>>> > Then I tried to create a starter traineddata file using combine_lang_model script. I used the below command for that,
>>>
>>> When you run tesstrain.sh, it creates the starter traineddata using the combine_lang_model script.
>>>
>>> See below for messages from a small test run.
>>>
>>> + /home/ubuntu/tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts --lang sin --linedata_only --noextract_font_properties --langdata_dir ../langdata_lstm --tessdata_dir ../tessdata_best --fontlist FreeSerif --training_text ../langdata_lstm/sin/sin.training_text --workspace_dir /home/ubuntu/tmp/ --save_box_tiff --maxpages 1 --output_dir ../tesstutorial/sintest
>>>
>>> === Starting training for language 'sin'
>>> [Tue Sep 4 03:21:08 UTC 2018] /home/ubuntu/tesseract/src/training/text2image --fonts_dir=../.fonts --font=FreeSerif --outputbase=/home/ubuntu/tmp//fc-cache/sample_text.txt --text=/home/ubuntu/tmp//fc-cache/sample_text.txt --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache
>>> Rendered page 0 to file /home/ubuntu/tmp//fc-cache/sample_text.txt.tif
>>>
>>> === Phase I: Generating training images ===
>>> Rendering using FreeSerif
>>> [Tue Sep 4 03:21:10 UTC 2018] /home/ubuntu/tesseract/src/training/text2image --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache --fonts_dir=../.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --max_pages=1 --font=FreeSerif --text=../langdata_lstm/sin/sin.training_text
>>> Stripped 1 unrenderable words
>>> Rendered page 0 to file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif
>>>
>>> === Phase UP: Generating unicharset and unichar properties files ===
>>> [Tue Sep 4 03:21:11 UTC 2018] /home/ubuntu/tesseract/src/training/unicharset_extractor --output_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --norm_mode 2 /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
>>> Extracting unicharset from box file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
>>> Wrote unicharset file /tmp/sin-2018-09-04.Wa5/sin.unicharset
>>> [Tue Sep 4 03:21:11 UTC 2018] /home/ubuntu/tesseract/src/training/set_unicharset_properties -U /tmp/sin-2018-09-04.Wa5/sin.unicharset -O /tmp/sin-2018-09-04.Wa5/sin.unicharset -X /tmp/sin-2018-09-04.Wa5/sin.xheights --script_dir=../langdata_lstm
>>> Loaded unicharset of size 111 from file /tmp/sin-2018-09-04.Wa5/sin.unicharset
>>> Setting unichar properties
>>> Setting script properties
>>> Warning: properties incomplete for index 7 = ි
>>> Warning: properties incomplete for index 9 = ු
>>> Warning: properties incomplete for index 17 = ්
>>> Warning: properties incomplete for index 19 = ී
>>> Warning: properties incomplete for index 38 = ්ර
>>> Warning: properties incomplete for index 66 = ₹
>>> Warning: properties incomplete for index 73 = ූ
>>> Warning: properties incomplete for index 79 = ්ය
>>> Warning: properties incomplete for index 89 = ක්
>>> Writing unicharset to file /tmp/sin-2018-09-04.Wa5/sin.unicharset
>>>
>>> === Phase E: Generating lstmf files ===
>>> Using TESSDATA_PREFIX=../tessdata_best
>>> [Tue Sep 4 03:21:12 UTC 2018] /home/ubuntu/tesseract/src/api/tesseract /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --psm 6 lstm.train
>>> Tesseract Open Source OCR Engine v4.0.0-beta.4-93-ge4b9c with Leptonica
>>> Page 1
>>>
>>> === Constructing LSTM training data ===
>>> [Tue Sep 4 03:21:13 UTC 2018] /home/ubuntu/tesseract/src/training/combine_lang_model --input_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --script_dir ../langdata_lstm --words ../langdata_lstm/sin/sin.wordlist --numbers ../langdata_lstm/sin/sin.numbers --puncs ../langdata_lstm/sin/sin.punc --output_dir ../tesstutorial/sintest --lang sin --pass_through_recoder
>>> Loaded unicharset of size 111 from file /tmp/sin-2018-09-04.Wa5/sin.unicharset
>>> Setting unichar properties
>>> Setting script properties
>>> Warning: properties incomplete for index 7 = ි
>>> Warning: properties incomplete for index 9 = ු
>>> Warning: properties incomplete for index 17 = ්
>>> Warning: properties incomplete for index 19 = ී
>>> Warning: properties incomplete for index 38 = ්ර
>>> Warning: properties incomplete for index 66 = ₹
>>> Warning: properties incomplete for index 73 = ූ
>>> Warning: properties incomplete for index 79 = ්ය
>>> Warning: properties incomplete for index 89 = ක්
>>> Config file is optional, continuing...
>>> Failed to read data from: ../langdata_lstm/sin/sin.config
>>> Reducing Trie to SquishedDawg
>>> Reducing Trie to SquishedDawg
>>> Reducing Trie to SquishedDawg
>>>
>>> === Saving box/tiff pairs for training data ===
>>> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box to ../tesstutorial/sintest
>>> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif to ../tesstutorial/sintest
>>>
>>> === Moving lstmf files for training data ===
>>> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.lstmf to ../tesstutorial/sintest
>>>
>>> Created starter traineddata for language 'sin'
>>>
>>> Run lstmtraining to do the LSTM training for language 'sin'
>>>
>>> real 0m5.238s
>>> user 0m3.792s
>>> sys 0m0.256s
>>>
>>> On Tue, Sep 4, 2018 at 2:49 AM, Shandigutt <[email protected]> wrote:
>>>
>>>> Adding more details to my query,
>>>>
>>>> *My tesseract version:*
>>>> tesseract 4.0.0-beta.4-74-gd8237
>>>> leptonica-1.77.0
>>>> libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
>>>> Found SSE
>>>>
>>>> *My OS details,*
>>>> tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a
>>>> No LSB modules are available.
>>>> Distributor ID: Ubuntu
>>>> Description: Ubuntu 18.04.1 LTS
>>>> Release: 18.04
>>>> Codename: bionic
>>>>
>>>> Thanks
>>>>
>>>> On Tuesday, September 4, 2018 at 12:11:50 AM UTC+3, Shandigutt wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm currently in the process of training Tesseract for new language. I'm currently following Tesseract wiki training guidelines <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>>>>
>>>>> Once I had built Tesseract from source and installed it, I first created my own langdata set.
>>>>>
>>>>> Then I created training data and eval data using the tesstrain.sh script.
>>>>>
>>>>> Then I tried to create a starter traineddata file using the combine_lang_model script. I used the command below for that:
>>>>>
>>>>> *./build/src/training/combine_lang_model --input_unicharset ../training/sintrain/sin/sin.unicharset --script_dir ../langdata --words ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers ../langdata/sin/sin.numbers --output_dir ../training/combined_sin --version_str 1.0 --lang sin*
>>>>>
>>>>> When executing the above command I referred to the langdata I created on my own for the word list, punctuation and numbers. I also referred to the unicharset file that was created when creating the training data. But I got the following error output:
>>>>>
>>>>> *Loaded unicharset of size 90 from file ../training/sintrain/sin/sin.unicharset*
>>>>> *Setting unichar properties*
>>>>> *Setting script properties*
>>>>> *Warning: properties incomplete for index 4 = ී*
>>>>> *Warning: properties incomplete for index 6 = ි*
>>>>> *Warning: properties incomplete for index 11 = ු*
>>>>> *Warning: properties incomplete for index 15 = ්*
>>>>> *Warning: properties incomplete for index 30 = ූ*
>>>>> *Warning: properties incomplete for index 44 = ්ර*
>>>>> *Warning: properties incomplete for index 79 = ්ය*
>>>>> *Warning: properties incomplete for index 82 = ක්*
>>>>> *Warning: properties incomplete for index 89 = ර්*
>>>>> *Error writing unicharset!!*
>>>>>
>>>>> Can somebody assist me with this?
>>>>>
>>>>> Thanks
>>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com <https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>> --
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
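For anyone landing on this thread later, Shree's suggestion to unpack the starter traineddata and check the word list can be sketched roughly as below. This is only a sketch under assumptions: the traineddata path follows the `--output_dir ../tesstutorial/sintest` from the sample run, the `./unpacked` directory and the `run` / `inspect_starter` helpers are hypothetical names introduced here, and the component suffixes (`lstm-unicharset`, `lstm-word-dawg`) are what I understand a 4.0 LSTM starter traineddata to contain. Commands are only printed unless `RUN=1` is set.

```shell
#!/usr/bin/env bash
# Sketch: unpack a starter traineddata and dump its word DAWG as plain text.
# All paths are assumptions based on the sample run's --output_dir.
TRAINEDDATA="${TRAINEDDATA:-../tesstutorial/sintest/sin/sin.traineddata}"
OUT_DIR="${OUT_DIR:-./unpacked}"        # hypothetical scratch directory
LANG_CODE="${LANG_CODE:-sin}"

# Print each command; execute it only when RUN=1, so the sketch can be read
# and tried on a machine without the Tesseract training tools installed.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

inspect_starter() {
  local prefix="$OUT_DIR/$LANG_CODE."
  run mkdir -p "$OUT_DIR" &&
  # -u unpacks every component into files named <prefix><component>.
  run combine_tessdata -u "$TRAINEDDATA" "$prefix" &&
  # dawg2wordlist takes: <unicharset> <dawg file> <output word list>.
  run dawg2wordlist "${prefix}lstm-unicharset" "${prefix}lstm-word-dawg" \
      "$OUT_DIR/$LANG_CODE.wordlist.check"
}

inspect_starter
```

With `RUN=1`, the resulting `sin.wordlist.check` can then be compared (for example after sorting both files) against the `sin.wordlist` kept in the langdata folder to confirm that tesstrain.sh picked up the intended file.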
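As a follow-up to "you can run lstmtraining": a rough sketch of what that next step might look like, assuming the directory layout from the sample run (`../tesstutorial/sintest`, with the starter traineddata in a `sin/` subfolder and a `sin.training_files.txt` list next to it). The model output directory, the iteration count, and the network spec are assumptions borrowed from the general shape of the wiki's from-scratch example, not something confirmed in this thread; the trailing `O1c111` simply mirrors the unicharset size of 111 reported in the log above. The `run` and `train_from_scratch` helpers are hypothetical, and nothing is executed unless `RUN=1` is set.

```shell
#!/usr/bin/env bash
# Sketch: start LSTM training from the starter traineddata made by tesstrain.sh.
OUTPUT_DIR="${OUTPUT_DIR:-../tesstutorial/sintest}"   # --output_dir of the sample run
LANG_CODE="${LANG_CODE:-sin}"
MODEL_DIR="${MODEL_DIR:-../tesstutorial/sinoutput}"   # hypothetical checkpoint directory

# Print each command; execute it only when RUN=1.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

train_from_scratch() {
  run mkdir -p "$MODEL_DIR" &&
  # The net_spec and max_iterations values are assumptions; tune for real data.
  run lstmtraining \
    --traineddata "$OUTPUT_DIR/$LANG_CODE/$LANG_CODE.traineddata" \
    --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
    --model_output "$MODEL_DIR/base" \
    --train_listfile "$OUTPUT_DIR/$LANG_CODE.training_files.txt" \
    --max_iterations 5000
}

train_from_scratch
```

A one-page test run like the one in the log is far too little data for a usable model; the sketch only shows how the files produced by tesstrain.sh feed into lstmtraining.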

