The easiest way to check is to use combine_tessdata to unpack the starter
traineddata file and see what is included. You can then use dawg2wordlist
to verify that the correct files were included.
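
A rough sketch of that check follows. The helper names (run,
inspect_starter), the DRY_RUN switch and the .check output file are
made up for illustration, and the lstm-unicharset / lstm-word-dawg
component names are what I would expect in an LSTM starter file; confirm
them against the listing printed by combine_tessdata -d before relying
on them.

```shell
#!/bin/sh
# Sketch only: unpack a starter traineddata and dump its word list.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# inspect_starter TRAINEDDATA OUTDIR LANG   (hypothetical helper)
inspect_starter() {
    traineddata=$1
    outdir=$2
    lang=$3

    run mkdir -p "$outdir"
    # List the components stored inside the traineddata file.
    run combine_tessdata -d "$traineddata"
    # Unpack every component to OUTDIR/LANG.<component>.
    run combine_tessdata -u "$traineddata" "$outdir/$lang."
    # Convert the unpacked word dawg back into a plain-text word list.
    # Component names are assumed; check the -d listing above.
    run dawg2wordlist "$outdir/$lang.lstm-unicharset" \
        "$outdir/$lang.lstm-word-dawg" "$outdir/$lang.wordlist.check"
}
```

Assuming the starter file ended up under the tesstrain.sh output
directory, you would call it as
inspect_starter ../tesstutorial/sintest/sin/sin.traineddata /tmp/sin-unpacked sin
and then compare the resulting sin.wordlist.check with your own
langdata/sin/sin.wordlist.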

Yes, once you have created the starter traineddata, you can run
lstmtraining.
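
As a minimal sketch of that next step, assuming you train from scratch:
the helper names (run, train_from_starter), the DRY_RUN switch and the
directory layout are assumptions, and the network spec is just the
example from the training wiki, not a recommendation for Sinhala. I am
also assuming tesstrain.sh left a sin.training_files.txt list in its
output directory, so check that the file exists first.

```shell
#!/bin/sh
# Sketch only: start LSTM training from a starter traineddata.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Example network spec copied from the training wiki (illustrative only).
NET_SPEC=${NET_SPEC:-'[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]'}

# train_from_starter DATA_DIR LANG MODEL_DIR   (hypothetical helper)
# DATA_DIR is the --output_dir that was given to tesstrain.sh.
train_from_starter() {
    data_dir=$1
    lang=$2
    model_dir=$3

    run mkdir -p "$model_dir"
    # Checkpoints are written with the prefix MODEL_DIR/base.
    run lstmtraining \
        --traineddata "$data_dir/$lang/$lang.traineddata" \
        --train_listfile "$data_dir/$lang.training_files.txt" \
        --net_spec "$NET_SPEC" \
        --model_output "$model_dir/base" \
        --max_iterations "${MAX_ITERATIONS:-5000}"
}
```

For the test run quoted below this would be something like
train_from_starter ../tesstutorial/sintest sin ../tesstutorial/sinoutput
where the sinoutput directory name is only a placeholder.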

On Wed, Sep 5, 2018 at 3:25 AM, Shandigutt <[email protected]> wrote:

> Thank you very much for sorting things out, Shree. But I have one more
> question.
>
> When I run tesstrain.sh, I don't pass my word list, punctuation and
> numbers files as input parameters, but I keep those files in the langdata
> folder. When it runs combine_lang_model internally, does it pass these
> files as arguments to combine_lang_model?
>
> Now that this step is completed, can I move straight to running
> lstmtraining?
>
> On Tuesday, September 4, 2018 at 6:25:37 AM UTC+3, shree wrote:
>>
>> > Then I tried to create a starter traineddata file
>> using combine_lang_model script. I used the below command for that,
>>
>> When you run tesstrain.sh, it creates the starter traineddata using the
>> combine_lang_model script.
>>
>> See below for messages from a small test run.
>>
>> + /home/ubuntu/tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts
>> --lang sin --linedata_only --noextract_font_properties --langdata_dir
>> ../langdata_lstm --tessdata_dir ../tessdata_best --fontlist FreeSerif
>> --training_text ../langdata_lstm/sin/sin.training_text --workspace_dir
>> /home/ubuntu/tmp/ --save_box_tiff --maxpages 1 --output_dir
>> ../tesstutorial/sintest
>>
>> === Starting training for language 'sin'
>> [Tue Sep 4 03:21:08 UTC 2018] /home/ubuntu/tesseract/src/training/text2image
>> --fonts_dir=../.fonts --font=FreeSerif 
>> --outputbase=/home/ubuntu/tmp//fc-cache/sample_text.txt
>> --text=/home/ubuntu/tmp//fc-cache/sample_text.txt
>> --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache
>> Rendered page 0 to file /home/ubuntu/tmp//fc-cache/sample_text.txt.tif
>>
>> === Phase I: Generating training images ===
>> Rendering using FreeSerif
>> [Tue Sep 4 03:21:10 UTC 2018] /home/ubuntu/tesseract/src/training/text2image
>> --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache --fonts_dir=../.fonts
>> --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0
>> --outputbase=/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --max_pages=1
>> --font=FreeSerif --text=../langdata_lstm/sin/sin.training_text
>> Stripped 1 unrenderable words
>> Rendered page 0 to file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif
>>
>> === Phase UP: Generating unicharset and unichar properties files ===
>> [Tue Sep 4 03:21:11 UTC 2018] 
>> /home/ubuntu/tesseract/src/training/unicharset_extractor
>> --output_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --norm_mode 2
>> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
>> Extracting unicharset from box file
>> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
>> Wrote unicharset file /tmp/sin-2018-09-04.Wa5/sin.unicharset
>> [Tue Sep 4 03:21:11 UTC 2018]
>> /home/ubuntu/tesseract/src/training/set_unicharset_properties -U
>> /tmp/sin-2018-09-04.Wa5/sin.unicharset -O
>> /tmp/sin-2018-09-04.Wa5/sin.unicharset -X
>> /tmp/sin-2018-09-04.Wa5/sin.xheights --script_dir=../langdata_lstm
>> Loaded unicharset of size 111 from file
>> /tmp/sin-2018-09-04.Wa5/sin.unicharset
>> Setting unichar properties
>> Setting script properties
>> Warning: properties incomplete for index 7 = ි
>> Warning: properties incomplete for index 9 = ු
>> Warning: properties incomplete for index 17 = ්‌
>> Warning: properties incomplete for index 19 = ී
>> Warning: properties incomplete for index 38 = ්‍ර
>> Warning: properties incomplete for index 66 = ₹
>> Warning: properties incomplete for index 73 = ූ
>> Warning: properties incomplete for index 79 = ්‍ය
>> Warning: properties incomplete for index 89 = ක්‍
>> Writing unicharset to file /tmp/sin-2018-09-04.Wa5/sin.unicharset
>>
>> === Phase E: Generating lstmf files ===
>> Using TESSDATA_PREFIX=../tessdata_best
>> [Tue Sep 4 03:21:12 UTC 2018] /home/ubuntu/tesseract/src/api/tesseract
>> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif
>> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --psm 6 lstm.train
>> Tesseract Open Source OCR Engine v4.0.0-beta.4-93-ge4b9c with Leptonica
>> Page 1
>>
>> === Constructing LSTM training data ===
>> [Tue Sep 4 03:21:13 UTC 2018] 
>> /home/ubuntu/tesseract/src/training/combine_lang_model
>> --input_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --script_dir
>> ../langdata_lstm --words ../langdata_lstm/sin/sin.wordlist --numbers
>> ../langdata_lstm/sin/sin.numbers --puncs ../langdata_lstm/sin/sin.punc
>> --output_dir ../tesstutorial/sintest --lang sin --pass_through_recoder
>> Loaded unicharset of size 111 from file
>> /tmp/sin-2018-09-04.Wa5/sin.unicharset
>> Setting unichar properties
>> Setting script properties
>> Warning: properties incomplete for index 7 = ි
>> Warning: properties incomplete for index 9 = ු
>> Warning: properties incomplete for index 17 = ්‌
>> Warning: properties incomplete for index 19 = ී
>> Warning: properties incomplete for index 38 = ්‍ර
>> Warning: properties incomplete for index 66 = ₹
>> Warning: properties incomplete for index 73 = ූ
>> Warning: properties incomplete for index 79 = ්‍ය
>> Warning: properties incomplete for index 89 = ක්‍
>> Config file is optional, continuing...
>> Failed to read data from: ../langdata_lstm/sin/sin.config
>> Reducing Trie to SquishedDawg
>> Reducing Trie to SquishedDawg
>> Reducing Trie to SquishedDawg
>>
>> === Saving box/tiff pairs for training data ===
>> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box to
>> ../tesstutorial/sintest
>> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif to
>> ../tesstutorial/sintest
>>
>> === Moving lstmf files for training data ===
>> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.lstmf to
>> ../tesstutorial/sintest
>>
>> Created starter traineddata for language 'sin'
>>
>>
>> Run lstmtraining to do the LSTM training for language 'sin'
>>
>>
>> real 0m5.238s
>> user 0m3.792s
>> sys 0m0.256s
>>
>>
>> On Tue, Sep 4, 2018 at 2:49 AM, Shandigutt <[email protected]> wrote:
>>
>>> Adding more details to my query,
>>>
>>> *My tesseract  version:*
>>> tesseract 4.0.0-beta.4-74-gd8237
>>>  leptonica-1.77.0
>>>   libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 :
>>> zlib 1.2.11
>>>  Found SSE
>>>
>>> *My OS details,*
>>> tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a
>>> No LSB modules are available.
>>> Distributor ID: Ubuntu
>>> Description: Ubuntu 18.04.1 LTS
>>> Release: 18.04
>>> Codename: bionic
>>>
>>> Thanks
>>>
>>> On Tuesday, September 4, 2018 at 12:11:50 AM UTC+3, Shandigutt wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm currently in the process of training Tesseract for a new language,
>>>> following the Tesseract wiki training guidelines
>>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>>>
>>>> After I built Tesseract from source and installed it, I first created my
>>>> own langdata set.
>>>>
>>>> Then I created training data and eval data using the tesstrain.sh script.
>>>>
>>>> Then I tried to create a starter traineddata file using the
>>>> combine_lang_model script. I used the command below:
>>>>
>>>> *./build/src/training/combine_lang_model --input_unicharset
>>>> ../training/sintrain/sin/sin.unicharset --script_dir ../langdata --words
>>>> ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers
>>>> ../langdata/sin/sin.numbers --output_dir ../training/combined_sin
>>>> --version_str 1.0 --lang sin*
>>>>
>>>> When executing the above command, I pointed to the langdata I created on
>>>> my own for the word list, punctuation and numbers, and to the unicharset
>>>> file that was generated when creating the training data. But I got the
>>>> following error output:
>>>>
>>>> *Loaded unicharset of size 90 from file
>>>> ../training/sintrain/sin/sin.unicharset*
>>>> *Setting unichar properties*
>>>> *Setting script properties*
>>>> *Warning: properties incomplete for index 4 = ී*
>>>> *Warning: properties incomplete for index 6 = ි*
>>>> *Warning: properties incomplete for index 11 = ු*
>>>> *Warning: properties incomplete for index 15 = ්‌*
>>>> *Warning: properties incomplete for index 30 = ූ*
>>>> *Warning: properties incomplete for index 44 = ්‍ර*
>>>> *Warning: properties incomplete for index 79 = ්‍ය*
>>>> *Warning: properties incomplete for index 82 = ක්‍*
>>>> *Warning: properties incomplete for index 89 = ර්‍*
>>>> *Error writing unicharset!!*
>>>>
>>>> Can somebody assist me with this?
>>>>
>>>> Thanks
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com
>>> .
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>



