Re: [tesseract-ocr] Re: Error when executing combine_lang_model script

Shandigutt Tue, 04 Sep 2018 14:55:42 -0700

Thank you very much for sorting things out, Shree. But I have one more 
question.


When I run tesstrain.sh I don't pass my word list, punctuation and numbers 
files as input parameters, but I keep those files in the langdata folder. So 
when it executes combine_lang_model internally, does it pass these files as 
arguments to the combine_lang_model script?

Now that this step is completed, can I move straight to running the 
lstmtraining script?
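
Judging from the quoted log below, tesstrain.sh appears to build the --words, 
--numbers and --puncs arguments from --langdata_dir and --lang. A minimal shell 
sketch, assuming that naming pattern holds, for checking which of those files 
exist before a run (lm_path and report_lm_files are hypothetical helper names, 
not part of Tesseract):

```shell
#!/bin/sh
# Sketch only: mirrors the path pattern visible in the quoted log,
# <langdata_dir>/<lang>/<lang>.<suffix>. It does not call any Tesseract tool.

# Hypothetical helper: print the expected path for one language-model file.
# $1 = langdata dir, $2 = language code, $3 = suffix (wordlist, numbers, punc)
lm_path() {
    printf '%s/%s/%s.%s\n' "$1" "$2" "$2" "$3"
}

# Hypothetical helper: report which of the three files are present.
# $1 = langdata dir, $2 = language code
report_lm_files() {
    for suffix in wordlist numbers punc; do
        f=$(lm_path "$1" "$2" "$suffix")
        if [ -f "$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
        fi
    done
}

# Example values taken from the quoted run.
report_lm_files ../langdata_lstm sin
```

Any file reported as missing would be worth checking against the 
combine_lang_model line that tesstrain.sh prints during its own run.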

On Tuesday, September 4, 2018 at 6:25:37 AM UTC+3, shree wrote:
>
> > Then I tried to create a starter traineddata file 
> using combine_lang_model script. I used the below command for that, 
>
> When you run tesstrain.sh, it creates the starter traineddata using the 
> combine_lang_model script.
>
> See below for messages from a small test run.
>
> + /home/ubuntu/tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts 
> --lang sin --linedata_only --noextract_font_properties --langdata_dir 
> ../langdata_lstm --tessdata_dir ../tessdata_best --fontlist FreeSerif 
> --training_text ../langdata_lstm/sin/sin.training_text --workspace_dir 
> /home/ubuntu/tmp/ --save_box_tiff --maxpages 1 --output_dir 
> ../tesstutorial/sintest
>
> === Starting training for language 'sin'
> [Tue Sep 4 03:21:08 UTC 2018] 
> /home/ubuntu/tesseract/src/training/text2image --fonts_dir=../.fonts 
> --font=FreeSerif --outputbase=/home/ubuntu/tmp//fc-cache/sample_text.txt 
> --text=/home/ubuntu/tmp//fc-cache/sample_text.txt 
> --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache
> Rendered page 0 to file /home/ubuntu/tmp//fc-cache/sample_text.txt.tif
>
> === Phase I: Generating training images ===
> Rendering using FreeSerif
> [Tue Sep 4 03:21:10 UTC 2018] 
> /home/ubuntu/tesseract/src/training/text2image 
> --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache --fonts_dir=../.fonts 
> --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 
> --outputbase=/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --max_pages=1 
> --font=FreeSerif --text=../langdata_lstm/sin/sin.training_text
> Stripped 1 unrenderable words
> Rendered page 0 to file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif
>
> === Phase UP: Generating unicharset and unichar properties files ===
> [Tue Sep 4 03:21:11 UTC 2018] 
> /home/ubuntu/tesseract/src/training/unicharset_extractor 
> --output_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --norm_mode 2 
> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
> Extracting unicharset from box file 
> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
> Wrote unicharset file /tmp/sin-2018-09-04.Wa5/sin.unicharset
> [Tue Sep 4 03:21:11 UTC 2018] 
> /home/ubuntu/tesseract/src/training/set_unicharset_properties -U 
> /tmp/sin-2018-09-04.Wa5/sin.unicharset -O 
> /tmp/sin-2018-09-04.Wa5/sin.unicharset -X 
> /tmp/sin-2018-09-04.Wa5/sin.xheights --script_dir=../langdata_lstm
> Loaded unicharset of size 111 from file 
> /tmp/sin-2018-09-04.Wa5/sin.unicharset
> Setting unichar properties
> Setting script properties
> Warning: properties incomplete for index 7 = ි
> Warning: properties incomplete for index 9 = ු
> Warning: properties incomplete for index 17 = ්‌
> Warning: properties incomplete for index 19 = ී
> Warning: properties incomplete for index 38 = ්‍ර
> Warning: properties incomplete for index 66 = ₹
> Warning: properties incomplete for index 73 = ූ
> Warning: properties incomplete for index 79 = ්‍ය
> Warning: properties incomplete for index 89 = ක්‍
> Writing unicharset to file /tmp/sin-2018-09-04.Wa5/sin.unicharset
>
> === Phase E: Generating lstmf files ===
> Using TESSDATA_PREFIX=../tessdata_best
> [Tue Sep 4 03:21:12 UTC 2018] /home/ubuntu/tesseract/src/api/tesseract 
> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif 
> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --psm 6 lstm.train
> Tesseract Open Source OCR Engine v4.0.0-beta.4-93-ge4b9c with Leptonica
> Page 1
>
> === Constructing LSTM training data ===
> [Tue Sep 4 03:21:13 UTC 2018] 
> /home/ubuntu/tesseract/src/training/combine_lang_model --input_unicharset 
> /tmp/sin-2018-09-04.Wa5/sin.unicharset --script_dir ../langdata_lstm 
> --words ../langdata_lstm/sin/sin.wordlist --numbers 
> ../langdata_lstm/sin/sin.numbers --puncs ../langdata_lstm/sin/sin.punc 
> --output_dir ../tesstutorial/sintest --lang sin --pass_through_recoder
> Loaded unicharset of size 111 from file 
> /tmp/sin-2018-09-04.Wa5/sin.unicharset
> Setting unichar properties
> Setting script properties
> Warning: properties incomplete for index 7 = ි
> Warning: properties incomplete for index 9 = ු
> Warning: properties incomplete for index 17 = ්‌
> Warning: properties incomplete for index 19 = ී
> Warning: properties incomplete for index 38 = ්‍ර
> Warning: properties incomplete for index 66 = ₹
> Warning: properties incomplete for index 73 = ූ
> Warning: properties incomplete for index 79 = ්‍ය
> Warning: properties incomplete for index 89 = ක්‍
> Config file is optional, continuing...
> Failed to read data from: ../langdata_lstm/sin/sin.config
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
>
> === Saving box/tiff pairs for training data ===
> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box to 
> ../tesstutorial/sintest
> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif to 
> ../tesstutorial/sintest
>
> === Moving lstmf files for training data ===
> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.lstmf to 
> ../tesstutorial/sintest
>
> Created starter traineddata for language 'sin'
>
>
> Run lstmtraining to do the LSTM training for language 'sin'
>
>
> real 0m5.238s
> user 0m3.792s
> sys 0m0.256s
>
>
> On Tue, Sep 4, 2018 at 2:49 AM, Shandigutt <[email protected]> wrote:
>
>> Adding more details to my query,
>>
>> *My tesseract version:*
>> tesseract 4.0.0-beta.4-74-gd8237
>>  leptonica-1.77.0
>>   libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 
>> 1.2.11
>>  Found SSE
>>
>> *My OS details,*
>> tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a
>> No LSB modules are available.
>> Distributor ID: Ubuntu
>> Description: Ubuntu 18.04.1 LTS
>> Release: 18.04
>> Codename: bionic
>>
>> Thanks
>>
>> On Tuesday, September 4, 2018 at 12:11:50 AM UTC+3, Shandigutt wrote:
>>>
>>> Hi,
>>>
>>> I'm currently in the process of training Tesseract for a new language. I'm 
>>> following the Tesseract wiki training guidelines 
>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>>
>>> Once I had built Tesseract from source and installed it, I first created 
>>> my own langdata set. 
>>>
>>> Then I created training data and eval data using the tesstrain.sh script.
>>>
>>> Then I tried to create a starter traineddata file 
>>> using combine_lang_model script. I used the below command for that,
>>>
>>> *./build/src/training/combine_lang_model --input_unicharset 
>>> ../training/sintrain/sin/sin.unicharset --script_dir ../langdata --words 
>>> ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers 
>>> ../langdata/sin/sin.numbers --output_dir ../training/combined_sin 
>>> --version_str 1.0 --lang sin*
>>>
>>> When executing the above command I pointed to the langdata I created on my 
>>> own for the word list, punctuation and numbers. I also pointed to the 
>>> unicharset file that was created when creating the training data. But I 
>>> got the following error output:
>>>
>>> *Loaded unicharset of size 90 from file 
>>> ../training/sintrain/sin/sin.unicharset*
>>> *Setting unichar properties*
>>> *Setting script properties*
>>> *Warning: properties incomplete for index 4 = ී*
>>> *Warning: properties incomplete for index 6 = ි*
>>> *Warning: properties incomplete for index 11 = ු*
>>> *Warning: properties incomplete for index 15 = ්‌*
>>> *Warning: properties incomplete for index 30 = ූ*
>>> *Warning: properties incomplete for index 44 = ්‍ර*
>>> *Warning: properties incomplete for index 79 = ්‍ය*
>>> *Warning: properties incomplete for index 82 = ක්‍*
>>> *Warning: properties incomplete for index 89 = ර්‍*
>>> *Error writing unicharset!!*
>>>
>>> Can somebody assist me with this?
>>>
>>> Thanks
>>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com.
>>
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> -- 
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>

