Re: [tesseract-ocr] Re: Error when executing combine_lang_model script

Shandigutt Sat, 08 Sep 2018 06:39:26 -0700

Thank you very much Shree

On Wednesday, September 5, 2018 at 7:11:51 AM UTC+3, shree wrote:
>
> The easiest way to check is to use combine_tessdata to unpack the starter 
> traineddata file and see what is included. You can use dawg2wordlist to 
> verify that the correct files were included.
>
> Yes, after you have created the starter traineddata, you can run 
> lstmtraining.
>
> On Wed, Sep 5, 2018 at 3:25 AM, Shandigutt <[email protected]> wrote:
>
>> Thank you very much for sorting things out, Shree. But I have one more 
>> question.
>>
>> When I run tesstrain.sh I don't pass my word list, punctuation and 
>> numbers files as input parameters, but I keep those files in the langdata 
>> folder. So when it executes combine_lang_model internally, does it pass 
>> these files as arguments to combine_lang_model?
>>
>> Now that this step is completed, can I move straight to running 
>> lstmtraining?
>>
>> On Tuesday, September 4, 2018 at 6:25:37 AM UTC+3, shree wrote:
>>>
>>> > Then I tried to create a starter traineddata file 
>>> > using combine_lang_model script. I used the below command for that, 
>>>
>>> When you run tesstrain.sh, it creates the starter traineddata using the 
>>> combine_lang_model script.
>>>
>>> See below for messages from a small test run.
>>>
>>> + /home/ubuntu/tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts 
>>> --lang sin --linedata_only --noextract_font_properties --langdata_dir 
>>> ../langdata_lstm --tessdata_dir ../tessdata_best --fontlist FreeSerif 
>>> --training_text ../langdata_lstm/sin/sin.training_text --workspace_dir 
>>> /home/ubuntu/tmp/ --save_box_tiff --maxpages 1 --output_dir 
>>> ../tesstutorial/sintest
>>>
>>> === Starting training for language 'sin'
>>> [Tue Sep 4 03:21:08 UTC 2018] 
>>> /home/ubuntu/tesseract/src/training/text2image --fonts_dir=../.fonts 
>>> --font=FreeSerif --outputbase=/home/ubuntu/tmp//fc-cache/sample_text.txt 
>>> --text=/home/ubuntu/tmp//fc-cache/sample_text.txt 
>>> --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache
>>> Rendered page 0 to file /home/ubuntu/tmp//fc-cache/sample_text.txt.tif
>>>
>>> === Phase I: Generating training images ===
>>> Rendering using FreeSerif
>>> [Tue Sep 4 03:21:10 UTC 2018] 
>>> /home/ubuntu/tesseract/src/training/text2image 
>>> --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache --fonts_dir=../.fonts 
>>> --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 
>>> --outputbase=/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --max_pages=1 
>>> --font=FreeSerif --text=../langdata_lstm/sin/sin.training_text
>>> Stripped 1 unrenderable words
>>> Rendered page 0 to file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif
>>>
>>> === Phase UP: Generating unicharset and unichar properties files ===
>>> [Tue Sep 4 03:21:11 UTC 2018] 
>>> /home/ubuntu/tesseract/src/training/unicharset_extractor 
>>> --output_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --norm_mode 2 
>>> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
>>> Extracting unicharset from box file 
>>> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
>>> Wrote unicharset file /tmp/sin-2018-09-04.Wa5/sin.unicharset
>>> [Tue Sep 4 03:21:11 UTC 2018] 
>>> /home/ubuntu/tesseract/src/training/set_unicharset_properties -U 
>>> /tmp/sin-2018-09-04.Wa5/sin.unicharset -O 
>>> /tmp/sin-2018-09-04.Wa5/sin.unicharset -X 
>>> /tmp/sin-2018-09-04.Wa5/sin.xheights --script_dir=../langdata_lstm
>>> Loaded unicharset of size 111 from file 
>>> /tmp/sin-2018-09-04.Wa5/sin.unicharset
>>> Setting unichar properties
>>> Setting script properties
>>> Warning: properties incomplete for index 7 = ි
>>> Warning: properties incomplete for index 9 = ු
>>> Warning: properties incomplete for index 17 = ්‌
>>> Warning: properties incomplete for index 19 = ී
>>> Warning: properties incomplete for index 38 = ්‍ර
>>> Warning: properties incomplete for index 66 = ₹
>>> Warning: properties incomplete for index 73 = ූ
>>> Warning: properties incomplete for index 79 = ්‍ය
>>> Warning: properties incomplete for index 89 = ක්‍
>>> Writing unicharset to file /tmp/sin-2018-09-04.Wa5/sin.unicharset
>>>
>>> === Phase E: Generating lstmf files ===
>>> Using TESSDATA_PREFIX=../tessdata_best
>>> [Tue Sep 4 03:21:12 UTC 2018] /home/ubuntu/tesseract/src/api/tesseract 
>>> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif 
>>> /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --psm 6 lstm.train
>>> Tesseract Open Source OCR Engine v4.0.0-beta.4-93-ge4b9c with Leptonica
>>> Page 1
>>>
>>> === Constructing LSTM training data ===
>>> [Tue Sep 4 03:21:13 UTC 2018] 
>>> /home/ubuntu/tesseract/src/training/combine_lang_model --input_unicharset 
>>> /tmp/sin-2018-09-04.Wa5/sin.unicharset --script_dir ../langdata_lstm 
>>> --words ../langdata_lstm/sin/sin.wordlist --numbers 
>>> ../langdata_lstm/sin/sin.numbers --puncs ../langdata_lstm/sin/sin.punc 
>>> --output_dir ../tesstutorial/sintest --lang sin --pass_through_recoder
>>> Loaded unicharset of size 111 from file 
>>> /tmp/sin-2018-09-04.Wa5/sin.unicharset
>>> Setting unichar properties
>>> Setting script properties
>>> Warning: properties incomplete for index 7 = ි
>>> Warning: properties incomplete for index 9 = ු
>>> Warning: properties incomplete for index 17 = ්‌
>>> Warning: properties incomplete for index 19 = ී
>>> Warning: properties incomplete for index 38 = ්‍ර
>>> Warning: properties incomplete for index 66 = ₹
>>> Warning: properties incomplete for index 73 = ූ
>>> Warning: properties incomplete for index 79 = ්‍ය
>>> Warning: properties incomplete for index 89 = ක්‍
>>> Config file is optional, continuing...
>>> Failed to read data from: ../langdata_lstm/sin/sin.config
>>> Reducing Trie to SquishedDawg
>>> Reducing Trie to SquishedDawg
>>> Reducing Trie to SquishedDawg
>>>
>>> === Saving box/tiff pairs for training data ===
>>> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box to 
>>> ../tesstutorial/sintest
>>> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif to 
>>> ../tesstutorial/sintest
>>>
>>> === Moving lstmf files for training data ===
>>> Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.lstmf to 
>>> ../tesstutorial/sintest
>>>
>>> Created starter traineddata for language 'sin'
>>>
>>>
>>> Run lstmtraining to do the LSTM training for language 'sin'
>>>
>>>
>>> real 0m5.238s
>>> user 0m3.792s
>>> sys 0m0.256s
>>>
>>>
>>> On Tue, Sep 4, 2018 at 2:49 AM, Shandigutt <[email protected]> wrote:
>>>
>>>> Adding more details to my query,
>>>>
>>>> *My tesseract  version:*
>>>> tesseract 4.0.0-beta.4-74-gd8237
>>>>  leptonica-1.77.0
>>>>   libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : 
>>>> zlib 1.2.11
>>>>  Found SSE
>>>>
>>>> *My OS details,*
>>>> tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a
>>>> No LSB modules are available.
>>>> Distributor ID: Ubuntu
>>>> Description: Ubuntu 18.04.1 LTS
>>>> Release: 18.04
>>>> Codename: bionic
>>>>
>>>> Thanks
>>>>
>>>> On Tuesday, September 4, 2018 at 12:11:50 AM UTC+3, Shandigutt wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm currently in the process of training Tesseract for a new language, 
>>>>> following the Tesseract wiki training guidelines 
>>>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>>>>
>>>>> Once I had built Tesseract from source and installed it, I first created 
>>>>> my own langdata set. 
>>>>>
>>>>> Then I created training data and eval data using the tesstrain.sh script.
>>>>>
>>>>> Then I tried to create a starter traineddata file 
>>>>> using combine_lang_model script. I used the below command for that,
>>>>>
>>>>> *./build/src/training/combine_lang_model --input_unicharset 
>>>>> ../training/sintrain/sin/sin.unicharset --script_dir ../langdata --words 
>>>>> ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers 
>>>>> ../langdata/sin/sin.numbers --output_dir ../training/combined_sin 
>>>>> --version_str 1.0 --lang sin*
>>>>>
>>>>> When executing the above command I referred to the langdata I created on 
>>>>> my own for the word list, punctuation and numbers. I also referred to the 
>>>>> unicharset file that was generated when creating the training data. But I 
>>>>> got the following error output:
>>>>>
>>>>> *Loaded unicharset of size 90 from file 
>>>>> ../training/sintrain/sin/sin.unicharset*
>>>>> *Setting unichar properties*
>>>>> *Setting script properties*
>>>>> *Warning: properties incomplete for index 4 = ී*
>>>>> *Warning: properties incomplete for index 6 = ි*
>>>>> *Warning: properties incomplete for index 11 = ු*
>>>>> *Warning: properties incomplete for index 15 = ්‌*
>>>>> *Warning: properties incomplete for index 30 = ූ*
>>>>> *Warning: properties incomplete for index 44 = ්‍ර*
>>>>> *Warning: properties incomplete for index 79 = ්‍ය*
>>>>> *Warning: properties incomplete for index 82 = ක්‍*
>>>>> *Warning: properties incomplete for index 89 = ර්‍*
>>>>> *Error writing unicharset!!*
>>>>>
>>>>> Can somebody assist me with this?
>>>>>
>>>>> Thanks
>>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>>
>>> -- 
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>
>
>
>
>

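For anyone reading this thread later: the log quoted above shows tesstrain.sh calling combine_lang_model with --words, --numbers and --puncs pointing into the langdata directory, so the files appear to be picked up by naming convention. The sketch below only rebuilds that logged command line from a language code and a langdata directory. The function name starter_cmd is made up for illustration, it is not taken from the tesstrain.sh source, and --pass_through_recoder is copied from the log as-is.

```shell
#!/bin/sh
# Illustrative sketch: print the combine_lang_model call that the quoted log
# shows tesstrain.sh making. starter_cmd is a hypothetical helper name.
starter_cmd() {
  lang=$1           # language code, e.g. sin
  langdata_dir=$2   # e.g. ../langdata_lstm
  unicharset=$3     # unicharset produced earlier in the run
  output_dir=$4     # where the starter traineddata should be written
  lang_dir=$langdata_dir/$lang

  # Word list, numbers and punctuation files are located by naming
  # convention: <langdata_dir>/<lang>/<lang>.{wordlist,numbers,punc}
  echo combine_lang_model \
    --input_unicharset "$unicharset" \
    --script_dir "$langdata_dir" \
    --words "$lang_dir/$lang.wordlist" \
    --numbers "$lang_dir/$lang.numbers" \
    --puncs "$lang_dir/$lang.punc" \
    --output_dir "$output_dir" --lang "$lang" --pass_through_recoder
}
```

Calling starter_cmd sin ../langdata_lstm /tmp/sin-2018-09-04.Wa5/sin.unicharset ../tesstutorial/sintest prints the same command that appears in the log, which makes it easy to compare against a manual invocation.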

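The check Shree suggests above (unpack the starter traineddata with combine_tessdata, then dump the word dawg with dawg2wordlist) could look roughly like the sketch below. The paths, the inspect_starter name and the DRY_RUN_PREFIX switch are assumptions for illustration. The component names lstm-unicharset and lstm-word-dawg are what combine_tessdata -u is expected to write for an LSTM traineddata, so compare them with the "Wrote ..." lines it actually prints.

```shell
#!/bin/sh
# Illustrative sketch of the suggested check. By default each tool call is
# only printed; set DRY_RUN_PREFIX to an empty string to really run the tools.
DRY_RUN_PREFIX=${DRY_RUN_PREFIX-echo}

# inspect_starter is a hypothetical helper name.
inspect_starter() {
  traineddata=$1   # starter traineddata, e.g. <output_dir>/sin/sin.traineddata
  prefix=$2        # output path prefix, e.g. /tmp/sin_unpacked/sin.

  # Unpack every component of the traineddata next to the given prefix.
  $DRY_RUN_PREFIX combine_tessdata -u "$traineddata" "$prefix"

  # Convert the unpacked word dawg back to a plain word list so it can be
  # compared with the original wordlist file.
  $DRY_RUN_PREFIX dawg2wordlist "${prefix}lstm-unicharset" \
    "${prefix}lstm-word-dawg" "${prefix}wordlist.txt"
}
```

After a real run, comparing the regenerated word list with langdata/sin/sin.wordlist shows whether the intended word list ended up in the starter traineddata. The same dawg2wordlist call can be repeated for the punctuation and number dawgs.
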