> Then I tried to create a starter traineddata file
> using the combine_lang_model script. I used the command below for that,

When you run tesstrain.sh, it creates the starter traineddata for you by
calling the combine_lang_model script.
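
If you would rather call combine_lang_model yourself, the sketch below
mirrors the flags that tesstrain.sh passed in the test run logged in this
message. The function name build_starter, the DRY_RUN variable and the
example paths are hypothetical. Creating the output directory beforehand is
only a guess at one thing worth ruling out when you see "Error writing
unicharset!!"; nothing in this thread confirms that as the cause.

```shell
#!/bin/sh
# Sketch: build a starter traineddata with the same flags that tesstrain.sh
# used in the logged run. Function name and paths are hypothetical.
build_starter() {
    lang=$1         # e.g. sin
    unicharset=$2   # unicharset written while generating training data
    langdata=$3     # e.g. ../langdata_lstm
    outdir=$4       # e.g. ../tesstutorial/sintest

    # Assumption: make sure the output directory exists before
    # combine_lang_model tries to write into it.
    mkdir -p "$outdir" || return 1

    # With DRY_RUN set to a non-empty value the command is printed,
    # not executed.
    ${DRY_RUN:+echo} combine_lang_model \
        --input_unicharset "$unicharset" \
        --script_dir "$langdata" \
        --words "$langdata/$lang/$lang.wordlist" \
        --numbers "$langdata/$lang/$lang.numbers" \
        --puncs "$langdata/$lang/$lang.punc" \
        --output_dir "$outdir" \
        --lang "$lang" \
        --pass_through_recoder
}
```

Setting DRY_RUN=1 before calling build_starter prints the full command
line, so you can compare it with the one in the log before running it for
real.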

See below for messages from a small test run.

+ /home/ubuntu/tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts
--lang sin --linedata_only --noextract_font_properties --langdata_dir
../langdata_lstm --tessdata_dir ../tessdata_best --fontlist FreeSerif
--training_text ../langdata_lstm/sin/sin.training_text --workspace_dir
/home/ubuntu/tmp/ --save_box_tiff --maxpages 1 --output_dir
../tesstutorial/sintest

=== Starting training for language 'sin'
[Tue Sep 4 03:21:08 UTC 2018]
/home/ubuntu/tesseract/src/training/text2image --fonts_dir=../.fonts
--font=FreeSerif --outputbase=/home/ubuntu/tmp//fc-cache/sample_text.txt
--text=/home/ubuntu/tmp//fc-cache/sample_text.txt
--fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache
Rendered page 0 to file /home/ubuntu/tmp//fc-cache/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using FreeSerif
[Tue Sep 4 03:21:10 UTC 2018]
/home/ubuntu/tesseract/src/training/text2image
--fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache --fonts_dir=../.fonts
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0
--outputbase=/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --max_pages=1
--font=FreeSerif --text=../langdata_lstm/sin/sin.training_text
Stripped 1 unrenderable words
Rendered page 0 to file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Tue Sep 4 03:21:11 UTC 2018]
/home/ubuntu/tesseract/src/training/unicharset_extractor
--output_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --norm_mode 2
/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
Extracting unicharset from box file
/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
Wrote unicharset file /tmp/sin-2018-09-04.Wa5/sin.unicharset
[Tue Sep 4 03:21:11 UTC 2018]
/home/ubuntu/tesseract/src/training/set_unicharset_properties -U
/tmp/sin-2018-09-04.Wa5/sin.unicharset -O
/tmp/sin-2018-09-04.Wa5/sin.unicharset -X
/tmp/sin-2018-09-04.Wa5/sin.xheights --script_dir=../langdata_lstm
Loaded unicharset of size 111 from file
/tmp/sin-2018-09-04.Wa5/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 7 = ි
Warning: properties incomplete for index 9 = ු
Warning: properties incomplete for index 17 = ්‌
Warning: properties incomplete for index 19 = ී
Warning: properties incomplete for index 38 = ්‍ර
Warning: properties incomplete for index 66 = ₹
Warning: properties incomplete for index 73 = ූ
Warning: properties incomplete for index 79 = ්‍ය
Warning: properties incomplete for index 89 = ක්‍
Writing unicharset to file /tmp/sin-2018-09-04.Wa5/sin.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=../tessdata_best
[Tue Sep 4 03:21:12 UTC 2018] /home/ubuntu/tesseract/src/api/tesseract
/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif
/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.4-93-ge4b9c with Leptonica
Page 1

=== Constructing LSTM training data ===
[Tue Sep 4 03:21:13 UTC 2018]
/home/ubuntu/tesseract/src/training/combine_lang_model --input_unicharset
/tmp/sin-2018-09-04.Wa5/sin.unicharset --script_dir ../langdata_lstm
--words ../langdata_lstm/sin/sin.wordlist --numbers
../langdata_lstm/sin/sin.numbers --puncs ../langdata_lstm/sin/sin.punc
--output_dir ../tesstutorial/sintest --lang sin --pass_through_recoder
Loaded unicharset of size 111 from file
/tmp/sin-2018-09-04.Wa5/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 7 = ි
Warning: properties incomplete for index 9 = ු
Warning: properties incomplete for index 17 = ්‌
Warning: properties incomplete for index 19 = ී
Warning: properties incomplete for index 38 = ්‍ර
Warning: properties incomplete for index 66 = ₹
Warning: properties incomplete for index 73 = ූ
Warning: properties incomplete for index 79 = ්‍ය
Warning: properties incomplete for index 89 = ක්‍
Config file is optional, continuing...
Failed to read data from: ../langdata_lstm/sin/sin.config
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg

=== Saving box/tiff pairs for training data ===
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box to
../tesstutorial/sintest
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif to
../tesstutorial/sintest

=== Moving lstmf files for training data ===
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.lstmf to
../tesstutorial/sintest

Created starter traineddata for language 'sin'


Run lstmtraining to do the LSTM training for language 'sin'


real 0m5.238s
user 0m3.792s
sys 0m0.256s
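
To confirm that a run like the one above left a starter traineddata behind,
a quick check along these lines may help. It assumes the file is written to
<output_dir>/<lang>/<lang>.traineddata, which is the layout I would expect
but which the log itself does not show. The function name check_starter is
hypothetical.

```shell
#!/bin/sh
# Sketch: report whether the starter traineddata exists where expected.
# Assumption: it is written to <output_dir>/<lang>/<lang>.traineddata.
check_starter() {
    outdir=$1   # e.g. ../tesstutorial/sintest
    lang=$2     # e.g. sin
    file="$outdir/$lang/$lang.traineddata"

    # -s is true only for a file that exists and is not empty.
    if [ -s "$file" ]; then
        echo "found: $file"
    else
        echo "missing: $file"
        return 1
    fi
}
```

For the run above that would be "check_starter ../tesstutorial/sintest sin".
If the file is found, "combine_tessdata -d" on it should list the components
it contains.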


On Tue, Sep 4, 2018 at 2:49 AM, Shandigutt <[email protected]> wrote:

> Adding more details to my query,
>
> *My tesseract version:*
> tesseract 4.0.0-beta.4-74-gd8237
>  leptonica-1.77.0
>   libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib
> 1.2.11
>  Found SSE
>
> *My OS details,*
> tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description: Ubuntu 18.04.1 LTS
> Release: 18.04
> Codename: bionic
>
> Thanks
>
> On Tuesday, September 4, 2018 at 12:11:50 AM UTC+3, Shandigutt wrote:
>>
>> Hi,
>>
>> I'm currently in the process of training Tesseract for a new language. I'm
>> following the Tesseract wiki training guidelines
>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>
>> Once I had built Tesseract from source and installed it, I first created
>> my own langdata set.
>>
>> Then I created training data and eval data using the tesstrain.sh script.
>>
>> Then I tried to create a starter traineddata file
>> using the combine_lang_model script. I used the command below for that,
>>
>> *./build/src/training/combine_lang_model --input_unicharset
>> ../training/sintrain/sin/sin.unicharset --script_dir ../langdata --words
>> ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers
>> ../langdata/sin/sin.numbers --output_dir ../training/combined_sin
>> --version_str 1.0 --lang sin*
>>
>> When executing the above command I referred to the langdata I created on
>> my own for the word list, punctuation and numbers. I also referred to the
>> unicharset file that was created when creating the training data. But I
>> got the following error output,
>>
>> *Loaded unicharset of size 90 from file
>> ../training/sintrain/sin/sin.unicharset*
>> *Setting unichar properties*
>> *Setting script properties*
>> *Warning: properties incomplete for index 4 = ී*
>> *Warning: properties incomplete for index 6 = ි*
>> *Warning: properties incomplete for index 11 = ු*
>> *Warning: properties incomplete for index 15 = ්‌*
>> *Warning: properties incomplete for index 30 = ූ*
>> *Warning: properties incomplete for index 44 = ්‍ර*
>> *Warning: properties incomplete for index 79 = ්‍ය*
>> *Warning: properties incomplete for index 82 = ක්‍*
>> *Warning: properties incomplete for index 89 = ර්‍*
>> *Error writing unicharset!!*
>>
>> Can somebody assist me with this?
>>
>> Thanks
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%
> 40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>



-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
