It looks like your langdata dir does not have the script unicharset files for the Sinhala and Latin scripts:
Failed to load script unicharset from:../training/Latin.unicharset
Failed to load script unicharset from:../training/Sinhala.unicharset

On Sun, 30 Sep 2018, 18:27 Shandigutt, <[email protected]> wrote:

> Hi,
>
> I attempted to create training data using the below command,
>
> ./src/training/tesstrain.sh --fonts_dir ../Support/font --lang sin --linedata_only \
>   --noextract_font_properties --langdata_dir ../training \
>   --tessdata_dir ../tessdata_best --output_dir ../training/sintrain \
>   --fontlist "BhashitaComplex" --training_text ../training/sin/sin.training_text
>
> I could capture only a part of the log output. Highlights are extracted
> below,
>
> Word started with a combiner:0xddc
> Normalization failed for string 'ො'
> Word started with a combiner:0xdca
> Word started with a combiner:0x200d
> Normalization failed for string '්ය'
> Word started with a combiner:0xdcf
> Normalization failed for string 'ා'
>
> Wrote unicharset file /tmp/sin-2018-09-29.aN0/sin.unicharset
> [Sat Sep 29 21:33:19 UTC 2018] /usr/local/bin/set_unicharset_properties -U /tmp/sin-2018-09-29.aN0/sin.unicharset -O /tmp/sin-2018-09-29.aN0/sin.unicharset -X /tmp/sin-2018-09-29.aN0/sin.xheights --script_dir=../training
> Loaded unicharset of size 114 from file /tmp/sin-2018-09-29.aN0/sin.unicharset
> Setting unichar properties
> Setting script properties
> Failed to load script unicharset from:../training/Latin.unicharset
> Failed to load script unicharset from:../training/Sinhala.unicharset
> Warning: properties incomplete for index 3 = ස
> Warning: properties incomplete for index 4 = ී
> Warning: properties incomplete for index 5 = ග
>
> === Constructing LSTM training data ===
> Creating new directory ../training/sintrain
> [Sun Sep 30 05:32:18 UTC 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/sin-2018-09-29.aN0/sin.unicharset --script_dir ../training --words ../training/sin/sin.wordlist --numbers ../training/sin/sin.numbers --puncs ../training/sin/sin.punc --output_dir ../training/sintrain --lang sin --pass_through_recoder
> Loaded unicharset of size 114 from file /tmp/sin-2018-09-29.aN0/sin.unicharset
> Setting unichar properties
> Setting script properties
> Failed to load script unicharset from:../training/Latin.unicharset
> Failed to load script unicharset from:../training/Sinhala.unicharset
> Warning: properties incomplete for index 3 = ස
> Warning: properties incomplete for index 4 = ී
> Warning: properties incomplete for index 5 = ග
>
> Warning: properties incomplete for index 112 = ෴
> Warning: properties incomplete for index 113 = ෲ
> Config file is optional, continuing...
> Failed to read data from: ../training/sin/sin.config
> Failed to read data from: ../training/radical-stroke.txt
> Error reading radical code table ../training/radical-stroke.txt
>
> === Moving lstmf files for training data ===
> Moving /tmp/sin-2018-09-29.aN0/sin.BhashitaComplex.exp0.lstmf to ../training/sintrain
>
> Created starter traineddata for language 'sin'
>
> Run lstmtraining to do the LSTM training for language 'sin'
>
> For the full capture of the log please find the attached file
>
> Tesseract version I use,
>
> tesseract --version
> tesseract 4.0.0-beta.4-158-g02f9d
>  leptonica-1.77.0
>   libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
>  Found AVX512BW
>  Found AVX512F
>  Found AVX2
>  Found AVX
>  Found SSE
>
> OS details,
>
> Linux ip-172-31-13-179 4.15.0-1021-aws #21-Ubuntu SMP Tue Aug 28 10:23:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> Please let me know what has gone wrong.
>
> Thanks
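A quick way to confirm the diagnosis at the top of this reply is to check the langdata dir for the files the log complains about. The sketch below is only illustrative: `check_langdata` is a hypothetical helper name, `LANGDATA_DIR` is an assumed environment variable that defaults to the `../training` path from the quoted command, and the three file names are taken from the "Failed to load" and "Failed to read" lines in the log.

```shell
#!/bin/sh
# Report which of the named files are not readable in a langdata dir.
# Usage: check_langdata DIR FILE...
# Returns 0 if every file is present, 1 if at least one is missing.
check_langdata() {
    dir=$1
    shift
    missing=0
    for name in "$@"; do
        if [ ! -r "$dir/$name" ]; then
            echo "missing: $dir/$name"
            missing=1
        fi
    done
    return "$missing"
}

# ../training is the --langdata_dir value from the quoted tesstrain.sh command.
LANGDATA_DIR=${LANGDATA_DIR:-../training}

# File names are the ones reported as unreadable in the quoted log.
check_langdata "$LANGDATA_DIR" Latin.unicharset Sinhala.unicharset radical-stroke.txt \
    && echo "all script files present in $LANGDATA_DIR"
```

If any file is reported missing, copying it into the langdata dir and rerunning tesstrain.sh should make the corresponding messages go away. As far as I know these files sit at the top level of the tesseract-ocr langdata repository, but please verify that against your checkout.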

