Re: [tesseract-ocr] Re: Change unicharset

2018-04-12 Thread ShreeDevi Kumar
1. concatenate the two training texts

cat ./langdata/kor/kor.training_text \
    ./langdata/chi_tra/chi_tra.training_text \
    > ./langdata/kor/kor-chi_tra.training_text
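
One thing to watch with a plain cat: if the first file does not end with a newline, its last line gets fused with the first line of the second file. A minimal sketch of a safer concatenation follows; concat_texts is a hypothetical helper name, not part of Tesseract or langdata.

```shell
# Hypothetical helper (not part of Tesseract): concatenate text files into
# the first argument, making sure every input ends with a newline so that
# lines from different files are never fused together.
concat_texts() {
    out=$1
    shift
    : > "$out"    # create or truncate the output file
    for f in "$@"; do
        [ -r "$f" ] || { echo "cannot read: $f" >&2; return 1; }
        cat "$f" >> "$out"
        # $(...) strips a trailing newline, so a non-empty result means the
        # last byte of the file is something other than a newline.
        if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
            echo >> "$out"
        fi
    done
}
```

With the paths from step 1 this would be called as: concat_texts ./langdata/kor/kor-chi_tra.training_text ./langdata/kor/kor.training_text ./langdata/chi_tra/chi_tra.training_text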


2. run tesstrain.sh as below (update the paths for your setup; as a test,
use just one font that supports both languages)

$tesstrain_dir/tesstrain.sh \
   --lang kor \
   --linedata_only \
   --noextract_font_properties \
   --exposures "0" \
   --fonts_dir /usr/share/fonts/ \
   --fontlist "Arial" \
   --langdata_dir ./langdata \
   --tessdata_dir  ./tessdata_best \
   --training_text  ./langdata/kor/kor-chi_tra.training_text \
   --output_dir $train_output_dir

3.  Check the unicharset in the generated starter traineddata

 $train_output_dir/kor/kor.unicharset

This should have unichars from both languages.
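
A quick way to do this check from the shell is to look for one known character from each language. This is only a sketch: has_unichar is a hypothetical helper, and it assumes each unicharset entry is a line whose first whitespace-separated field is the unichar itself.

```shell
# Hypothetical helper (not part of Tesseract): succeed if the unicharset
# file in $1 has an entry whose first field is exactly the character in $2.
# Assumption: one entry per line, with the unichar as the first field.
has_unichar() {
    # Appending "" forces a string comparison even for digit unichars.
    awk -v c="$2" '($1 "") == (c "") { found = 1 } END { exit !found }' "$1"
}
```

For example, has_unichar "$train_output_dir/kor/kor.unicharset" 한 and has_unichar "$train_output_dir/kor/kor.unicharset" 漢 should both succeed if the combined training text was picked up (the two sample characters are only illustrations; use ones that occur in your training text).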

4. concatenate the two wordlists

cat ./langdata/kor/kor.wordlist \
    ./langdata/chi_tra/chi_tra.wordlist \
    > ./langdata/kor/kor-chi_tra.wordlist

5. extract the LSTM model from the existing traineddata

combine_tessdata -e ./tessdata_best/kor.traineddata \
    $train_output_dir/kor.lstm

etc



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 13, 2018 at 10:35 AM, Fanatico  wrote:

> And if I look at the "kor.unicharset" created after executing
> "training/tesstrain.sh", it only contains the Korean characters, even after
> I changed "kor.lstm-unicharset" from the "kor.traineddata"

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVcitJKxdTCZ9c%2BmCCuM4ua2rNwAVnAREoWwYkMx9MNFQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: Change unicharset

2018-04-12 Thread Fanatico
And if I look at the "kor.unicharset" created after executing
"training/tesstrain.sh", it only contains the Korean characters, even after
I changed "kor.lstm-unicharset" from the "kor.traineddata"
