Re: [tesseract-ocr] Documentation related to lang data

Shree Devi Kumar Sat, 15 Sep 2018 11:29:50 -0700

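# For each line of syllables_text, keep up to 7 training_text lines containing it.
while IFS= read -r target; do
  grep -F -m 7 -- "$target" ~/langdata_lstm/sin/sin.training_text
done < ~/sin.syllables_text > tmp.txt
# Remove duplicate lines.
sort -u tmp.txt > ~/sin.sample7.training_text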



The above will create a training text with up to 7 samples for each line in
syllables_text (grep -m 7 stops after 7 matching lines, so a syllable with
fewer matches gets fewer samples). Change -m 7 to -m 1 to create a file with
just one sample of each. sort -u removes duplicate lines.
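
To see how many samples each syllable can actually get, the matching lines
can be counted first. This is only a minimal sketch: the function name
count_samples is made up for illustration, and it assumes the same
one-syllable-per-line file layout as above.

```shell
# count_samples SYLLABLES_FILE TRAINING_TEXT
# Prints "<count><TAB><syllable>" for every non-empty line of SYLLABLES_FILE.
# (count_samples is a hypothetical helper name, not part of tesseract.)
count_samples() {
  # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
  while IFS= read -r target || [ -n "$target" ]; do
    # Skip blank lines: an empty pattern would match every line.
    [ -n "$target" ] || continue
    # -F: fixed string, -c: print the number of matching lines (0 if none).
    n=$(grep -F -c -- "$target" "$2" || true)
    printf '%s\t%s\n' "$n" "$target"
  done < "$1"
}
```

For example, piping the output of
count_samples ~/sin.syllables_text ~/langdata_lstm/sin/sin.training_text
through sort -n puts the syllables with the fewest samples first. Any
syllable with a count below 7 will have fewer than 7 lines in the sampled
file.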

This can be used to create a smaller training_text useful for finetuning.
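
Before fine tuning with the smaller file, it may be worth checking that no
syllable is missing from the training text altogether, as suggested in the
quoted message below. A rough sketch, with missing_syllables being a
made-up helper name and the two file arguments assumed to be the same
files used above:

```shell
# missing_syllables SYLLABLES_FILE TRAINING_TEXT
# Prints every non-empty line of SYLLABLES_FILE that never occurs in TRAINING_TEXT.
# (missing_syllables is a hypothetical helper name, not part of tesseract.)
missing_syllables() {
  # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
  while IFS= read -r target || [ -n "$target" ]; do
    # Skip blank lines: an empty pattern would match every line.
    [ -n "$target" ] || continue
    # -F: fixed string, -q: quiet, only the exit status is used.
    grep -F -q -- "$target" "$2" || printf '%s\n' "$target"
  done < "$1"
}
```

If missing_syllables ~/sin.syllables_text ~/langdata_lstm/sin/sin.training_text
prints nothing, every syllable has at least one sample. Any syllable it does
print would need extra lines added to the training text.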

On Sat, Sep 15, 2018 at 9:23 PM, Shree Devi Kumar <[email protected]>
wrote:

> *desired_characters*
>
> This is used by Google internally when creating the training text.
>
> Should I enter all those compound character combinations to this file?
>
> No, since this is not used by tesstrain.sh - at least in the open source
> version in Github.
>
> *okfonts.txt*
>
> This lists the Unicode fonts used for the LSTM training.
>
> Can I include non Unicode fonts into this file?
>
> NO. Because the rendered text will be incorrect.
>
> *sin.numbers*
>
> This file includes all the number characters used in Sinhala.
>
> Unless something has changed in Google's internal training method, this
> should NOT contain the number characters. Rather, it should contain patterns
> showing how numbers may be formatted in this language. It may help to look
> at the eng.numbers file for reference.
>
> *sin.punc*
> In lang data this contains punctuation combinations.
>
> Similar to the numbers file, this should have patterns of punctuation
> characters used in the language. Again, refer to eng.punc for reference.
>
>
> *sin.singles_text*
>
> Similar file to wordlist. Contains unique words followed by a new line
>
> In Devanagari it also has unique/rare syllables (compound character
> combinations). Without having the scripts used by Ray (Google) for
> training, it is difficult to say how this is used. I am guessing that these
> are used in addition to training_text to build the unicharset.
>
> *sin.training_text*
>
> The training_text in langdata_lstm seems to be random words, numbers and
> phrases (judging by English and Devanagari). So this may be based on word
> frequencies in the language. While Ray's notes on training say to use text
> that is representative of the language or text to be recognized, the
> training_text does not seem to be full sentences. It's possible that this
> kind of training_text gives better results with LSTM for recognizing
> text/words not seen before. I do not really know.
>
> *sin.unicharset*
>
> This file will be created when creating training data
>
> Yes. Please look at the sin.lstm-unicharset in the sin.traineddata files to
> confirm that all required characters are there.
>
> *sin.wordlist*
>
> Contains unique words followed by a new line
>
> This dictionary as well as punc and numbers are used to create dawg files
> which are stored in traineddata files and provide some improvement in
> recognition.
>
> -------------------------
>
> What you could do is create a file with all valid characters and syllables
> (compound character combinations) for Sinhala. Then use this file as input
> and grep the sin.training_text in langdata_lstm to make sure that all combos
> are included in your training text for fine tuning.
>
>
>
> On Sat, Sep 15, 2018 at 7:43 PM, Shandigutt <[email protected]> wrote:
>
>> Hi,
>>
>> I downloaded the latest LSTM langdata from the tesseract repository. I
>> found that it contains a lot of incorrect data for Sinhala. I'm trying to
>> train tesseract for Sinhala. According to the tesseract wiki guidelines, we
>> need to create lang data before creating training data using the
>> tesstrain.sh script. I'm referring to the wiki guidelines below,
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>
>> I couldn't find proper wiki guidelines on creating lang data. When I
>> inspected the 'sin' folder in langdata-lstm I found the below files,
>>
>>
>>    - desired_characters
>>    - okfonts.txt
>>    - sin.numbers
>>    - sin.punc
>>    - sin.singles_text
>>    - sin.training_text
>>    - sin.unicharset
>>    - sin.wordlist
>>
>>
>> Please let me know if there is proper documentation that I can follow if
>> I create these files on my own from scratch. Based on my observations, I
>> have the following understanding of these files. If there is no proper
>> documentation for them, please correct me if I mention anything wrong
>> here,
>>
>> *desired_characters*
>>
>> This file contains all the unique characters found in the language, each
>> character followed by a new line. My question is this: Sinhala has many
>> vowel characters that combine with Sinhala consonants to form compound
>> characters. Unlike English, once a vowel character is attached to a
>> consonant it creates, most of the time, a single compound character that I
>> can erase with a single keyboard backspace. Please refer to the examples
>> below,
>>
>> Example 1:
>>
>> Consonant : ද
>>
>> Vowel character : ො
>>
>> Compound character : දො
>>
>> Example 2:
>>
>> Consonant : බ
>>
>> Vowel character : ්
>>
>> Compound character : බ්
>>
>> So each consonant combined with the different vowel characters makes a lot
>> of compound characters. Should I enter all those compound character
>> combinations into this file?
>>
>>
>> *okfonts.txt*
>>
>> This file lists the fonts I use in my training_text. The format is a font
>> name followed by a new line. Can I include non-Unicode fonts in this file?
>>
>> *sin.numbers*
>>
>> This file includes all the number characters used in Sinhala, each number
>> character followed by a new line. Normally this contains only 10 characters.
>>
>> *sin.punc*
>>
>> This file contains all the punctuation characters that can be used
>> in Sinhala text. The format is a punctuation character followed by a new
>> line. In lang data this contains punctuation combinations. Please explain why?
>>
>> *sin.singles_text*
>>
>> Similar file to wordlist. Contains unique words followed by a new line
>>
>> *sin.training_text*
>>
>> Training text to be used when creating training data. Should contain
>> around 40000 text lines. Each line can have any amount of characters. It’s
>> better if this document contains text in multiple fonts that we have
>> defined in okfonts.txt. (These fonts can be passed as a command line
>> argument as well)
>>
>> *sin.unicharset*
>>
>> This file will be created when creating training data
>>
>> *sin.wordlist*
>>
>> Contains unique words followed by a new line
>>
>> Appreciate your response on this.
>>
>> Thanks
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/9e8be0bf-b0d5-4408-98b7-283913ccf642%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> --
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com up
>



