Re: [tesseract-ocr] Documentation related to lang data

Shandigutt Sat, 15 Sep 2018 14:27:51 -0700

Thank you very much for your kind help 

On Saturday, September 15, 2018 at 9:29:38 PM UTC+3, shree wrote:
>
>
> # For each line of syllables_text, take up to 7 matching training lines.
> while IFS= read -r target; do
>   grep -F -m 7 -- "$target" ~/langdata_lstm/sin/sin.training_text
> done < ~/sin.syllables_text > tmp.txt
> # Drop duplicate lines.
> sort -u tmp.txt > ~/sin.sample7.training_text
>
>
> The above will create a training text with up to 7 samples (matching lines) 
> for each line in syllables_text. Change -m 7 to -m 1 to create a file with 
> just one sample of each. sort -u removes duplicate lines.
>
> This can be used to create a smaller training_text useful for finetuning.
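
If the sample count needs to vary, the same pipeline could be wrapped in a small function. This is only a sketch: the name sample_training_text and its argument order are invented for illustration and are not part of Tesseract. It also skips empty lines in the syllables file, because an empty pattern given to grep -F matches every line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Tesseract): keep at most N matching
# training lines per syllable, then drop duplicate lines.
# usage: sample_training_text N SYLLABLES_FILE TRAINING_TEXT OUTPUT
sample_training_text() {
  local n="$1" syllables="$2" training="$3" out="$4"
  # "|| [ -n ... ]" also handles a last line without a trailing newline.
  while IFS= read -r target || [ -n "$target" ]; do
    [ -n "$target" ] || continue      # an empty pattern would match everything
    grep -F -m "$n" -- "$target" "$training"
  done < "$syllables" | sort -u > "$out"
}
```

With the paths used above, a call might look like this:

    sample_training_text 7 ~/sin.syllables_text ~/langdata_lstm/sin/sin.training_text ~/sin.sample7.training_text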
>
> On Sat, Sep 15, 2018 at 9:23 PM, Shree Devi Kumar wrote:
>
>> *desired_characters* 
>>
>> This is used by Google internally when creating the training text.
>>
>> Should I enter all those compound character combinations to this file? 
>>
>> No, since this is not used by tesstrain.sh, at least not in the open source 
>> version on GitHub.
>>
>> *okfonts.txt* 
>>
>> This lists the Unicode fonts used for the LSTM training.
>>
>> Can I include non Unicode fonts into this file? 
>>
>> NO. Because the rendered text will be incorrect.
>>
>> *sin.numbers*
>>
>> This file includes all the number characters used in Sinhala.
>>
>> Unless something has changed in Google's internal training method, this 
>> should NOT contain the number characters themselves. Rather, it should 
>> contain patterns of how numbers may be formatted when used in this 
>> language. It may help to look at the eng.numbers file for reference.
>>
>> *sin.punc* 
>> In lang data this contains punctuation combinations. 
>>
>> Similar to the numbers file, this should contain patterns of punctuation 
>> characters used in the language. Again, see eng.punc for reference.
>>
>>
>> *sin.singles_text*
>>
>> Similar file to wordlist. Contains unique words followed by a new line
>>
>> In Devanagari it also has unique/rare syllables (compound character 
>> combinations). Without access to the scripts used by Ray (Google) for 
>> training, it is difficult to say how this is used. I am guessing that these 
>> are used in addition to training_text to build the unicharset.
>>
>> *sin.training_text* 
>>
>> The training_text in langdata_lstm seems to be random words, numbers and 
>> phrases (judging by English and Devanagari). So this may be based on word 
>> frequencies in the language. While Ray's notes on training say to use text 
>> that is representative of the language or text to be recognized, the 
>> training_text does not seem to consist of full sentences. It's possible 
>> that this kind of training_text gives better results with LSTM for 
>> recognizing text/words not seen before. I do not really know.
>>
>> *sin.unicharset*
>>
>> This file will be created when creating training data
>>
>> Yes. Please inspect the sin.lstm-unicharset inside the sin.traineddata 
>> file to confirm that all required characters are there.
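
One way to do that check, as a rough sketch only. It assumes the unicharset has already been unpacked from the traineddata file (for example with combine_tessdata -u, which writes the components out as separate files), and that the unicharset has a count on its first line followed by one entry per line with the character(s) in the first space-separated field. The function name missing_unichars and the idea of a one-character-per-line list file are made up for illustration. Whether compound characters appear as single entries depends on how the model was trained, so compare against whatever units the unicharset actually stores.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Tesseract): print each line of CHARS_FILE
# that has no entry in UNICHARSET_FILE.
# Assumed setup, run beforehand:  combine_tessdata -u sin.traineddata sin.
# which should leave a sin.lstm-unicharset file next to the other components.
# usage: missing_unichars CHARS_FILE UNICHARSET_FILE
missing_unichars() {
  local chars="$1" unicharset="$2"
  awk '
    NR == FNR { if (FNR > 1) seen[$1] = 1; next }  # file 1: skip count line, record field 1
    NF && !($0 in seen)                            # file 2: print non-empty lines not recorded
  ' "$unicharset" "$chars"
}
```

An empty result would mean every listed character has an entry; anything printed is worth a closer look before fine tuning.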
>>
>> *sin.wordlist*
>>
>> Contains unique words followed by a new line
>>
>> This dictionary, together with the punc and numbers files, is used to 
>> create the dawg files that are stored in the traineddata file and provide 
>> some improvement in recognition.
>>
>> -------------------------
>>
>> What you could do is create a file with all valid characters and 
>> syllables (compound character combinations) for Sinhala. Then use this 
>> file as input and grep the sin.training_text in langdata_lstm to make sure 
>> that all combos are included in your training text for fine tuning.
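
The check described above could be sketched roughly as follows. The function name missing_syllables is hypothetical, and the syllables file is assumed to hold one character or syllable per line, as in the sin.syllables_text file used in the commands at the top of this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Tesseract): print every line of
# SYLLABLES_FILE that never occurs, as a fixed string, in TRAINING_TEXT.
# usage: missing_syllables SYLLABLES_FILE TRAINING_TEXT
missing_syllables() {
  local syllables="$1" training="$2"
  while IFS= read -r target || [ -n "$target" ]; do
    [ -n "$target" ] || continue      # skip empty lines
    # -q: only the exit status matters; print the syllable when grep finds nothing.
    grep -F -q -- "$target" "$training" || printf '%s\n' "$target"
  done < "$syllables"
}
```

Anything it prints is a combination with no sample at all in the training text, so lines containing it would have to be added by hand.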
>>
>>
>>
>> On Sat, Sep 15, 2018 at 7:43 PM, Shandigutt wrote:
>>
>>> Hi,
>>>
>>> I downloaded the latest LSTM langdata from the Tesseract repository and 
>>> found that it contains a lot of incorrect data for Sinhala. I'm trying to 
>>> train Tesseract for Sinhala. According to the Tesseract wiki guidelines, 
>>> we need to create the lang data before creating training data with the 
>>> tesstrain.sh script. I'm referring to the wiki page below:
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>
>>> I couldn't find proper wiki guidelines on creating lang data. When I 
>>> inspected the 'sin' folder in langdata_lstm, I found the following files:
>>>
>>>
>>>    - desired_characters
>>>    - okfonts.txt
>>>    - sin.numbers
>>>    - sin.punc
>>>    - sin.singles_text
>>>    - sin.training_text
>>>    - sin.unicharset
>>>    - sin.wordlist
>>>    
>>>
>>> Please let me know if there is proper documentation that I can follow 
>>> to create these files on my own from scratch. From my own observations, 
>>> I have the following understanding of these files. If there is no such 
>>> documentation, please correct anything I have got wrong here.
>>>
>>> *desired_characters*
>>>
>>> This file contains all the unique characters found in the language, each 
>>> character followed by a new line. My question: the Sinhala language has 
>>> many vowel characters that form compound characters with Sinhala 
>>> consonants. Unlike in English, once a vowel character is attached to a 
>>> consonant, it usually forms a single compound character, which I can 
>>> erase with a single keyboard backspace. Please refer to the examples below.
>>>
>>> Example 1:
>>>
>>> Consonant : ද
>>>
>>> Vowel character : ො
>>>
>>> Compound character : දො
>>>
>>> Example 2:
>>>
>>> Consonant : බ
>>>
>>> Vowel character : ්
>>>
>>> Compound character : බ්
>>>
>>> So each consonant combined with the different vowel characters produces 
>>> a lot of compound characters. Should I enter all those compound character 
>>> combinations in this file?
>>>
>>>
>>> *okfonts.txt*
>>>
>>> This file includes the fonts I use in my training_text. The format is one 
>>> font name per line. Can I include non-Unicode fonts in this file?
>>>
>>> *sin.numbers*
>>>
>>> This file includes all the number characters used in Sinhala, each number 
>>> character followed by a new line. Normally this contains only 10 characters.
>>>
>>> *sin.punc*
>>>
>>> This file contains all the punctuation characters that can be used 
>>> in Sinhala text. The format is one punctuation character per line. In 
>>> lang data this contains punctuation combinations. Please explain why.
>>>
>>> *sin.singles_text*
>>>
>>> Similar file to wordlist. Contains unique words followed by a new line
>>>
>>> *sin.training_text*
>>>
>>> Training text to be used when creating training data. It should contain 
>>> around 40000 lines of text. Each line can have any number of characters. 
>>> It's better if this text is rendered in the multiple fonts that we have 
>>> defined in okfonts.txt. (These fonts can be passed as a command line 
>>> argument as well.)
>>>
>>> *sin.unicharset*
>>>
>>> This file will be created when creating training data
>>>
>>> *sin.wordlist*
>>>
>>> Contains unique words followed by a new line
>>>
>>> Appreciate your response on this.
>>>
>>> Thanks
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/9e8be0bf-b0d5-4408-98b7-283913ccf642%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>>
>> -- 
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>
>
>


