Re: [tesseract-ocr] Documentation related to lang data

Pubudu Tharaka Viswakula Sat, 15 Sep 2018 10:15:28 -0700

Thank you very much Shree

On Sat, 15 Sep 2018, 20:13 Shree Devi Kumar, <[email protected]> wrote:


> >Are they created using the same files we're talking about, i.e. sin.numbers,
> sin.punc and sin.wordlist?
>
> Yes, the dawg files are created from these and the unicharset. The same
> unicharset should be used for lstm training.
>
> On Sat, 15 Sep 2018, 21:46 Pubudu Tharaka Viswakula, <[email protected]>
> wrote:
>
>> Hi Shree,
>>
>> Thank you very much for improving my awareness. I have one more question,
>>
>> When we create training files, the traineddata file is created. It's
>> better to include the word-dawg, punc-dawg and number-dawg in it. Are they
>> created using the same files we're talking about, i.e. sin.numbers, sin.punc
>> and sin.wordlist? As you explained in answer to an earlier question, they get
>> included in the traineddata file when we use tesstrain.sh, right?
>>
>> Thanks
>>
>> On Sat, 15 Sep 2018, 18:53 Shree Devi Kumar, <[email protected]>
>> wrote:
>>
>>> *desired_characters*
>>>
>>> This is used by Google internally when creating the training text.
>>>
>>> Should I enter all those compound character combinations to this file?
>>>
>>> No, since this is not used by tesstrain.sh, at least in the open source
>>> version on GitHub.
>>>
>>> *okfonts.txt*
>>>
>>> This lists the Unicode fonts used for the LSTM training.
>>>
>>> Can I include non-Unicode fonts in this file?
>>>
>>> No, because the rendered text will be incorrect.
>>>
>>> *sin.numbers*
>>>
>>> This file includes all the number characters used in Sinhala.
>>>
>>> Unless something is changed in Google's internal training method, this
>>> should NOT have the number characters. Rather, it should have patterns of
>>> how numbers may be formatted when used in this language. It may help to look
>>> at the eng.numbers file for reference.
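To illustrate the difference between listing digits and listing number patterns, here is a rough Python sketch that derives candidate patterns from tokens found in real text. It is only an assumption about what a pattern looks like: each digit is swapped for a placeholder, and the placeholder defaults to a space because, as far as I know, that is how the number-dawg component is described in the Tesseract documentation. Please compare against eng.numbers before relying on it. The function name is made up for this example.

```python
from collections import Counter


def number_patterns(tokens, placeholder=" "):
    """Turn tokens that contain digits into formatting patterns.

    Every digit is replaced by `placeholder` (an assumed convention; check
    eng.numbers for the one actually used). Patterns are returned most
    frequent first.
    """
    counts = Counter()
    for token in tokens:
        if any(ch.isdigit() for ch in token):
            pattern = "".join(placeholder if ch.isdigit() else ch for ch in token)
            counts[pattern] += 1
    return [pattern for pattern, _ in counts.most_common()]


if __name__ == "__main__":
    sample = ["12.50", "2018-09-15", "33.10", "abc"]
    # "#" is used here only so the patterns are visible when printed.
    for pattern in number_patterns(sample, placeholder="#"):
        print(pattern)
```

Tokens without digits are skipped, so the result describes how numbers are laid out (decimal points, separators, surrounding symbols) rather than which digits exist.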
>>>
>>> *sin.punc*
>>> In lang data this contains punctuation combinations.
>>>
>>> Similar to the numbers file, this should have patterns of punctuation
>>> characters used in the language. Again, refer to eng.punc for reference.
>>>
>>>
>>> *sin.singles_text*
>>>
>>> Similar file to wordlist. Contains unique words followed by a new line
>>>
>>> In Devanagari it also has unique/rare syllables (compound character
>>> combinations). Without having the scripts used by Ray (Google) for
>>> training, it is difficult to say how this is used. I am guessing that these
>>> are used in addition to training_text to build the unicharset.
>>>
>>> *sin.training_text*
>>>
>>> The training_text in langdata_lstm seems to be random words, numbers and
>>> phrases (based on English and Devanagari). So this may be based on word
>>> frequencies in the language. While Ray's notes on training say to use text that
>>> is representative of the language or text to be recognized, the
>>> training_text does not seem to be full sentences. It's possible that this
>>> kind of training_text gives better results with LSTM for recognizing
>>> text/words not seen before. I do not really know.
>>>
>>> *sin.unicharset*
>>>
>>> This file will be created when creating training data
>>>
>>> Yes, please check the sin.lstm-unicharset in the sin.traineddata files
>>> to check that all required characters are there.
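A minimal Python sketch of that check, assuming the lstm-unicharset has already been extracted from the traineddata file to a plain-text file (combine_tessdata -u is one way to unpack the components). The layout assumed here is: first line is the entry count, and each following line starts with the symbol followed by space-separated properties. The function names are invented for this example.

```python
def unicharset_symbols(lines):
    """Collect the first field of every entry line in a text unicharset.

    Assumed layout: line 1 holds the entry count, every later line starts
    with the symbol itself followed by space-separated properties.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if not rows:
        return set()
    int(rows[0])  # raises ValueError if the header is not a count
    return {row.split(" ", 1)[0] for row in rows[1:]}


def missing_symbols(required, symbols):
    """Return the required symbols that are absent, in input order."""
    return [s for s in required if s not in symbols]


if __name__ == "__main__":
    # Made-up fragment for illustration only, not real unicharset content.
    sample = ["3\n", "NULL 0\n", "\u0daf 1\n", "\u0ddc 0\n"]
    symbols = unicharset_symbols(sample)
    print(missing_symbols(["\u0daf", "\u0db6"], symbols))
```

For a real file, pass the open file object (opened with encoding="utf-8") as `lines`, and a list of the characters you expect as `required`.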
>>>
>>> *sin.wordlist*
>>>
>>> Contains unique words followed by a new line
>>>
>>> This dictionary as well as punc and numbers are used to create dawg
>>> files which are stored in traineddata files and provide some improvement in
>>> recognition.
>>>
>>> -------------------------
>>>
>>> What you could do is create a file with all valid characters and
>>> syllables (compound character combinations) for Sinhala. Then use this
>>> file as input and grep the sin.training_text in langdata_lstm to make sure
>>> that all combos are included in your training text for fine tuning.
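The check described above could be sketched roughly as follows in Python instead of grep. This is only an illustration: the function names and the file names in the usage note are made up, and it assumes the combinations file lists one character or syllable per line in UTF-8. Both sides are normalised to NFC so that equivalent but differently encoded sequences still match.

```python
import unicodedata


def missing_combinations(combos, training_text):
    """Return the combos that never occur in training_text, in input order.

    Blank entries and duplicates are skipped. Matching is plain substring
    search on NFC-normalised text.
    """
    text = unicodedata.normalize("NFC", training_text)
    seen = set()
    missing = []
    for combo in combos:
        combo = unicodedata.normalize("NFC", combo.strip())
        if not combo or combo in seen:
            continue
        seen.add(combo)
        if combo not in text:
            missing.append(combo)
    return missing


def check_files(combos_path, training_text_path):
    """Read both files as UTF-8 and report the combos missing from the text."""
    with open(combos_path, encoding="utf-8") as f:
        combos = f.read().splitlines()
    with open(training_text_path, encoding="utf-8") as f:
        text = f.read()
    return missing_combinations(combos, text)
```

Usage would be something like check_files("sin.combos.txt", "sin.training_text"), where sin.combos.txt is a hypothetical name for the file of valid combinations. Because this is a substring search, a bare consonant counts as present even when it only ever appears with a vowel sign attached.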
>>>
>>>
>>>
>>> On Sat, Sep 15, 2018 at 7:43 PM, Shandigutt <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I downloaded the latest LSTM langdata from the tesseract repository and found
>>>> that it contains a lot of incorrect data for Sinhala. I'm trying to train tesseract
>>>> for Sinhala. According to the tesseract wiki guidelines, we need to create lang
>>>> data before creating training data using the tesstrain.sh script. I'm
>>>> referring to the wiki guidelines below:
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>>
>>>> I couldn't find proper wiki guidelines on creating lang data. When I
>>>> inspected the 'sin' folder in langdata_lstm, I found the files below:
>>>>
>>>>
>>>>    - desired_characters
>>>>    - okfonts.txt
>>>>    - sin.numbers
>>>>    - sin.punc
>>>>    - sin.singles_text
>>>>    - sin.training_text
>>>>    - sin.unicharset
>>>>    - sin.wordlist
>>>>
>>>>
>>>> Please let me know if there is proper documentation that I can follow
>>>> if I create these files on my own from scratch. Based on my
>>>> observations, I have the following understanding of these files. If there is no
>>>> proper documentation for them, please correct me if I state anything wrong
>>>> here:
>>>>
>>>> *desired_characters*
>>>>
>>>> This file contains all the unique characters found in the language,
>>>> each character followed by a new line. My question: the Sinhala language has
>>>> many vowel characters that create compound characters with Sinhala
>>>> consonants. Unlike in English, once a vowel character is attached to a
>>>> consonant, it usually creates a single compound character, which I
>>>> can erase with a single press of the backspace key. Please refer to the
>>>> examples below:
>>>>
>>>> Example 1:
>>>>
>>>> Consonant : ද
>>>>
>>>> Vowel character : ො
>>>>
>>>> Compound character : දො
>>>>
>>>> Example 2:
>>>>
>>>> Consonant : බ
>>>>
>>>> Vowel character : ්
>>>>
>>>> Compound character : බ්
>>>>
>>>> So each consonant combined with the different vowel characters makes a lot of
>>>> compound characters. Should I enter all those compound character
>>>> combinations into this file?
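For what it is worth, each compound character in the two examples is stored as a sequence of two Unicode code points even though it is displayed and deleted as one unit. The small Python sketch below shows this. It assumes the text is in its usual composed (NFC) form; the strings are written as escapes so the sequence is unambiguous, and the helper name is made up for this example.

```python
import unicodedata


def code_points(text):
    """List each code point in text as a (U+XXXX, Unicode name) pair."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# The two compound characters from the examples: consonant first, then the
# attached sign.
example_1 = "\u0daf\u0ddc"
example_2 = "\u0db6\u0dca"

if __name__ == "__main__":
    for label, text in (("Example 1", example_1), ("Example 2", example_2)):
        print(label, "length:", len(text))
        for code, name in code_points(text):
            print("  ", code, name)
```

This is why a per-character listing and a per-syllable listing of the same script can differ so much in size.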
>>>>
>>>>
>>>> *okfonts.txt*
>>>>
>>>> This file includes the fonts I use in my training_text. The format is a font
>>>> name followed by a new line. Can I include non-Unicode fonts in this
>>>> file?
>>>>
>>>> *sin.numbers*
>>>>
>>>> This file includes all the number characters used in Sinhala, each number
>>>> character followed by a new line. Normally this contains only 10 characters.
>>>>
>>>> *sin.punc*
>>>>
>>>> This file contains all the punctuation characters that can be used
>>>> in Sinhala text. The format is a punctuation character followed by a new line. In
>>>> lang data this contains punctuation combinations. Please explain why.
>>>>
>>>> *sin.singles_text*
>>>>
>>>> Similar file to wordlist. Contains unique words followed by a new line
>>>>
>>>> *sin.training_text*
>>>>
>>>> Training text to be used when creating training data. Should contain
>>>> around 40000 text lines. Each line can have any number of characters. It’s
>>>> better if this document contains text in multiple fonts that we have
>>>> defined in okfonts.txt. (These fonts can also be passed as a command line
>>>> argument.)
>>>>
>>>> *sin.unicharset*
>>>>
>>>> This file will be created when creating training data
>>>>
>>>> *sin.wordlist*
>>>>
>>>> Contains unique words followed by a new line
>>>>
>>>> Appreciate your response on this.
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/9e8be0bf-b0d5-4408-98b7-283913ccf642%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/9e8be0bf-b0d5-4408-98b7-283913ccf642%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com up
>>>
