Thank you very much, Shree.

On Sat, 15 Sep 2018, 20:13 Shree Devi Kumar, <[email protected]> wrote:
> >Are they created using the same files we're talking about (sin.numbers,
> >sin.punc and sin.wordlist)?
>
> Yes, the dawg files are created from these and the unicharset. The same
> unicharset should be used for LSTM training.
>
> On Sat, 15 Sep 2018, 21:46 Pubudu Tharaka Viswakula, <[email protected]>
> wrote:
>
>> Hi Shree,
>>
>> Thank you very much for improving my understanding. I have one more question.
>>
>> When we create training files, a traineddata file is created. It's
>> better to include the word-dawg, punc-dawg and number-dawg in it. Are they
>> created using the same files we're talking about (sin.numbers, sin.punc and
>> sin.wordlist)? As you explained in answer to an earlier question, they get
>> included in the traineddata file when we use tesstrain.sh, right?
>>
>> Thanks
>>
>> On Sat, 15 Sep 2018, 18:53 Shree Devi Kumar, <[email protected]>
>> wrote:
>>
>>> *desired_characters*
>>>
>>> This is used by Google internally when creating the training text.
>>>
>>> Should I enter all those compound character combinations to this file?
>>>
>>> No, since this is not used by tesstrain.sh - at least in the open source
>>> version on GitHub.
>>>
>>> *okfonts.txt*
>>>
>>> This lists the Unicode fonts used for the LSTM training.
>>>
>>> Can I include non Unicode fonts into this file?
>>>
>>> NO. Because the rendered text will be incorrect.
>>>
>>> *sin.numbers*
>>>
>>> This file includes all the number characters used in Sinhala.
>>>
>>> Unless something has changed in Google's internal training method, this
>>> should NOT have the number characters. Rather, it should have patterns of
>>> how numbers may be formatted when used in this language. It may help to look
>>> at the eng.numbers file for reference.
>>>
>>> *sin.punc*
>>>
>>> In lang data this contains punctuation combinations.
>>>
>>> Similar to the numbers file, this should have patterns of punctuation
>>> characters used in the language. Again, refer to eng.punc for reference.
>>>
>>> *sin.singles_text*
>>>
>>> Similar file to wordlist. Contains unique words followed by a new line
>>>
>>> In Devanagari it also has unique/rare syllables (compound character
>>> combinations). Without having the scripts used by Ray (Google) for
>>> training, it is difficult to say how this is used. I am guessing that these
>>> are used in addition to training_text to build the unicharset.
>>>
>>> *sin.training_text*
>>>
>>> The training_text in langdata_lstm seems to be random words, numbers and
>>> phrases (based on English and Devanagari). So this may be based on word
>>> frequencies in the language. While Ray's notes on training say to use text that
>>> is representative of the language or text to be recognized, the
>>> training_text does not seem to be full sentences. It's possible that this
>>> kind of training_text gives better results with LSTM for recognizing
>>> text/words not seen before. I do not really know.
>>>
>>> *sin.unicharset*
>>>
>>> This file will be created when creating training data
>>>
>>> Yes, please check the sin.lstm-unicharset in the sin.traineddata files
>>> to confirm that all required characters are there.
>>>
>>> *sin.wordlist*
>>>
>>> Contains unique words followed by a new line
>>>
>>> This dictionary, as well as punc and numbers, is used to create dawg
>>> files, which are stored in traineddata files and provide some improvement in
>>> recognition.
>>>
>>> -------------------------
>>>
>>> What you could do is create a file with all valid characters and
>>> syllables (compound character combinations) for Sinhala. Then use this
>>> file as input and grep the sin.training_text in langdata_lstm to make sure
>>> that all combos are included in your training text for fine tuning.
>>>
>>> On Sat, Sep 15, 2018 at 7:43 PM, Shandigutt <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I downloaded the latest LSTM langdata from the tesseract repository. I found it
>>>> contains a lot of incorrect data for Sinhala.
>>>> I'm trying to train Tesseract for Sinhala. According to the Tesseract wiki
>>>> guidelines, we need to create langdata before creating training data using
>>>> the tesstrain.sh script. I'm referring to the wiki guidelines below:
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>>
>>>> I couldn't find proper wiki guidelines on creating langdata. When I
>>>> inspected the 'sin' folder in langdata-lstm I found the files below:
>>>>
>>>>    - desired_characters
>>>>    - okfonts.txt
>>>>    - sin.numbers
>>>>    - sin.punc
>>>>    - sin.singles_text
>>>>    - sin.training_text
>>>>    - sin.unicharset
>>>>    - sin.wordlist
>>>>
>>>> Please let me know if there's proper documentation that I can follow
>>>> if I create these files on my own from scratch. Based on my
>>>> observations, I have the following understanding of these files. If there's no
>>>> proper documentation for them, please correct me if I mention anything wrong
>>>> here.
>>>>
>>>> *desired_characters*
>>>>
>>>> This file contains all the unique characters found in the language,
>>>> each character followed by a new line. My question is this: the Sinhala language has
>>>> many vowel characters that create compound characters with Sinhala
>>>> consonants. Unlike in English, once a vowel character is attached to a
>>>> consonant it usually creates a single compound character, which I
>>>> can erase with a single press of backspace. Please refer to the
>>>> examples below.
>>>>
>>>> Example 1:
>>>>
>>>> Consonant : ද
>>>>
>>>> Vowel character : ො
>>>>
>>>> Compound character : දො
>>>>
>>>> Example 2:
>>>>
>>>> Consonant : බ
>>>>
>>>> Vowel character : ්
>>>>
>>>> Compound character : බ්
>>>>
>>>> So each consonant combined with the different vowel characters makes a lot of
>>>> compound characters. Should I enter all those compound character
>>>> combinations to this file?
>>>>
>>>> *okfonts.txt*
>>>>
>>>> This file includes the fonts I use in my training_text.
>>>> Format is font name followed by a new line. Can I include non Unicode
>>>> fonts into this file?
>>>>
>>>> *sin.numbers*
>>>>
>>>> This file includes all the number characters used in Sinhala, each number
>>>> character followed by a new line. Normally this contains only 10 characters.
>>>>
>>>> *sin.punc*
>>>>
>>>> This file contains all the punctuation characters that can be used
>>>> in Sinhala text. Format is punctuation character followed by a new line. In
>>>> lang data this contains punctuation combinations. Please explain why?
>>>>
>>>> *sin.singles_text*
>>>>
>>>> Similar file to wordlist. Contains unique words followed by a new line
>>>>
>>>> *sin.training_text*
>>>>
>>>> Training text to be used when creating training data. Should contain
>>>> around 40000 text lines. Each line can have any number of characters. It's
>>>> better if this document contains text in the multiple fonts that we have
>>>> defined in okfonts.txt. (These fonts can be passed as a command line
>>>> argument as well.)
>>>>
>>>> *sin.unicharset*
>>>>
>>>> This file will be created when creating training data
>>>>
>>>> *sin.wordlist*
>>>>
>>>> Contains unique words followed by a new line
>>>>
>>>> I would appreciate your response on this.
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/9e8be0bf-b0d5-4408-98b7-283913ccf642%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>> >>> >>> >>> >>> -- >>> >>> ____________________________________________________________ >>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com up >>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "tesseract-ocr" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to [email protected]. >>> To post to this group, send email to [email protected]. >>> Visit this group at https://groups.google.com/group/tesseract-ocr. >>> To view this discussion on the web visit >>> https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVwUEM4SmsO8nSVwB76wsmdbzynwcK8-30_cDnEawW2Gg%40mail.gmail.com >>> <https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVwUEM4SmsO8nSVwB76wsmdbzynwcK8-30_cDnEawW2Gg%40mail.gmail.com?utm_medium=email&utm_source=footer> >>> . >>> For more options, visit https://groups.google.com/d/optout. >>> >> -- >> You received this message because you are subscribed to the Google Groups >> "tesseract-ocr" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected]. >> To post to this group, send email to [email protected]. >> Visit this group at https://groups.google.com/group/tesseract-ocr. >> To view this discussion on the web visit >> https://groups.google.com/d/msgid/tesseract-ocr/CAKOih%3DkGPrkU%2BCm1KFooKyUknQjR0i8zxf6ZrHzZkLg9vwNjfA%40mail.gmail.com >> <https://groups.google.com/d/msgid/tesseract-ocr/CAKOih%3DkGPrkU%2BCm1KFooKyUknQjR0i8zxf6ZrHzZkLg9vwNjfA%40mail.gmail.com?utm_medium=email&utm_source=footer> >> . >> For more options, visit https://groups.google.com/d/optout. >> > -- > You received this message because you are subscribed to the Google Groups > "tesseract-ocr" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To post to this group, send email to [email protected]. > Visit this group at https://groups.google.com/group/tesseract-ocr. 
> To view this discussion on the web visit > https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduU58m%3DYybzA-sEMbEp80Zzke%3DYKK4YfKAG34zpaFy2Xww%40mail.gmail.com > <https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduU58m%3DYybzA-sEMbEp80Zzke%3DYKK4YfKAG34zpaFy2Xww%40mail.gmail.com?utm_medium=email&utm_source=footer> > . > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAKOih%3D%3DcQ44Z2z_%3DinuxQJCRwYHQ8dBORciacryq5LC5A3nzog%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
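The check suggested in the thread (list every valid syllable, then search the training text for each one) could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the two consonants and two signs are the ones from the examples in the original question, not a complete Sinhala inventory, and the function names and the `missing_in_file` helper are hypothetical, not part of Tesseract or tesstrain.sh. Plain substring matching behaves like a fixed-string grep, so it says nothing about how often a syllable occurs.

```python
import unicodedata
from itertools import product

# Illustrative subsets taken from the examples in the thread; a real run would
# need the full lists of Sinhala consonants and dependent signs (assumption).
CONSONANTS = ["\u0daf", "\u0db6"]   # ද, බ
VOWEL_SIGNS = ["\u0ddc", "\u0dca"]  # ො, ්


def nfc(text):
    """Normalize to NFC so equivalent code point sequences compare equal."""
    return unicodedata.normalize("NFC", text)


def build_syllables(consonants, vowel_signs):
    """Return every consonant + sign combination, in input order."""
    return [nfc(c + v) for c, v in product(consonants, vowel_signs)]


def missing_syllables(syllables, training_text):
    """Return the syllables that never occur in the training text."""
    text = nfc(training_text)
    return [s for s in syllables if s not in text]


def missing_in_file(syllables, path):
    """Same check against a file such as a local copy of sin.training_text."""
    with open(path, encoding="utf-8") as handle:
        return missing_syllables(syllables, handle.read())


# Tiny inline sample standing in for the real training text.
sample_text = "\u0daf\u0ddc \u0db6"
syllables = build_syllables(CONSONANTS, VOWEL_SIGNS)
missing = missing_syllables(syllables, sample_text)
for syllable in missing:
    print(syllable)
```

Any syllable that is printed would then need lines added to the fine-tuning text so that it is represented.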

