Thank you very much for your kind help

On Saturday, September 15, 2018 at 9:29:38 PM UTC+3, shree wrote:
>
> while IFS= read -r target; do
>     grep -F -m 7 -- "$target" ~/langdata_lstm/sin/sin.training_text
> done < ~/sin.syllables_text > tmp.txt
> sort -u tmp.txt > ~/sin.sample7.training_text
>
> The above will create a training text with up to 7 samples for each line
> in syllables_text. Change -m 7 to -m 1 to create a file with just one
> sample of each. sort -u removes duplicate lines.
>
> This can be used to create a smaller training_text useful for fine-tuning.
>
> On Sat, Sep 15, 2018 at 9:23 PM, Shree Devi Kumar <[email protected]> wrote:
>
>> *desired_characters*
>>
>> This is used by Google internally when creating the training text.
>>
>> Should I enter all those compound character combinations to this file?
>>
>> No, since this is not used by tesstrain.sh, at least in the open source
>> version on GitHub.
>>
>> *okfonts.txt*
>>
>> This lists the Unicode fonts used for the LSTM training.
>>
>> Can I include non Unicode fonts into this file?
>>
>> No, because the rendered text will be incorrect.
>>
>> *sin.numbers*
>>
>> This file includes all the number characters used in Sinhala.
>>
>> Unless something is changed in Google's internal training method, this
>> should NOT have the number characters. Rather, it should have patterns of
>> how numbers may be formatted when used in this language. It may help to
>> look at the eng.numbers file for reference.
>>
>> *sin.punc*
>>
>> In lang data this contains punctuation combinations.
>>
>> Similar to the numbers file, this should have patterns of punctuation
>> characters used in the language. Again, see eng.punc for reference.
>>
>> *sin.singles_text*
>>
>> Similar file to wordlist. Contains unique words followed by a new line
>>
>> In Devanagari it also has unique/rare syllables (compound character
>> combinations).
>> Without having the scripts used by Ray (Google) for training, it is
>> difficult to say how this is used. I am guessing that these are used in
>> addition to training_text to build the unicharset.
>>
>> *sin.training_text*
>>
>> The training_text in langdata_lstm seems to be random words, numbers and
>> phrases (based on English and Devanagari). So this may be based on word
>> frequencies in the language. While Ray's notes on training say to use text
>> that is representative of the language or text to be recognized, the
>> training_text does not seem to be full sentences. It's possible that this
>> kind of training_text gives better results with LSTM for recognizing
>> text/words not seen before. I do not really know.
>>
>> *sin.unicharset*
>>
>> This file will be created when creating training data
>>
>> Yes, please check the sin.lstm-unicharset in the sin.traineddata files to
>> check that all required characters are there.
>>
>> *sin.wordlist*
>>
>> Contains unique words followed by a new line
>>
>> This dictionary as well as punc and numbers are used to create dawg files
>> which are stored in traineddata files and provide some improvement in
>> recognition.
>>
>> -------------------------
>>
>> What you could do is create a file with all valid characters and
>> syllables (compound character combinations) for Sinhala. Then use this
>> file as input and grep the sin.training_text in langdata_lstm to make sure
>> that all combos are included in your training text for fine-tuning.
>>
>> On Sat, Sep 15, 2018 at 7:43 PM, Shandigutt <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I downloaded the latest LSTM langdata from the tesseract repository. I
>>> found it contains a lot of false data for Sinhala. I'm trying to train
>>> tesseract for Sinhala. According to the tesseract wiki guidelines, we need
>>> to create lang data before creating training data using the tesstrain.sh
>>> script.
>>> I'm referring to the wiki guidelines below:
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>
>>> I couldn't find proper wiki guidelines on creating lang data. When I
>>> inspected the 'sin' folder in langdata-lstm I found the files below:
>>>
>>>    - desired_characters
>>>    - okfonts.txt
>>>    - sin.numbers
>>>    - sin.punc
>>>    - sin.singles_text
>>>    - sin.training_text
>>>    - sin.unicharset
>>>    - sin.wordlist
>>>
>>> Please let me know if there is proper documentation that I can follow
>>> if I create these files on my own from scratch. Based on my observations,
>>> I have the following understanding of these files. If there is no proper
>>> documentation for them, please correct me if I mention anything wrong here.
>>>
>>> *desired_characters*
>>>
>>> This file contains all the unique characters found in the language, each
>>> character followed by a new line. My question: the Sinhala language has
>>> many vowel characters that create compound characters with Sinhala
>>> consonants. Unlike English, once a vowel character is attached to a
>>> consonant it creates a single compound character most of the time, which
>>> I can erase with a single keyboard backspace. Please refer to the examples
>>> below.
>>>
>>> Example 1:
>>>
>>> Consonant : ද
>>>
>>> Vowel character : ො
>>>
>>> Compound character : දො
>>>
>>> Example 2:
>>>
>>> Consonant : බ
>>>
>>> Vowel character : ්
>>>
>>> Compound character : බ්
>>>
>>> So each consonant combined with the different vowel characters makes a
>>> lot of compound characters. Should I enter all those compound character
>>> combinations into this file?
>>>
>>> *okfonts.txt*
>>>
>>> This file includes the fonts I use in my training_text. The format is font
>>> name followed by a new line. Can I include non-Unicode fonts in this file?
>>>
>>> *sin.numbers*
>>>
>>> This file includes all the number characters used in Sinhala, each number
>>> character followed by a new line.
>>> Normally this contains only 10 characters.
>>>
>>> *sin.punc*
>>>
>>> This file contains all the punctuation characters that can be used in
>>> Sinhala text. The format is punctuation character followed by a new line.
>>> In lang data this contains punctuation combinations. Please explain why.
>>>
>>> *sin.singles_text*
>>>
>>> Similar file to wordlist. Contains unique words followed by a new line
>>>
>>> *sin.training_text*
>>>
>>> Training text to be used when creating training data. Should contain
>>> around 40000 text lines. Each line can have any number of characters. It's
>>> better if this document contains text in the multiple fonts that we have
>>> defined in okfonts.txt. (These fonts can be passed as a command line
>>> argument as well.)
>>>
>>> *sin.unicharset*
>>>
>>> This file will be created when creating training data
>>>
>>> *sin.wordlist*
>>>
>>> Contains unique words followed by a new line
>>>
>>> I would appreciate your response on this.
>>>
>>> Thanks
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/9e8be0bf-b0d5-4408-98b7-283913ccf642%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>
>> --
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
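
The coverage check suggested in the thread (grep the training text for every valid character and syllable) could be sketched roughly as below. This is only an illustration: missing_syllables is a hypothetical helper name, not something shipped with tesseract or tesstrain.sh, and the file paths in the example call are the ones assumed earlier in the thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of tesseract or tesstrain.sh):
# print every line of a syllable list that never occurs in a training text.
missing_syllables() {
    local syllables_file=$1 training_text=$2 target
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r target || [ -n "$target" ]; do
        # Skip blank lines; an empty pattern would match every line.
        [ -n "$target" ] || continue
        # -F: fixed-string match, -q: no output, --: pattern may start with '-'
        if ! grep -F -q -- "$target" "$training_text"; then
            printf '%s\n' "$target"
        fi
    done < "$syllables_file"
}

# Example call, assuming the file names used in the thread:
# missing_syllables ~/sin.syllables_text ~/langdata_lstm/sin/sin.training_text
```

Any syllable it prints has no sample line in the training text, so lines containing it would have to be added by hand before fine-tuning. Note that the match is a plain substring match, so a bare consonant also counts as covered when it only appears inside a compound character.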

