cat ~/sin.syllables_text | while IFS=" " read target; do grep -F -m 7 "$target" ~/langdata_lstm/sin/sin.training_text ; done > tmp.txt
sort -u tmp.txt > ~/sin.sample7.training_text

The above will create a training text with up to 7 sample lines for each line in syllables_text (grep -m 7 stops after 7 matching lines). Change -m 7 to -m 1 to create a file with just one sample of each. sort -u removes duplicate lines. This can be used to create a smaller training_text that is useful for fine-tuning.

On Sat, Sep 15, 2018 at 9:23 PM, Shree Devi Kumar <[email protected]> wrote:

> *desired_characters*
>
> This is used by Google internally when creating the training text.
>
> Should I enter all those compound character combinations to this file?
>
> No, since this is not used by tesstrain.sh - at least in the open source
> version in Github.
>
> *okfonts.txt*
>
> This lists the Unicode fonts used for the LSTM training.
>
> Can I include non Unicode fonts into this file?
>
> NO. Because the rendered text will be incorrect.
>
> *sin.numbers*
>
> This file includes all the number characters used in Sinhala.
>
> Unless something is changed in Google's internal training method, this
> should NOT have the number characters. Rather it should have patterns of
> how numbers may be formatted when used in this language. It may help to look
> at the eng.numbers file for reference.
>
> *sin.punc*
>
> In lang data this contains punctuation combinations.
>
> Similar to the numbers file, this should have patterns of punctuation
> characters used in the language. Again, refer to eng.punc for reference.
>
> *sin.singles_text*
>
> Similar file to wordlist. Contains unique words followed by a new line
>
> In Devanagari it also has unique/rare syllables (compound character
> combinations). Without having the scripts used by Ray (Google) for
> training, it is difficult to say how this is used. I am guessing that these
> are used in addition to training_text to build the unicharset.
>
> *sin.training_text*
>
> The training_text in langdata_lstm seems to be random words, numbers and
> phrases (based on English and Devanagari). So this may be based on word
> frequencies in the language.
> While Ray's notes on training say to use text that
> is representative of the language or text to be recognized, the
> training_text does not seem to be full sentences. It's possible that this
> kind of training_text gives better results with LSTM for recognizing
> text/words not seen before. I do not really know.
>
> *sin.unicharset*
>
> This file will be created when creating training data
>
> Yes, please check the sin.lstm-unicharset in the sin.traineddata files to
> check that all required characters are there.
>
> *sin.wordlist*
>
> Contains unique words followed by a new line
>
> This dictionary as well as punc and numbers are used to create dawg files
> which are stored in traineddata files and provide some improvement in
> recognition.
>
> -------------------------
>
> What you could do is create a file with all valid characters and syllables
> (compound character combinations) for Sinhala. Then use this file as input
> and grep the sin.training_text in langdata_lstm to make sure that all combos
> are included in your training text for fine tuning.
>
> On Sat, Sep 15, 2018 at 7:43 PM, Shandigutt <[email protected]> wrote:
>
>> Hi,
>>
>> I downloaded the latest lstm langdata from the tesseract repository. I found it
>> consists of a lot of false data for Sinhala. I'm trying to train tesseract
>> for Sinhala. According to the tesseract wiki guidelines, we need to create lang
>> data before creating training data using the tesstrain.sh script. I'm
>> referring to the below wiki guidelines,
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>
>> I couldn't find proper wiki guidelines on creating lang data.
>> When I inspected the 'sin' folder in langdata-lstm I found the below files,
>>
>> - desired_characters
>> - okfonts.txt
>> - sin.numbers
>> - sin.punc
>> - sin.singles_text
>> - sin.training_text
>> - sin.unicharset
>> - sin.wordlist
>>
>> Please let me know if there's proper documentation that I can follow if
>> I create these files on my own from scratch. According to my
>> observations I have the following idea of these files. If there is no
>> proper documentation of them, please correct me if I mention anything wrong
>> here,
>>
>> *desired_characters*
>>
>> This file contains all the unique characters found in the language. Each
>> character is followed by a new line. My question is: the Sinhala language has many
>> vowel characters that create compound characters with Sinhala consonants.
>> Unlike English, once a vowel character is attached to a consonant it creates
>> a single compound character most of the time, which I can erase with a
>> single keyboard backspace. Please refer to the below examples,
>>
>> Example 1:
>>
>> Consonant : ද
>>
>> Vowel character : ො
>>
>> Compound character : දො
>>
>> Example 2:
>>
>> Consonant : බ
>>
>> Vowel character : ්
>>
>> Compound character : බ්
>>
>> So each consonant combined with the different vowel characters makes a lot of compound
>> characters. Should I enter all those compound character combinations to
>> this file?
>>
>> *okfonts.txt*
>>
>> This file includes the fonts I use in my training_text. Format is font
>> name followed by a new line. Can I include non Unicode fonts into this file?
>>
>> *sin.numbers*
>>
>> This file includes all the number characters used in Sinhala. Number
>> character followed by a new line. Normally this contains only 10 characters
>>
>> *sin.punc*
>>
>> This file contains all the punctuation characters that can be used
>> in Sinhala text. Format is punctuation character followed by a new line. In
>> lang data this contains punctuation combinations. Please explain why?
>>
>> *sin.singles_text*
>>
>> Similar file to wordlist. Contains unique words followed by a new line
>>
>> *sin.training_text*
>>
>> Training text to be used when creating training data. Should contain
>> around 40000 text lines. Each line can have any number of characters. It's
>> better if this document contains text in multiple fonts that we have
>> defined in okfonts.txt. (These fonts can be passed as a command line
>> argument as well)
>>
>> *sin.unicharset*
>>
>> This file will be created when creating training data
>>
>> *sin.wordlist*
>>
>> Contains unique words followed by a new line
>>
>> Appreciate your response on this.
>>
>> Thanks
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/9e8be0bf-b0d5-4408-98b7-283913ccf642%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>
> --
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
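
A possible companion to the grep loop at the top of this message, offered only as a sketch: before sampling, it may help to see which entries of the syllable list never occur in the training text at all, since those would silently contribute nothing to the sampled file. The function name missing_syllables is hypothetical (it is not part of tesseract or tesstrain.sh), and the file paths in the example call are the ones used in this thread, assumed to exist on your machine.

```shell
# Hypothetical helper (not part of tesseract): print every line of a
# syllable list that never occurs in a training text.
#   $1 = file with one character/syllable per line
#   $2 = training text to search
missing_syllables() {
    syllables_file=$1
    training_file=$2
    # "|| [ -n "$target" ]" also handles a last line without a newline.
    while IFS= read -r target || [ -n "$target" ]; do
        # Skip blank lines: an empty pattern would match every line.
        [ -n "$target" ] || continue
        # -F: fixed string, -q: no output, -e: pattern follows
        if ! grep -F -q -e "$target" "$training_file"; then
            printf '%s\n' "$target"
        fi
    done < "$syllables_file"
}

# Example call with the paths from this thread (assumed to exist):
# missing_syllables ~/sin.syllables_text ~/langdata_lstm/sin/sin.training_text
```

Any syllable this prints would need extra lines added to the fine-tuning text by hand, because the sampling loop cannot find examples for it.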

