Also look at all three scripts used for training:

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain_utils.sh
https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh

And see this comment in unicharcompress.cpp:

https://github.com/tesseract-ocr/tesseract/blob/8e79297dcefecdb929d753d28554fec51417ec39/ccutil/unicharcompress.cpp

// Most simple scripts
// will encode a single index to a UTF8-string, but Chinese, Japanese, Korean
// and the Indic scripts will contain a many-to-many mapping.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, May 29, 2017 at 10:44 AM, ShreeDevi Kumar <[email protected]> wrote:

> Ray is the best person to answer your questions. I can only share my
> experience of trying to train with the Devanagari script.
>
> Fine-tuning will work if all you want to change is a font, with the same
> unicharset. This works well for Latin-script languages but not for
> complex scripts.
>
> For example, in Devanagari the consonants, vowel marks and combining marks
> together make an 'akshara' glyph, and the unicharset in the language model
> contains these. If the new training text has additional akshara glyphs,
> fine-tune training gives errors such as "Encoding of string failed!"
>
> For Devanagari, I have tried training by changing the top layer. This adds
> the new akshara glyphs. However, for accuracy, training has to be done till
> 0.01%, which takes very long. I have not been able to reach that level of
> accuracy in my training. Again, this may impact the originally trained
> fonts. Currently, using --eval_listfile for a different set of images during
> training does not work.
>
> The -dawgs files are a way of compressing the wordlists. See
> https://tesseract-ocr.repairfaq.org/allaboutdawg.html
>
> There is no way to fine-tune the legacy engine.
>
> ShreeDevi
>
> On Mon, May 29, 2017 at 9:19 AM, Akira Hayakawa <[email protected]> wrote:
>
>> Thanks for the reply. I understand.
>>
>> There are a couple of questions related to this topic.
>>
>> 1)
>>
>> Should training_text include only the text for the next (new) round of
>> learning? For example, if the LSTM net has learned the line "I have a pen"
>> and we need it to learn the line "I have a pineapple", does training_text
>> include only the pineapple line, with the pen line removed?
>>
>> 2)
>>
>> In https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh
>>
>> the files in langdata other than training_text are said to be optional.
>> I suppose these files are handled internally as hints. Am I right?
>> And what if these files are inconsistent with training_text? For example,
>> the wordlist may contain fairly irrelevant words.
>> Should I erase the optional files if they are inconsistent?
>>
>> 3)
>>
>> Closely related to 2).
>> When langdata doesn't have these optional files, does Tesseract
>> internally generate them from training_text?
>>
>> 4)
>>
>> Is there no way to fine-tune legacy Tesseract?
>>
>> 5)
>>
>> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>
>> there is a note:
>>
>>> NOTE Tesseract 4.00 will now run happily with a traineddata file that
>>> contains just lang.lstm. The lstm-*-dawgs are optional, and none of the
>>> other files are required or used with OEM_LSTM_ONLY as the OCR engine mode.
>>> No bigrams, unichar ambigs or any of the other files are needed or even
>>> have any effect if present.
>>
>> Does this mean that if we use LSTM only (legacy Tesseract is going to be
>> purged in a future release, right?), the optional files like the wordlist
>> are entirely unnecessary? This sounds natural to me because, as far as I
>> understand, the LSTM net only learns a text line from a sequence of bytes
>> or an image.
>> By the way, what does "dawgs" mean?
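
To illustrate the akshara point in the thread (new training text that contains glyph clusters missing from the existing unicharset), here is a minimal Python sketch. It is an assumption-laden simplification, not Tesseract's actual encoding logic or unicharset file format: the helper names `split_aksharas` and `missing_aksharas` and the `known_units` set are hypothetical, and the clustering rule (a base character, its combining marks, and any consonant joined through a virama) is only a rough approximation of akshara boundaries.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant), joins consonants into conjuncts


def split_aksharas(text):
    """Split text into approximate akshara clusters (simplified rule).

    A character is appended to the current cluster if it is a combining mark
    (Unicode category Mn or Mc) or if the cluster currently ends with a virama.
    Otherwise it starts a new cluster. Clusters never span whitespace.
    """
    clusters = []
    for word in text.split():
        current = ""
        for ch in word:
            is_mark = unicodedata.category(ch) in ("Mn", "Mc")
            if current and (is_mark or current.endswith(VIRAMA)):
                current += ch
            else:
                if current:
                    clusters.append(current)
                current = ch
        if current:
            clusters.append(current)
    return clusters


def missing_aksharas(training_text, known_units):
    """Return clusters in training_text absent from known_units, in first-seen order.

    known_units is a hypothetical stand-in for the set of units an existing
    model can already encode; it is not parsed from a real unicharset file.
    """
    missing = []
    for cluster in split_aksharas(training_text):
        if cluster not in known_units and cluster not in missing:
            missing.append(cluster)
    return missing


if __name__ == "__main__":
    # Pretend the existing model was trained only on these two words.
    known = set(split_aksharas("भजन आरती"))
    print(missing_aksharas("कीर्तन", known))
```

With the sample words above, the check reports की and र्त as clusters the pretend model has never seen, which is the kind of gap that the thread associates with the "Encoding of string failed!" error during fine-tuning. Running a check like this on new training text before choosing between plain fine-tuning and retraining the top layer is one possible use, under the assumptions stated.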

