Ray is the best person to answer your questions. I can only share my experience of training with the Devanagari script.

Fine-tuning will work if all you want to change is the font, keeping the same unicharset. This works well for Latin-script languages but not for complex scripts. For example, in Devanagari the consonants, vowel marks, and combining marks together form an 'akshara' glyph, and the unicharset in the language model contains these aksharas. If the new training text has akshara glyphs that are not already in the unicharset, fine-tune training fails with errors such as "Encoding of string failed!".

For Devanagari, I have tried training by replacing the top layer. This adds the new akshara glyphs. However, for good accuracy, training has to continue until the error rate reaches 0.01%, which takes very long; I have not been able to reach that level in my training. Again, this may affect accuracy on the originally trained fonts.

Currently, using --eval_listfile with a different set of images during training does not work.

dawgs (directed acyclic word graphs) are a way of compressing the wordlists. See
https://tesseract-ocr.repairfaq.org/allaboutdawg.html

There is no way to fine-tune the legacy engine.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, May 29, 2017 at 9:19 AM, Akira Hayakawa <[email protected]> wrote:

> Thanks for the reply. I understand.
>
> There are couple of questions related to this topic.
>
> 1)
>
> training_text may only include the text for the next (or new) learning?
> For example, the LSTM net have learned a line "I have a pen" and we need
> it to learn a line "I have a pineapple" then does training_text only
> include the pineapple line but the pen line is removed?
>
> 2)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh
>
> the files in langdata other than training_text are said to be optional.
> I suppose these files are internally handled as hints. Am I right?
> And what if these files are inconsistent with training_text? For example,
> wordlist may contain fairly irrelevant words.
> Should I erase the optional files if they are inconsistent?
>
> 3)
>
> Closely related to 2).
> When the langdata doesn't have these optional files. Tesseract internally
> generates the files from training_text?
>
> 4)
>
> Is there no way to fine-tune legacy tesseract?
>
> 5)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
> These is a note:
>
>> NOTE Tesseract 4.00 will now run happily with a traineddata file that
>> contains just lang.lstm. The lstm-*-dawgs are optional, and none of the
>> other files are required or used with OEM_LSTM_ONLY as the OCR engine mode.
>> No bigrams, unichar ambigs or any of the other files are needed or even
>> have any effect if present.
>
> Does this mean if we use LSTM only (legacy tesseract is going to be purged
> in the future release right?), the optionals files like wordlist are
> entirely needless? This sounds natural to me because as far as I understand
> the LSTM net only learn a text line from a sequence of byte or image.
> btw, What does "dawgs" mean?
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/5d410061-f281-42bd-98f5-04a746700dca%40googlegroups.com
> For more options, visit https://groups.google.com/d/optout.
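
P.S. To make the unicharset point in my reply concrete: before fine-tuning, it can help to check whether the new training text contains aksharas that the existing model does not know, since those are what trigger "Encoding of string failed!". Below is a minimal Python sketch of such a check. It is only an approximation: the segmentation rule (attach combining marks, and attach whatever follows a virama) is my simplification and not Tesseract's own logic, and known_aksharas is a hypothetical stand-in for the entries of a real unicharset.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari sign virama (halant), joins consonants


def split_aksharas(text):
    """Rough akshara segmentation (an approximation, not Tesseract's logic).

    A character is attached to the previous cluster when it is a combining
    mark (vowel sign, anusvara, virama, ...) or when the previous cluster
    ends in a virama, which forms a conjunct. Clusters never span words.
    """
    clusters = []
    for word in text.split():
        start = len(clusters)  # index of the first cluster of this word
        for ch in word:
            is_mark = unicodedata.category(ch) in ("Mn", "Mc")
            in_word = len(clusters) > start
            after_virama = in_word and clusters[-1].endswith(VIRAMA)
            if in_word and (is_mark or after_virama):
                clusters[-1] += ch
            else:
                clusters.append(ch)
    return clusters


def missing_aksharas(new_text, known_aksharas):
    """Return the aksharas in new_text that are absent from known_aksharas."""
    return sorted(set(split_aksharas(new_text)) - set(known_aksharas))


if __name__ == "__main__":
    # Hypothetical set standing in for the unicharset of the existing model.
    known_aksharas = {"न", "म", "स्ते", "य"}
    new_text = "नमस्ते क्षत्रिय"
    print("aksharas not in the existing set:",
          missing_aksharas(new_text, known_aksharas))
```

If the list is empty, plain fine-tuning with the same unicharset is at least plausible; if not, the new aksharas have to be added, for example by retraining the top layer as I described.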

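
On the "dawgs compress the wordlists" point, here is a toy Python sketch of the idea, assuming nothing about Tesseract's actual file format or code (its own tools for this are wordlist2dawg and dawg2wordlist). It builds a plain trie from a small made-up wordlist, then counts how many nodes remain once identical subtrees are merged, which is what turns a trie into a directed acyclic word graph. All function names here are illustrative only.

```python
def build_trie(words):
    """Build a plain trie; each node is {'final': bool, 'next': {char: node}}."""
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def count_trie_nodes(node):
    """Count every node in the trie, including the root."""
    return 1 + sum(count_trie_nodes(child) for child in node["next"].values())


def count_dawg_nodes(root):
    """Count the nodes left after merging identical subtrees.

    Two nodes are identical when they agree on 'final' and their outgoing
    edges lead, letter by letter, to identical nodes. Each distinct
    signature gets one id, so the registry size is the DAWG node count.
    """
    registry = {}

    def canonical_id(node):
        edges = tuple(sorted(
            (ch, canonical_id(child)) for ch, child in node["next"].items()
        ))
        key = (node["final"], edges)
        return registry.setdefault(key, len(registry))

    canonical_id(root)
    return len(registry)


if __name__ == "__main__":
    # Made-up wordlist with shared prefixes and shared suffixes.
    words = ["tap", "taps", "top", "tops"]
    trie = build_trie(words)
    print("trie nodes:", count_trie_nodes(trie))
    print("dawg nodes:", count_dawg_nodes(trie))
```

The trie already shares prefixes; the merge step additionally shares suffixes, which is where most of the saving comes from on real wordlists with common endings.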
