Thanks for the reply. I understand. I have a couple of follow-up questions on this topic.
1) Should training_text contain only the text for the new material to be learned? For example, if the LSTM net has already learned the line "I have a pen" and we want it to learn the line "I have a pineapple", should training_text include only the pineapple line, with the pen line removed?

2) In https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh the files in langdata other than training_text are said to be optional. I suppose these files are handled internally as hints. Am I right? And what if these files are inconsistent with training_text? For example, the wordlist may contain fairly irrelevant words. Should I delete the optional files if they are inconsistent?

3) Closely related to 2): when langdata doesn't have these optional files, does Tesseract generate them internally from training_text?

4) Is there no way to fine-tune legacy Tesseract?

5) In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 there is a note:

> NOTE Tesseract 4.00 will now run happily with a traineddata file that
> contains just lang.lstm. The lstm-*-dawgs are optional, and none of the
> other files are required or used with OEM_LSTM_ONLY as the OCR engine mode.
> No bigrams, unichar ambigs or any of the other files are needed or even have
> any effect if present.

Does this mean that if we use LSTM only (legacy Tesseract is going to be removed in a future release, right?), optional files like wordlist are entirely unnecessary? This sounds natural to me because, as far as I understand, the LSTM net only learns a text line from a sequence of bytes or an image.

By the way, what does "dawgs" mean?

