Hello everyone, I have a basic question about adapting Tesseract 4 to a modified language model. Assume I modify the contents of https://github.com/tesseract-ocr/langdata_lstm/tree/master/deu to better fit our text domain (I know that takes a lot of steps, but assume I got it done).
As I understand it, the LSTM is basically trained on rendered variations of deu.training_text, so if I change that file, I need to retrain the whole network from scratch. But what if I skip the retraining and only build a new traineddata file that contains the "old" LSTM network together with the modified dictionary files? Do I still get the effect that the LSTM recognizer https://github.com/tesseract-ocr/tesseract/blob/master/src/lstm/lstmrecognizer.cpp prefers the words in the modified dictionary by a factor of 2.25 over non-dictionary words? Would the effect be the same as with a custom dictionary, or do I get an additional benefit, e.g. by modifying https://github.com/tesseract-ocr/langdata_lstm/blob/master/deu/deu.bad_words, that I cannot get with a custom dictionary?

Best regards,
Lars
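
P.S. To make the "new traineddata, old network" option concrete, here is a rough sketch of what I have in mind. It only builds the command lines and does not run them. The file names my_domain.wordlist and work/ are placeholders I made up, and I am assuming the unpacked component names are deu.lstm-unicharset and deu.lstm-word-dawg. Since combine_tessdata -o overwrites components in place, I would run this on a copy of deu.traineddata, and the work directory would have to exist first.

```python
from pathlib import Path


def dictionary_swap_commands(traineddata, wordlist, workdir="work"):
    """Build the commands for replacing only the LSTM word dictionary.

    Nothing is executed here; the function just returns argument lists.
    """
    traineddata = Path(traineddata)
    lang = traineddata.name.split(".")[0]        # e.g. "deu"
    prefix = str(Path(workdir) / lang) + "."     # output prefix for unpacked parts
    unicharset = prefix + "lstm-unicharset"      # assumed component name
    word_dawg = prefix + "lstm-word-dawg"        # assumed component name
    return [
        # 1. Unpack all components of the existing traineddata file.
        ["combine_tessdata", "-u", str(traineddata), prefix],
        # 2. Compile the modified word list into a new DAWG.
        ["wordlist2dawg", str(wordlist), word_dawg, unicharset],
        # 3. Overwrite just the word DAWG; the LSTM network stays untouched.
        ["combine_tessdata", "-o", str(traineddata), word_dawg],
    ]


commands = dictionary_swap_commands("deu.traineddata", "my_domain.wordlist")
for cmd in commands:
    print(" ".join(cmd))
```

Each list could be passed to subprocess.run(cmd, check=True) once the paths are real. Whether the recognizer then actually favours the new words is exactly what I am unsure about.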

