[tesseract-ocr] Improvement of language model / is my understanding correct?

'Lars Fricke' via tesseract-ocr Tue, 22 Jan 2019 03:23:10 -0800

Hello everyone,

I have a basic question about adapting Tesseract 4 to a modified language 
model. Assume I modify the contents of 
https://github.com/tesseract-ocr/langdata_lstm/tree/master/deu to fit 
our text domain better (I know that takes a lot of steps, but assume I got 
it done).


As I understand it, the LSTM is essentially trained on rendered variations 
of deu.training_text, so if I change that, I need to retrain the whole 
network from scratch.

But what if I don't do that and only compile a new traineddata file 
that includes the "old" LSTM network together with the modified dictionary 
files? Do I still get the effect that the LSTM recognizer 
https://github.com/tesseract-ocr/tesseract/blob/master/src/lstm/lstmrecognizer.cpp 
prefers the words in the modified dictionary by a factor of 2.25 over 
non-dictionary words? Would the effect be the same as using a custom 
dictionary, or do I get an additional benefit, e.g. by modifying 
https://github.com/tesseract-ocr/langdata_lstm/blob/master/deu/deu.bad_words 
that I cannot get with a custom dictionary?
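
To make concrete what I mean by "compiling a new traineddata file", here 
is a rough sketch of the repacking steps. It assumes the standard training 
tools combine_tessdata and wordlist2dawg are installed; the file names 
(deu.traineddata, my_words.txt), the working directory and the helper 
function repack_commands are placeholders of mine, and the script only 
builds the command lines instead of running them.

```python
from pathlib import Path


def repack_commands(traineddata, wordlist, workdir="unpacked"):
    """Build the commands that swap the LSTM word list in a traineddata file.

    Nothing is executed here; each command could be passed to
    subprocess.run once workdir exists and the tools are on the PATH.
    """
    lang = Path(traineddata).stem              # e.g. "deu"
    prefix = str(Path(workdir) / lang) + "."   # e.g. "unpacked/deu."
    dawg = prefix + "lstm-word-dawg"
    unicharset = prefix + "lstm-unicharset"
    return [
        # 1. Unpack all components; the LSTM network itself is not changed.
        ["combine_tessdata", "-u", traineddata, prefix],
        # 2. Compile the custom word list against the model's unicharset.
        ["wordlist2dawg", wordlist, dawg, unicharset],
        # 3. Overwrite only the word DAWG inside the traineddata file.
        ["combine_tessdata", "-o", traineddata, dawg],
    ]


for cmd in repack_commands("deu.traineddata", "my_words.txt"):
    print(" ".join(cmd))
```

Whether the recognizer then actually favours the new words is exactly the 
part I am unsure about.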

Best Regards,
Lars


