I would like to improve accuracy by training Tesseract 4 to use a context-specific list of words, for example country names. I have created an eng.finetune.training_text file containing country names as well as common country-related words (e.g. Republic, Island, New). As far as I can tell, this restricts the character set to the characters in the file and gives a reasonable distribution of the characters used in country names. Doing this appears to improve accuracy in my testing so far.
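To check the character set and distribution claim, a minimal sketch along these lines could be used. It is not part of the Tesseract tooling; the function name `char_distribution` is made up for illustration, and the script assumes the training text sits in the current directory as UTF-8.

```python
from collections import Counter
from pathlib import Path


def char_distribution(text: str) -> Counter:
    """Count every non-whitespace character in the training text."""
    return Counter(ch for ch in text if not ch.isspace())


if __name__ == "__main__":
    # Assumed location of the training text; adjust to your own layout.
    path = Path("eng.finetune.training_text")
    if path.exists():
        counts = char_distribution(path.read_text(encoding="utf-8"))
        total = sum(counts.values())
        # Print each character with its count and share, most frequent first.
        for ch, n in counts.most_common():
            print(f"{ch!r}\t{n}\t{n / total:.2%}")
```

The keys of the returned counter are the distinct characters in the file, so any character that appears in the ground truth but not here would be a candidate explanation for recognition errors.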
I have also tried replacing eng.wordlist with the list of country-related words, but this makes accuracy worse, even though every word in my ground truth test set is present in that list. Is eng.wordlist the wrong thing to change here? Is there another file (or combination of files) I need to put my words in?

Any help would be much appreciated.

Thanks,
James

