[tesseract-ocr] Training for specific words

James Q Wed, 04 Jul 2018 08:42:29 -0700

I would like to improve accuracy by training Tesseract 4 to use a 
context-specific list of words, for example country names. I have created an 
eng.finetune.training_text file containing country names as well as common 
country words (e.g. Republic, Island, New). This (as far as I can tell) 
restricts the character set to the characters in the file and represents a 
reasonable distribution of the characters used in country names. Doing this 
appears to improve accuracy in my testing so far.


What I have also tried is replacing eng.wordlist with the list of 
country-related words, but this makes accuracy worse, even though every word 
in my ground truth test set is present in that list.
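
The two checks described above (which characters the training text covers, 
and whether every ground truth word is in the word list) could be sketched 
roughly as follows. This is plain Python and does not call Tesseract; the 
function names and the sample data are hypothetical, made up for illustration, 
and stand in for the contents of eng.finetune.training_text, eng.wordlist and 
the ground truth set.

```python
from collections import Counter


def char_distribution(training_lines):
    """Count every non-whitespace character in the training text lines."""
    counts = Counter()
    for line in training_lines:
        counts.update(ch for ch in line if not ch.isspace())
    return counts


def missing_words(ground_truth_lines, wordlist):
    """Return ground truth words absent from the word list, in first-seen order."""
    known = set(wordlist)
    missing = []
    for line in ground_truth_lines:
        for word in line.split():
            if word not in known and word not in missing:
                missing.append(word)
    return missing


# Hypothetical sample data, not the real files from this post.
training_lines = ["Republic of Ireland", "New Zealand", "Cayman Islands"]
wordlist = ["Republic", "of", "Ireland", "New", "Zealand", "Cayman", "Islands"]
ground_truth = ["New Zealand", "Republic of Ireland"]

dist = char_distribution(training_lines)
print("characters covered:", "".join(sorted(dist)))
print("ground truth words missing from word list:",
      missing_words(ground_truth, wordlist))
```

An empty list from missing_words only confirms that the word list covers the 
test set; it says nothing about how Tesseract weights that list during 
recognition.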

Is eng.wordlist the wrong thing to change here? Is there another file (or 
combination of files) I need to put my words in?

Any help would be much appreciated.

Thanks
James

