[tesseract-ocr] Recommendation on how to best train Tesseract for new UTF-8 symbols

Rafay Kalim Tue, 21 May 2019 07:38:19 -0700

Hey, I am trying to train a new Tesseract model that recognizes only 
certain UTF-8 symbols and ignores English letters and everything else. I 
see two ways to do this: fine-tune the standard English model and then 
blacklist the English characters, or train a completely new model that 
recognizes only these symbols. I would appreciate some input on which of 
these, or any other method, is better in terms of ease, time and accuracy.
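
For illustration only, the blacklist half of the first option might be 
sketched as below. The sketch only builds an option string around 
Tesseract's tessedit_char_blacklist variable. The helper name 
blacklist_config and the choice to blacklist ASCII letters and digits are 
assumptions, not something stated in this post.

```python
import string


def blacklist_config(extra: str = "") -> str:
    """Return a Tesseract '-c' option that blacklists ASCII letters and digits.

    Hypothetical helper: 'extra' lets the caller append further characters
    (for example punctuation) that should also be rejected.
    """
    chars = string.ascii_letters + string.digits + extra
    return f"-c tessedit_char_blacklist={chars}"


if __name__ == "__main__":
    # The resulting string can be appended to a tesseract command line.
    print(blacklist_config())
```

If the pytesseract wrapper is used (also an assumption), the string could 
be passed as config=blacklist_config() to pytesseract.image_to_string. 
Whether the blacklist is honoured may depend on the Tesseract version and 
engine mode, so it is worth checking on a sample image first.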


The context is that I will have various texts on a board and I want to 
find the locations of the symbols. However, I don't want to recognize 
any of the English or anything else, as this may interfere with my 
post-processing. I have tried a few workarounds (like restricting where 
these symbols can be on the board and then only scanning the text in those 
strips), but I am not satisfied with the results. Additionally, I can 
control the font and the size of the text on the board and everything else, 
except the actual codes. 
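
As a hedged sketch of the post-processing side, unwanted text could also 
be dropped after recognition while keeping the boxes of the wanted 
symbols. The function below assumes word-level results in the dictionary 
layout that pytesseract.image_to_data returns with Output.DICT (parallel 
lists under "text", "left", "top", "width", "height"). The name 
keep_symbols and the example symbols are hypothetical.

```python
def keep_symbols(data, allowed):
    """Return (text, (left, top, width, height)) for tokens made only of allowed symbols.

    'data' is assumed to hold parallel lists under the keys
    'text', 'left', 'top', 'width' and 'height'.
    """
    hits = []
    rows = zip(data["text"], data["left"], data["top"],
               data["width"], data["height"])
    for text, left, top, width, height in rows:
        token = text.strip()
        # Skip empty entries and anything containing a non-allowed character.
        if token and all(ch in allowed for ch in token):
            hits.append((token, (left, top, width, height)))
    return hits


if __name__ == "__main__":
    sample = {
        "text": ["Hello", "★"],
        "left": [0, 10], "top": [0, 20],
        "width": [50, 5], "height": [10, 5],
    }
    print(keep_symbols(sample, {"★", "→"}))
```

This filters by character set only. It does not fix cases where the model 
misreads an English letter as one of the target symbols, which is where 
training choices matter.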

Thanks for the help!

