[tesseract-ocr] Evaluating Tesseract with new domain-specific documents

Matthew Hodgskiss Fri, 25 Jan 2019 02:57:27 -0800

Hi,

I am interested in evaluating the performance of Tesseract on some 
domain-specific documents. I would like to establish a baseline using vanilla 
settings and then compare it against runs that use domain-specific user-words 
and user-patterns, as documented here 
<https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage>.
Is it possible to reuse the OCR evaluation process that must be performed 
during model training to calculate word and character error rates on new 
(domain-specific) documents?
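
To make the two runs concrete, here is a minimal sketch of how the baseline and domain-specific invocations might be driven from Python. It assumes the `tesseract` binary is on the PATH and relies on the `--user-words` and `--user-patterns` command-line options; the helper names and the file names (`page.png`, `words.txt`, `patterns.txt`) are hypothetical placeholders, not part of Tesseract.

```python
import subprocess
from typing import List, Optional


def build_command(image: str, output_base: str,
                  user_words: Optional[str] = None,
                  user_patterns: Optional[str] = None) -> List[str]:
    """Build a tesseract command line; extra options are added only when given."""
    cmd = ["tesseract", image, output_base]
    if user_words:
        cmd += ["--user-words", user_words]
    if user_patterns:
        cmd += ["--user-patterns", user_patterns]
    return cmd


def run_ocr(image: str, output_base: str,
            user_words: Optional[str] = None,
            user_patterns: Optional[str] = None) -> str:
    """Run tesseract and return the recognised text.

    Assumes tesseract writes plain text to <output_base>.txt, which is its
    default output format.
    """
    cmd = build_command(image, output_base, user_words, user_patterns)
    subprocess.run(cmd, check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()
```

Calling `run_ocr("page.png", "baseline")` and then `run_ocr("page.png", "domain", "words.txt", "patterns.txt")` would give two outputs for the same image that can be scored against the same ground truth.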


If this is not possible, then I could synthesise my own scan images from 
documents using ImageMagick 
<https://gist.github.com/ThisIsBenny/1e669954d0fd0a945e38d0670c670c3c> but 
it would be good if anyone could recommend a standard algorithm/library for 
calculating character and word error rates.
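
For reference, the kind of calculation I have in mind is sketched below: character and word error rates defined as the Levenshtein edit distance between the ground truth and the OCR output, divided by the length of the ground truth. This is a plain-Python illustration under that assumption, not the evaluation code used in Tesseract training; the function names are my own, and the word tokenisation (whitespace split) is a simplification.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalised by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: compares the strings character by character."""
    return error_rate(reference, hypothesis)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: compares whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())
```

Because the distance is divided by the reference length, either rate can exceed 1.0 when the OCR output contains many spurious insertions.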

Thanks in advance

Matt


