Font Repository

iprovalov Fri, 09 Sep 2011 18:37:45 -0700

We have some old content with relatively low image quality (1800s
periodicals).  We compared Tesseract with Omnipage using random
30-page samples from different publications.  Here are the results
(character-level / word-level accuracy; higher is better):


30 docs, out-of-the-box, no tuning:
1. Omnipage: 92% / 83%
2. Tesseract: 74% /  57%

Single Random Doc Experiment:
1. Omnipage (no training, noise reduction using unpaper): 94% / 86%
2. Tesseract (trained on the same images, noise reduction with
unpaper): 91% / 69%
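
For reference, character-level and word-level figures like those above are usually derived from the edit distance between the OCR output and a ground-truth transcription. Below is a minimal sketch in Python, assuming Levenshtein distance normalized by the length of the reference text. The function names are illustrative and do not come from Tesseract or Omnipage, and our actual scoring may differ in details such as whitespace handling. The functions return an error rate; the matching accuracy is 1 minus that value.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (insert/delete/substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by reference length (assumed convention)."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def char_error_rate(ref: str, hyp: str) -> float:
    # Compare character by character.
    return error_rate(ref, hyp)


def word_error_rate(ref: str, hyp: str) -> float:
    # Compare whitespace-separated tokens.
    return error_rate(ref.split(), hyp.split())


truth = "the quick brown fox"
ocr = "the qu1ck brown f0x"
cer = char_error_rate(truth, ocr)   # 2 substitutions over 19 characters
wer = word_error_rate(truth, ocr)   # 2 wrong words out of 4
```

Note that two wrong characters already spoil half of the words in this small example, which is why word-level figures lag well behind character-level ones.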

We understand that the best-matching font should give the best
results.  We experimented with creating a training set out of the
original image.  The process involves generating the box file in
Tesseract, manually correcting the boxes, and then running a Java
program that extracts the characters from the image and places them in
a new image with corrected spacing, which works around the spacing
problem.  This gave us a 10%-15% improvement, but Tesseract still fell
short of Omnipage at the word level.
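
To make the re-spacing step concrete, here is a rough sketch of the geometry only, written in Python rather than Java for brevity. It is not the program we actually used. It assumes the Tesseract box file layout of one glyph per line, `<char> <left> <bottom> <right> <top> [<page>]`, with the origin at the bottom-left of the image. The names `Box`, `parse_box_file`, and `relayout` are hypothetical, and copying the pixels into the new image is left out.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_file(text: str) -> List[Box]:
    """Parse box lines of the form '<char> <left> <bottom> <right> <top> [<page>]'."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


def relayout(boxes: List[Box], gap: int = 10, margin: int = 10) -> List[Box]:
    """Place glyphs left to right on a single line with a fixed horizontal gap.

    Vertical offsets relative to the lowest glyph are preserved, so
    descenders keep their position relative to the other characters.
    """
    if not boxes:
        return []
    lowest = min(b.bottom for b in boxes)
    out = []
    x = margin
    for b in boxes:
        width = b.right - b.left
        height = b.top - b.bottom
        new_bottom = margin + (b.bottom - lowest)
        out.append(Box(b.char, x, new_bottom, x + width, new_bottom + height))
        x += width + gap  # the pixel copy into the new image would happen here
    return out


sample = "T 10 20 30 50 0\nh 31 20 45 50 0\ne 46 20 58 40 0\n"
original = parse_box_file(sample)
respaced = relayout(original, gap=10, margin=10)
```

The returned boxes describe where each glyph would sit in the new training image; writing them back out in the same format gives the box file that matches it.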

Questions:
1. How do we find the best font matching the original document?
2. Is there a tool to help automate the font matching step?
3. Are there font repositories for Tesseract?

Thank you,

Ivan

