We have some old, relatively low image quality content (1800s periodicals). We compared Tesseract and Omnipage using random 30-page samples from different publications. Here are the results (Character Error Rate / Word Error Rate):
30 docs, out-of-the-box, no tuning:
1. Omnipage: 92% / 83%
2. Tesseract: 74% / 57%

Single Random Doc Experiment:
1. Omnipage (no training, noise reduction using unpaper): 94% / 86%
2. Tesseract (trained on the same images, noise reduction with unpaper): 91% / 69%

We understand that the best-matching font should give the best results. We experimented with creating a training set out of the original image. The process involves creating the box file in Tesseract, manually correcting the boxes, and then running a Java program that extracts the characters from the image and places them in a new image with correct spacing, to solve the spacing problem. This gave us a 10%-15% improvement, but Tesseract still fell short on the word error rate.

Questions:
1. How do we find the font that best matches the original document?
2. Is there a tool to help automate the font matching step?
3. Are there font repositories for Tesseract?

Thank you,
Ivan
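
For readers following along, the re-spacing step described in the message (taking the corrected boxes and placing the characters in a new image with uniform gaps) can be sketched roughly as below. This is not the Java program mentioned above; it is a minimal Python illustration that only computes the new box coordinates. It assumes the usual Tesseract box line layout (`<char> <left> <bottom> <right> <top> <page>`, origin at the bottom-left) and that all input boxes belong to one text line. The names `Box`, `parse_box_line`, `relayout`, and `format_box_line`, and the `gap` and `margin` parameters, are hypothetical. Copying the actual pixels is left out.

```python
from dataclasses import dataclass


@dataclass
class Box:
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int = 0


def parse_box_line(line):
    """Parse one box-file line: '<char> <left> <bottom> <right> <top> <page>'."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"unexpected box line: {line!r}")
    char = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(char, left, bottom, right, top, page)


def format_box_line(box):
    """Serialize a Box back into box-file form."""
    return f"{box.char} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


def relayout(boxes, gap=10, margin=10):
    """Place glyph boxes left to right with a fixed gap between them.

    Assumption: all boxes come from a single text line, so each glyph keeps
    its vertical offset relative to the lowest box (descenders stay below
    the baseline). Returns (new_boxes, canvas_width, canvas_height).
    """
    if not boxes:
        raise ValueError("no boxes to lay out")
    min_bottom = min(b.bottom for b in boxes)
    max_top = max(b.top for b in boxes)
    x = margin
    placed = []
    for b in boxes:
        width = b.right - b.left
        height = b.top - b.bottom
        new_bottom = margin + (b.bottom - min_bottom)
        placed.append(Box(b.char, x, new_bottom, x + width, new_bottom + height, b.page))
        x += width + gap
    canvas_width = x - gap + margin  # drop the trailing gap, add right margin
    canvas_height = (max_top - min_bottom) + 2 * margin
    return placed, canvas_width, canvas_height
```

Each entry returned by `relayout` pairs with its source box by position, so a separate imaging step could crop the source rectangle and paste it at the new coordinates, then write the new box file with `format_box_line`.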

