Hi All, I've been experimenting with tesseract and have been impressed with the accuracy of the software. I'm looking to use tesseract to process around 200 pages of printed material from around 1934. I've attached a sample of the PDF I need to work with.
I'm looking to improve the accuracy of the OCR process as much as possible. Tesseract has a vast, and I admit intimidating, list of options, and I believe some of them could improve the accuracy. Speed of recognition matters less than accuracy for this project. The following steps are what I've found works best so far:

1. Convert the PDF to TIFF:
   convert -density 350 input.pdf -type Grayscale -background white +matte -depth 32 input.tif
2. Clean the TIFF file using the textcleaner script [1]:
   textcleaner -t 25 -s 1 -g input.tif cleaned.tif
3. OCR the cleaned TIFF file:
   tesseract cleaned.tif ./test-ocr

Any thoughts on ways to improve the accuracy will be gratefully received.

With thanks.
-Corey

[1] http://www.fmwconcepts.com/imagemagick/textcleaner/
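For anyone who wants to repeat the three steps on other files, they can be wrapped in a small shell function. This is only a sketch: the function name `ocr_pipeline`, the `DRY_RUN` switch, and the derived output names (`<base>.tif`, `<base>-cleaned.tif`, `<base>-ocr`) are my own illustrative choices, and the command options are exactly the ones listed in the steps, not tuned recommendations.

```shell
#!/bin/sh
# Sketch of the three-step pipeline (PDF -> TIFF -> cleaned TIFF -> OCR).
# ocr_pipeline, DRY_RUN and the output naming scheme are hypothetical;
# the convert/textcleaner/tesseract options are copied from the steps.
ocr_pipeline() {
    pdf="$1"
    base="${pdf%.pdf}"          # strip a trailing .pdf to derive output names

    # If DRY_RUN is set and non-empty, print each command instead of running it.
    run=${DRY_RUN:+echo}

    # 1. Convert the PDF to a grayscale TIFF at 350 dpi.
    $run convert -density 350 "$pdf" -type Grayscale \
        -background white +matte -depth 32 "$base.tif" &&
    # 2. Clean the TIFF with the textcleaner script.
    $run textcleaner -t 25 -s 1 -g "$base.tif" "$base-cleaned.tif" &&
    # 3. OCR the cleaned TIFF; tesseract adds the .txt extension itself.
    $run tesseract "$base-cleaned.tif" "$base-ocr"
}
```

With `DRY_RUN=1` set, calling `ocr_pipeline input.pdf` only prints the three commands, which makes it easy to check them before running the real thing on all 200 pages. The steps are chained with `&&` so that a failed conversion or cleaning step stops the run instead of passing a bad image to tesseract.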
Pages from 1934 filmdailyyearboo00film_4.pdf
Description: Adobe PDF document

