Hi All,

I've been experimenting with tesseract and have been impressed with the 
accuracy of the software. I'm looking to use tesseract to process around 
200 pages of printed material that was printed in around 1934. I've 
attached a sample of the PDF I need to work with. 

I'm looking to improve the accuracy of the OCR process as much as possible. 
Given the vast, and I admit intimidating, list of options available, I 
believe there are ways to do so. Recognition speed matters less than 
accuracy for this project. 

The following steps are what I've found work best so far:

1. Convert the PDF to TIFF

convert -density 350 input.pdf -type Grayscale -background white +matte \
    -depth 32 input.tif


2. Clean the TIFF file using the text cleaner script [1]

textcleaner -t 25 -s 1 -g input.tif cleaned.tif


3. OCR the cleaned TIFF file.

tesseract cleaned.tif ./test-ocr
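
For reference, the three steps above can be chained into one small shell 
function, roughly as sketched below. This is only a sketch: the function 
name ocr_pdf and the derived output file names are placeholders made up for 
illustration, and it assumes convert, textcleaner and tesseract are all on 
the PATH. The options are exactly the ones listed in the steps above.

```shell
#!/bin/sh
# Sketch: run the three steps above on one PDF.
# ocr_pdf and the derived file names are illustrative placeholders.
ocr_pdf() {
    pdf=$1
    base=${pdf%.pdf}    # strip the .pdf suffix to derive output names

    # 1. Convert the PDF to TIFF (same options as step 1).
    convert -density 350 "$pdf" -type Grayscale -background white +matte \
        -depth 32 "$base.tif" &&
    # 2. Clean the TIFF with the textcleaner script (same options as step 2).
    textcleaner -t 25 -s 1 -g "$base.tif" "$base-cleaned.tif" &&
    # 3. OCR the cleaned TIFF; tesseract appends .txt to the output base.
    tesseract "$base-cleaned.tif" "$base-ocr"
}

# Process every PDF given on the command line (does nothing with no arguments).
for pdf in "$@"; do
    ocr_pdf "$pdf"
done
```

Saved as, say, ocr.sh, it would be run as "sh ocr.sh input.pdf". The steps 
are joined with && so that a failure in one step stops the later ones.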


Any thoughts on ways to improve the accuracy will be gratefully received. 


With thanks. 


-Corey


[1] http://www.fmwconcepts.com/imagemagick/textcleaner/


Attachment: Pages from 1934 filmdailyyearboo00film_4.pdf
Description: Adobe PDF document
