Hi All,

Many thanks to those who have replied to my question here on the group, and privately. It has given us some avenues to explore in extracting and preserving this information. I remain impressed by everyone who has contributed to the project, and by its capabilities.

With thanks,
-Corey

On Wednesday, 30 March 2016 02:51:51 UTC+10:30, Tom Morris wrote:
>
> Great to see someone using Tesseract to preserve a little history!
>
> The first thing you should do is start with as close to the original as
> possible. Since you're working with this scan:
> https://archive.org/details/filmdailyyearboo00film_4
> that would be the zip containing the original JPEG2000 images:
> https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_jp2.zip
>
> Note that the Internet Archive runs all uploads through ABBYY FineReader,
> and the output from that is available here:
> https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_abbyy.gz
> Similar to Tesseract's hOCR output, it includes coordinates for all text
> blocks, so if it messed up the page segmentation it should be possible to
> post-process the output to reconstruct the correct flow. You can find an
> ABBYY parser that I wrote for another purpose here:
> https://github.com/tfmorris/oed/blob/master/oedabby.py
>
> If you want to run things through Tesseract to compare quality
> (or just for the fun of it), you should be able to do that directly if your
> copy of Tesseract was built against a version of Leptonica with JPEG2000
> support (mine was). I used this command to produce the attached output:
>
> $ tesseract filmdailyyearboo00film_4_0742.jp2 pg738 hocr
>
> Not surprisingly, Tesseract doesn't get the page segmentation correct.
> You could either preprocess to cut the image into four columns that you
> OCR separately, or post-process the hOCR output to put all the words in
> the correct order.
>
> When I manually crop to just the first column, I get pretty reasonable (to
> my eye) results. Files attached.
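The post-processing approach Tom describes (using the hOCR bounding-box coordinates to put words back into column order) can be sketched in Python. This is a minimal illustration, not production code: it uses a regex rather than a real HTML parser, it assumes the single-quoted attribute style that Tesseract's hOCR output uses, and the `column_edges` x-coordinates are hypothetical values that would have to be chosen by inspecting the actual scan.

```python
# Sketch: reorder hOCR words into column-major reading order.
# Assumes hOCR spans of the form Tesseract emits, e.g.
#   <span class='ocrx_word' title='bbox X0 Y0 X1 Y1; x_wconf NN'>word</span>
import re

WORD_RE = re.compile(
    r"<span class='ocrx_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+)"
    r"[^>]*>([^<]+)</span>"
)

def parse_hocr_words(hocr):
    """Extract (x0, y0, text) for each ocrx_word in an hOCR string."""
    return [(int(m.group(1)), int(m.group(2)), m.group(5))
            for m in WORD_RE.finditer(hocr)]

def reorder_by_columns(words, column_edges):
    """Assign each word to a column by its left edge, then read each
    column top-to-bottom, left-to-right."""
    def column(x):
        for i, edge in enumerate(column_edges):
            if x < edge:
                return i
        return len(column_edges)
    return sorted(words, key=lambda w: (column(w[0]), w[1], w[0]))
```

For a four-column page you would pass three `column_edges` values (the x-coordinates of the gutters between columns); joining the `text` fields of the sorted result then gives the corrected flow.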
>
> Tom
>
> On Tuesday, March 29, 2016 at 2:29:27 AM UTC-4, [email protected] wrote:
>>
>> Hi All,
>>
>> I've been experimenting with Tesseract and have been impressed with the
>> accuracy of the software. I'm looking to use Tesseract to process around
>> 200 pages of material printed in around 1934. I've attached a sample of
>> the PDF I need to work with.
>>
>> I'm looking to improve the accuracy of the OCR process as much as
>> possible. I believe that, within the vast and, I admit, intimidating list
>> of options available, there are ways to improve the accuracy. Speed of
>> recognition isn't as high a priority as accuracy for this project.
>>
>> The following steps are what I've found work best so far:
>>
>> 1. Convert the PDF to TIFF:
>>
>> convert -density 350 input.pdf -type Grayscale -background white +matte -depth 32 input.tif
>>
>> 2. Clean the TIFF file using the textcleaner script [1]:
>>
>> textcleaner -t 25 -s 1 -g input.tif cleaned.tif
>>
>> 3. OCR the cleaned TIFF file:
>>
>> tesseract cleaned.tif ./test-ocr
>>
>> Any thoughts on ways to improve the accuracy will be gratefully received.
>>
>> With thanks,
>>
>> -Corey
>>
>> [1] http://www.fmwconcepts.com/imagemagick/textcleaner/
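For around 200 pages, the convert/textcleaner/tesseract steps above are worth batching. The sketch below shells out to the same three tools with the same flags shown in the thread; the page filenames and working directory are hypothetical, and it assumes ImageMagick's convert, the textcleaner script, and tesseract are all on your PATH.

```python
# Sketch: batch the convert -> textcleaner -> tesseract pipeline.
# Flags mirror the commands quoted above; file naming is made up.
import subprocess
from pathlib import Path

def page_commands(pdf_page, workdir):
    """Build the three commands for one page as argv lists."""
    stem = Path(pdf_page).stem
    tif = str(Path(workdir) / f"{stem}.tif")
    cleaned = str(Path(workdir) / f"{stem}-cleaned.tif")
    out = str(Path(workdir) / f"{stem}-ocr")  # tesseract adds .txt itself
    return [
        ["convert", "-density", "350", pdf_page, "-type", "Grayscale",
         "-background", "white", "+matte", "-depth", "32", tif],
        ["textcleaner", "-t", "25", "-s", "1", "-g", tif, cleaned],
        ["tesseract", cleaned, out],
    ]

def run_pipeline(pages, workdir):
    """Run the full pipeline for each page, stopping on any failure."""
    for page in pages:
        for cmd in page_commands(page, workdir):
            subprocess.run(cmd, check=True)
```

Separating command construction from execution also makes it easy to print or log each command before running it, which helps when tuning the -density or textcleaner parameters page by page.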

