Hi All,

Many thanks to those who have replied to my question here on the group, and 
privately.

It has given us some avenues to explore in extracting and preserving this 
information. 

I remain impressed by the project's capabilities and by everyone who has 
contributed to it. 

With thanks. 

-Corey

On Wednesday, 30 March 2016 02:51:51 UTC+10:30, Tom Morris wrote:
>
> Great to see someone using Tesseract to preserve a little history! 
>
> The first thing you should do is start with as close to the original as 
> possible.  Since you're working with this scan: 
> https://archive.org/details/filmdailyyearboo00film_4
> that would be the zip containing the original JPEG2000 images: 
> https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_jp2.zip
>
> Note that the Internet Archive runs all uploads through ABBYY FineReader 
> and the output from that is available here: 
> https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_abbyy.gz
> Similar to Tesseract's hOCR output, it includes coordinates for all text 
> blocks, so if FineReader messed up the page segmentation it should be 
> possible to post-process the output to reconstruct the correct flow. You 
> can find an ABBYY parser that I wrote for another purpose here: 
> https://github.com/tfmorris/oed/blob/master/oedabby.py
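> As a rough illustration of that post-processing idea (a sketch only: the 
> word tuples and column edges below are invented for the example, not 
> taken from the ABBYY schema), re-flowing words into column reading order 
> from their bounding boxes can look like this:

```python
# Sketch: re-flow OCR words into column reading order using their
# bounding boxes (the kind of coordinates both the ABBYY XML and hOCR
# outputs provide).

def reflow_columns(words, column_edges):
    """words: (text, left, top) tuples; column_edges: sorted x
    coordinates marking the left edge of each column."""
    def column_of(left):
        # Index of the right-most column edge at or left of the word.
        col = 0
        for i, edge in enumerate(column_edges):
            if left >= edge:
                col = i
        return col

    # Read column by column, top to bottom, left to right.
    ordered = sorted(words, key=lambda w: (column_of(w[1]), w[2], w[1]))
    return " ".join(w[0] for w in ordered)

# Toy example: four words from two columns, deliberately shuffled.
words = [("second", 1200, 100), ("first", 100, 100),
         ("column", 100, 140), ("column", 1200, 140)]
print(reflow_columns(words, [0, 1000]))  # first column second column
```

> In practice you would also want to group words into lines before 
> sorting, since the tops of words on the same printed line rarely match 
> exactly.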
>
> If you want to run things through Tesseract to compare for better quality 
> (or just for the fun of it), you should be able to do that directly if your 
> copy of Tesseract was built against a version of Leptonica with JPEG2000 
> support (mine was). I used this command to produce the attached output.
>
> $ tesseract filmdailyyearboo00film_4_0742.jp2 pg738 hocr
>
> Not surprisingly, Tesseract doesn't get the page segmentation correct. 
>  You could either preprocess to cut the image into four columns that you 
> OCR separately or post-process the hOCR output to put all the words in the 
> correct order.
>
> When I manually crop to just the first column, I get pretty reasonable (to 
> my eye) results. Files attached.
>
> Tom
>
>
> On Tuesday, March 29, 2016 at 2:29:27 AM UTC-4, [email protected] 
> wrote:
>>
>> Hi All,
>>
>> I've been experimenting with Tesseract and have been impressed with the 
>> accuracy of the software. I'm looking to use it to process around 200 
>> pages of material printed circa 1934. I've attached a sample of the PDF 
>> I need to work with. 
>>
>> I'm looking to improve the accuracy of the OCR process as much as 
>> possible. I believe that, within the vast (and, I admit, intimidating) 
>> list of available options, there are ways to improve the accuracy. For 
>> this project, recognition speed matters much less than accuracy. 
>>
>> Here are the steps I've found work best so far:
>>
>> 1. Convert the PDF to TIFF
>>
>> convert -density 350 input.pdf -type Grayscale -background white +matte 
>> -depth 32 input.tif
>>
>>
>> 2. Clean the TIFF file using the text cleaner script [1]
>>
>> textcleaner -t 25 -s 1 -g input.tif cleaned.tif
>>
>>
>> 3. OCR the cleaned TIFF file.
>>
>> tesseract cleaned.tif ./test-ocr
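>> The three steps above can be chained into one script; here is a sketch 
>> using Python's subprocess, with the convert/textcleaner/tesseract 
>> arguments copied verbatim from the steps above:

```python
# Chain the convert -> textcleaner -> tesseract steps. Building the
# command lists separately from running them keeps the flags easy to
# inspect and tweak.
import subprocess

def build_pipeline(pdf, workdir="."):
    """Return the three commands, in order, without running them."""
    tif = f"{workdir}/input.tif"
    cleaned = f"{workdir}/cleaned.tif"
    return [
        ["convert", "-density", "350", pdf, "-type", "Grayscale",
         "-background", "white", "+matte", "-depth", "32", tif],
        ["textcleaner", "-t", "25", "-s", "1", "-g", tif, cleaned],
        ["tesseract", cleaned, f"{workdir}/test-ocr"],
    ]

def run_pipeline(pdf):
    for cmd in build_pipeline(pdf):
        subprocess.run(cmd, check=True)  # stop if any step fails
```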
>>
>>
>> Any thoughts on ways to improve the accuracy will be gratefully received. 
>>
>>
>> With thanks. 
>>
>>
>> -Corey
>>
>>
>> [1] http://www.fmwconcepts.com/imagemagick/textcleaner/
>>
>
