Hi All,

Many thanks to those who have replied to my question here on the group, and privately. It has given us some avenues to explore in extracting and preserving this information. I remain impressed by everyone who has contributed to the project, and by its capabilities.

With thanks,
-Corey

On Wednesday, 30 March 2016 02:51:51 UTC+10:30, Tom Morris wrote:
>
> Great to see someone using Tesseract to preserve a little history!
>
> The first thing you should do is start with as close to the original as
> possible. Since you're working with this scan:
> https://archive.org/details/filmdailyyearboo00film_4
> that would be the zip containing the original JPEG2000 images:
> https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_jp2.zip
>
> Note that the Internet Archive runs all uploads through ABBYY FineReader,
> and the output from that is available here:
> https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_abbyy.gz
> Similar to Tesseract's hOCR output, it includes coordinates for all text
> blocks, so if it messed up the page segmentation it should be possible to
> post-process the output to reconstruct the correct flow. You can find an
> ABBYY parser that I wrote for another purpose here:
> https://github.com/tfmorris/oed/blob/master/oedabby.py
>
> If you want to run things through Tesseract to compare quality
> (or just for the fun of it), you should be able to do that directly if your
> copy of Tesseract was built against a version of Leptonica with JPEG2000
> support (mine was). I used this command to produce the attached output:
>
> $ tesseract filmdailyyearboo00film_4_0742.jp2 pg738 hocr
>
> Not surprisingly, Tesseract doesn't get the page segmentation correct.
> You could either preprocess to cut the image into four columns that you
> OCR separately, or post-process the hOCR output to put all the words in
> the correct order.
>
> When I manually crop to just the first column, I get pretty reasonable (to
> my eye) results. Files attached.
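The post-processing approach Tom describes (using the hOCR bounding-box coordinates to put words back into column order) can be sketched in Python. This is a minimal illustration, not production code: it uses a regex rather than a real HTML parser, it assumes the single-quoted attribute style that Tesseract's hOCR output uses, and the `column_edges` x-coordinates are hypothetical values that would have to be chosen by inspecting the actual scan.

```python
# Sketch: reorder hOCR words into column-major reading order.
# Assumes hOCR spans of the form Tesseract emits, e.g.
#   <span class='ocrx_word' title='bbox X0 Y0 X1 Y1; x_wconf NN'>word</span>
import re

WORD_RE = re.compile(
    r"<span class='ocrx_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+)"
    r"[^>]*>([^<]+)</span>"
)

def parse_hocr_words(hocr):
    """Extract (x0, y0, text) for each ocrx_word in an hOCR string."""
    return [(int(m.group(1)), int(m.group(2)), m.group(5))
            for m in WORD_RE.finditer(hocr)]

def reorder_by_columns(words, column_edges):
    """Assign each word to a column by its left edge, then read each
    column top-to-bottom, left-to-right."""
    def column(x):
        for i, edge in enumerate(column_edges):
            if x < edge:
                return i
        return len(column_edges)
    return sorted(words, key=lambda w: (column(w[0]), w[1], w[0]))
```

For a four-column page you would pass three `column_edges` values (the x-coordinates of the gutters between columns); joining the `text` fields of the sorted result then gives the corrected flow.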
>
> Tom
>
> On Tuesday, March 29, 2016 at 2:29:27 AM UTC-4, [email protected] wrote:
>>
>> Hi All,
>>
>> I've been experimenting with Tesseract and have been impressed with the
>> accuracy of the software. I'm looking to use Tesseract to process around
>> 200 pages of material printed in around 1934. I've attached a sample of
>> the PDF I need to work with.
>>
>> I'm looking to improve the accuracy of the OCR process as much as
>> possible. I believe that, within the vast and, I admit, intimidating list
>> of options available, there are ways to improve the accuracy. Speed of
>> recognition isn't as high a priority as accuracy for this project.
>>
>> The following steps are what I've found work best so far:
>>
>> 1. Convert the PDF to TIFF:
>>
>> convert -density 350 input.pdf -type Grayscale -background white +matte -depth 32 input.tif
>>
>> 2. Clean the TIFF file using the textcleaner script [1]:
>>
>> textcleaner -t 25 -s 1 -g input.tif cleaned.tif
>>
>> 3. OCR the cleaned TIFF file:
>>
>> tesseract cleaned.tif ./test-ocr
>>
>> Any thoughts on ways to improve the accuracy will be gratefully received.
>>
>> With thanks,
>>
>> -Corey
>>
>> [1] http://www.fmwconcepts.com/imagemagick/textcleaner/
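For around 200 pages, the convert/textcleaner/tesseract steps above are worth batching. The sketch below shells out to the same three tools with the same flags shown in the thread; the page filenames and working directory are hypothetical, and it assumes ImageMagick's convert, the textcleaner script, and tesseract are all on your PATH.

```python
# Sketch: batch the convert -> textcleaner -> tesseract pipeline.
# Flags mirror the commands quoted above; file naming is made up.
import subprocess
from pathlib import Path

def page_commands(pdf_page, workdir):
    """Build the three commands for one page as argv lists."""
    stem = Path(pdf_page).stem
    tif = str(Path(workdir) / f"{stem}.tif")
    cleaned = str(Path(workdir) / f"{stem}-cleaned.tif")
    out = str(Path(workdir) / f"{stem}-ocr")  # tesseract adds .txt itself
    return [
        ["convert", "-density", "350", pdf_page, "-type", "Grayscale",
         "-background", "white", "+matte", "-depth", "32", tif],
        ["textcleaner", "-t", "25", "-s", "1", "-g", tif, cleaned],
        ["tesseract", cleaned, out],
    ]

def run_pipeline(pages, workdir):
    """Run the full pipeline for each page, stopping on any failure."""
    for page in pages:
        for cmd in page_commands(page, workdir):
            subprocess.run(cmd, check=True)
```

Separating command construction from execution also makes it easy to print or log each command before running it, which helps when tuning the -density or textcleaner parameters page by page.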

