Please forgive me if I make something simple sound confusing.

I am working on transcribing PDFs of scanned paper documents.  I can either
extract the page with pdfimages (from xpdf), which gives me the PDF image
element, or use ImageMagick to make a "page copy" of the PDF.

If I use pdfimages, the resulting image is, to my knowledge, the closest
thing to the original paper scan, but it looks squashed vertically for a
reason I do not understand.  A low-resolution example:
http://s18.postimg.org/592fjf4rt/pdfimagesoutput.png  The line I want ends
up looking like: http://s12.postimg.org/l4q92xonx/title.png

If I run "convert file.pdf file.png", the proportions look good, the
resolution looks good, and the tesseract output is correct
(http://s22.postimg.org/h4b3tevht/workingtitle.png), but I would feel more
in control of the process if I could get a working transcription from the
pdfimages output.

What possible transformations of the pdfimages output could you suggest to 
improve my tesseract output? 

I can see that the working image is twice as tall.  I have tried enlarging
the text with ImageMagick ("convert -resize 200% title.png output.png"),
with only minor improvement.
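
To make the idea concrete, here is a minimal sketch of a height-only
stretch, as opposed to the uniform 200% resize I tried.  It rests on my
assumption that only the height is wrong (the working image is twice as
tall); I have not confirmed that this fixes the tesseract output.  The
function names are hypothetical helpers, and the file names are the ones
from my resize attempt.

```python
import shlex


def height_only_geometry(factor):
    """Return an ImageMagick geometry that keeps width and scales height."""
    # ImageMagick reads "X%xY%" as separate width and height percentages.
    return f"100%x{factor * 100:g}%"


def build_convert_command(src, dst, factor=2.0):
    """Build (but do not run) the convert call for a height-only stretch."""
    return ["convert", src, "-resize", height_only_geometry(factor), dst]


if __name__ == "__main__":
    # 2.0 is a guess, based on the working image being twice as tall.
    cmd = build_convert_command("title.png", "output.png", 2.0)
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to run it.
    print(shlex.join(cmd))
```

The command is only printed so it can be checked before running it; the
stretched output.png would then be fed to tesseract in place of title.png.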

The original document I am using can be found at: 
http://www.muni.org/Departments/finance/controller/CAFR/2011%20CAFR%20Financial%20Section.pdf

Thank you. 

Hans Thompson

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
