Please forgive me if I make anything easy sound confusing. I am working on transcribing PDFs of scanned paper documents. I can either extract the page image with pdfimages (part of xpdf), which gives me the PDF's image element, or use ImageMagick to make a "page copy" of the PDF.
If I use pdfimages, the resulting image is, to my knowledge, the closest thing to the original paper scan, but it looks squashed vertically for a reason I don't understand. Here it is in low resolution: http://s18.postimg.org/592fjf4rt/pdfimagesoutput.png

The line I want ends up looking like this: http://s12.postimg.org/l4q92xonx/title.png

If I run "convert file.pdf file.png" instead, the proportions look good, the resolution looks good, and the tesseract output is correct (http://s22.postimg.org/h4b3tevht/workingtitle.png), but I would feel more in control of the process if I could get a working transcription from the pdfimages output.

What transformations of the pdfimages output could you suggest to improve my tesseract output? I can see that the working image is twice as tall, and I have tried enlarging the text with ImageMagick ("convert -resize 200% title.png output.png"), with only minor improvement.

The original document I am using can be found at: http://www.muni.org/Departments/finance/controller/CAFR/2011%20CAFR%20Financial%20Section.pdf

Thank you.
Hans Thompson

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
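
Given the observation in the message that the working image is twice as tall, one transformation that may be worth trying is stretching only the height of the pdfimages output, rather than scaling both dimensions as "-resize 200%" does. The Python below is a minimal sketch under stated assumptions: the pixel dimensions are hypothetical placeholders (the message does not give them), the helper names are invented for illustration, and it assumes the squashing comes from non-square scan pixels, which the message does not confirm. It only builds the ImageMagick command and does not run it.

```python
import shlex


def vertical_stretch_percent(raw_w, raw_h, ref_w, ref_h):
    """Height-only scale (in percent) that gives the raw image the same
    aspect ratio as a reference image with correct proportions."""
    # (ref_h / ref_w) / (raw_h / raw_w), written with integer products
    return 100.0 * (ref_h * raw_w) / (ref_w * raw_h)


def resize_command(src, dst, percent):
    """Build an ImageMagick command that keeps the width and scales only
    the height. A geometry such as 100%x200% scales each axis separately."""
    return ["convert", src, "-resize", f"100%x{percent:.0f}%", dst]


if __name__ == "__main__":
    # Hypothetical sizes: the pdfimages output and the page rendered by
    # "convert file.pdf file.png". Substitute the real pixel dimensions.
    percent = vertical_stretch_percent(1728, 1100, 1728, 2200)
    print(shlex.join(resize_command("title.png", "stretched.png", percent)))
```

If the stretched image has the same proportions as the "convert" rendering, it can be passed to tesseract in place of the original crop. Whether this improves recognition on this particular scan is untested here.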

