On Mon, Jan 04, 2010 at 11:24:54AM -0800, nguyenq wrote:
> > Is there a standard way to extract text from PDF using tesseract-ocr ?
>
> No, you would have to convert PDF to an image before feeding it to the
> OCR engine. Ghostscript supports such PDF conversion tasks.

I would not recommend that, as it resamples the image. The pdfimages program extracts the raster images from a PDF as they are stored, and you can then feed these to tesseract. If the text is actually stored as text, rather than as images, then pdftotext will extract it directly and no OCR is needed.
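
As a rough sketch, the two routes could be scripted as below. It assumes pdftotext, pdfimages and tesseract are on the PATH, that pdfimages writes its default PPM/PBM output, and that your tesseract build can read those formats. The helper names, the "img" prefix and "text.txt" are made up for illustration.

```python
import glob
import os
import subprocess


def pdftotext_cmd(pdf, txt):
    # Usage: pdftotext PDF-file text-file
    return ["pdftotext", pdf, txt]


def pdfimages_cmd(pdf, prefix):
    # Usage: pdfimages PDF-file image-root
    # Without format options the images are written as PPM/PBM files
    # named after image-root.
    return ["pdfimages", pdf, prefix]


def tesseract_cmd(image, outbase):
    # Usage: tesseract imagename outputbase (writes outputbase.txt)
    return ["tesseract", image, outbase]


def extract(pdf, workdir="."):
    """Hypothetical helper: try both routes and leave the results in workdir."""
    # Route 1: text that is stored as text in the PDF.
    subprocess.run(pdftotext_cmd(pdf, os.path.join(workdir, "text.txt")), check=True)

    # Route 2: embedded raster images, extracted without resampling, then OCRed.
    subprocess.run(pdfimages_cmd(pdf, os.path.join(workdir, "img")), check=True)
    for image in sorted(glob.glob(os.path.join(workdir, "img-*"))):
        if image.endswith((".ppm", ".pbm")):
            outbase = os.path.splitext(image)[0]
            subprocess.run(tesseract_cmd(image, outbase), check=True)
```

Calling extract("scan.pdf") would run both routes. In practice you would look at text.txt first and only bother with OCR when it comes out empty.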

