Re: Extracting text from PDF

Jeffrey Ratcliffe Mon, 04 Jan 2010 14:28:05 -0800

On Mon, Jan 04, 2010 at 11:24:54AM -0800, nguyenq wrote:
> > Is there a standard way to extract text from PDF using tesseract-ocr ?
>
> No, you would have to convert PDF to an image before feeding it to the
> OCR engine. Ghostscript supports such PDF conversion tasks.


I would not recommend that, as rendering the page resamples the image. The
pdfimages program extracts the embedded raster images from the PDF as they
are stored. You can then feed these to tesseract.

If the text is actually stored as text, rather than as images, then
pdftotext will extract it directly and no OCR is needed.
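
For what it's worth, the two routes can be chained in a small script. This is
only a rough sketch, not a tested tool: the function names are made up for
illustration, and it assumes pdftotext, pdfimages and tesseract are on the
PATH and that the installed tesseract can read the files pdfimages writes
(PBM/PPM by default). Older tesseract builds may need the images converted to
TIFF first.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftotext_cmd(pdf):
    # "-" asks pdftotext to write the extracted text to stdout.
    return ["pdftotext", str(pdf), "-"]


def pdfimages_cmd(pdf, image_root):
    # pdfimages writes each embedded image as <image_root>-NNN.<ext>.
    return ["pdfimages", str(pdf), str(image_root)]


def tesseract_cmd(image, output_base):
    # tesseract writes its result to <output_base>.txt.
    return ["tesseract", str(image), str(output_base)]


def extract_text(pdf):
    """Try pdftotext first; fall back to pdfimages + tesseract (hypothetical helper)."""
    result = subprocess.run(
        pdftotext_cmd(pdf), check=True, capture_output=True, text=True
    )
    if result.stdout.strip():
        # The PDF stores real text, so no OCR is needed.
        return result.stdout

    # No text layer found: pull out the raster images and OCR each one.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "img"
        subprocess.run(pdfimages_cmd(pdf, root), check=True)
        pages = []
        # sorted() builds the list before tesseract adds its .txt files.
        for image in sorted(Path(tmp).glob("img-*")):
            base = image.with_suffix("")
            subprocess.run(tesseract_cmd(image, base), check=True)
            pages.append(Path(str(base) + ".txt").read_text(encoding="utf-8"))
        return "\n".join(pages)
```

Calling extract_text("scan.pdf") (the file name is a placeholder) would return
the text layer if there is one, and the OCR output otherwise.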
