Tesseract reads only image files, not PDF. You can convert the PDF pages to
images (TIFF, PNG) and OCR those.
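
A rough sketch of that convert-then-OCR route in Python, in case it helps.
It assumes Poppler's pdftoppm (not the Xpdf build) and tesseract are both on
your PATH; the function names, the "page" file prefix, and the 300 dpi
default are just my placeholders, not anything Tesseract requires.

```python
import subprocess
import sys
from pathlib import Path


def render_command(pdf, prefix, dpi=300):
    """Build the pdftoppm call (assumed: Poppler build) that renders each page to PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_command(image, lang="eng"):
    """Build the tesseract call; tesseract appends .txt to the output base itself."""
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def page_number(image):
    """Extract the page number pdftoppm puts after the last '-' in the file name."""
    return int(Path(image).stem.rsplit("-", 1)[1])


def pdf_to_text(pdf, workdir):
    """Render every page of the PDF into workdir, OCR each page, return the joined text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf, workdir / "page"), check=True)
    texts = []
    for image in sorted(workdir.glob("page-*.png"), key=page_number):
        subprocess.run(ocr_command(image), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)


# Hypothetical usage: python pdf_ocr.py input.pdf workdir
if __name__ == "__main__" and len(sys.argv) == 3:
    print(pdf_to_text(sys.argv[1], sys.argv[2]))
```

Scanned or badly encoded PDFs usually OCR better at 300 dpi or more, so that
is the default here; adjust it and the language code to suit your documents.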

Or use a wrapper around Tesseract that takes a PDF and converts it to text.
Look under add-ons in the wiki.

On Mon, Jan 13, 2020, 00:31 'pjfarley3' via tesseract-ocr <
[email protected]> wrote:

> I installed the 64-bit version of tesseract from UB Mannheim on my Win10
> system but it will not read a PDF as the input "image".
>
> Error messages:
>
> Tesseract Open Source OCR Engine v5.0.0-alpha.20191030 with Leptonica
> Error in pixReadStream: Pdf reading is not supported
> Error in pixRead: pix not read
> Error during processing.
>
> I have tried using the Xpdf command-line tool pdftotext for this task, but
> even the latest V4.02 of pdftotext fails to process some apparently invalid
> character maps (both LATIN1 and UTF-8) for some PDFs I need converted to
> text.
>
> The PDFs are generated by a third party, and I have no influence over them
> to get their PDF mistakes corrected.
>
> I was hoping tesseract might do a better job for my PDF-to-text need.
>
> TIA for any info or suggestions you can provide.
>
> Peter
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/3acec554-e508-4759-8a46-9ab7e1bb6e6f%40googlegroups.com.
>
