Tesseract reads only image files, not PDF. You can convert the PDF to images (TIFF, PNG) and OCR those.
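A rough sketch of that convert-then-OCR route in Python, in case it helps. It assumes Poppler's `pdftoppm` and `tesseract` are both on your PATH; the function names, the `pages` working directory, and the 300 DPI default are my own placeholders, not anything Tesseract prescribes.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, prefix, dpi=300):
    """Build the Poppler command that renders each PDF page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path, lang="eng"):
    """Build the tesseract command; the text lands in <image stem>.txt."""
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")  # tesseract appends .txt itself
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def page_number(image_path):
    """Extract the page number from a name such as page-03.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, work_dir="pages", lang="eng", dpi=300):
    """Render a PDF to page images, OCR each page, and return the joined text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, work / "page", dpi), check=True)

    texts = []
    # Sort numerically so page order holds however pdftoppm pads the numbers.
    for image in sorted(work.glob("page-*.png"), key=page_number):
        subprocess.run(tesseract_command(image, lang), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling `ocr_pdf("input.pdf")` would then return the recognized text of all pages as one string. Since this goes through rendered page images, the broken character maps in the PDF should not matter, but the OCR quality depends on the resolution you render at.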
Alternatively, use one of the wrappers around Tesseract that take a PDF and convert it to text. Look under add-ons in the wiki.

On Mon, Jan 13, 2020, 00:31 'pjfarley3' via tesseract-ocr < [email protected]> wrote:

> I installed the 64-bit version of tesseract from UB Mannheim on my Win10
> system but it will not read a PDF as the input "image".
>
> Error messages:
>
> Tesseract Open Source OCR Engine v5.0.0-alpha.20191030 with Leptonica
> Error in pixReadStream: Pdf reading is not supported
> Error in pixRead: pix not read
> Error during processing.
>
> I have tried using the Xpdf command-line tool pdftotext for this task, but
> even the latest V4.02 of pdftotext fails to process some apparently invalid
> character maps (both LATIN1 and utf-8) for some PDF's I need converted to
> text.
>
> The PDF's are generated by a third party that I have no influence over to
> correct their PDF mistakes.
>
> I was hoping tesseract might do a better job for my PDF-to-text need.
>
> TIA for any info or suggestions you can provide.
>
> Peter

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXgUjgi5uZhQrAorCs58o6ZXVzDWrFoMo9endxvadfkhg%40mail.gmail.com.

