[tesseract-ocr] Can tesseract be used to read a PDF and OCR it to text?

'pjfarley3' via tesseract-ocr Sun, 12 Jan 2020 11:02:09 -0800

I installed the 64-bit version of tesseract from UB Mannheim on my Win10 
system but it will not read a PDF as the input "image".


Error messages:

Tesseract Open Source OCR Engine v5.0.0-alpha.20191030 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.
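
The "Pdf reading is not supported" line comes from Leptonica, the image
library tesseract uses to load its input; it reads raster images, not PDF.
One common workaround is to render each PDF page to an image first and then
OCR the images. The following is only a minimal sketch of that idea, not a
tested recipe for this setup. It assumes Poppler's pdftoppm and tesseract are
both on PATH, and the helper names (rasterize_cmd, ocr_cmd, pdf_to_text) are
hypothetical, invented here for illustration.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    """Build a pdftoppm command (assumes Poppler's pdftoppm is installed).

    -r sets the render resolution in DPI; -png selects PNG output.
    Pages are written as <prefix>-<page number>.png.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, lang="eng"):
    """Build a tesseract command: tesseract <image> <outputbase> -l <lang>.

    tesseract appends .txt to the output base itself.
    """
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def pdf_to_text(pdf, workdir, dpi=300, lang="eng"):
    """Render every page of `pdf` into `workdir`, OCR each page, join the text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / "page", dpi), check=True)

    texts = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(ocr_cmd(image, lang), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

A call such as pdf_to_text("input.pdf", "ocr_work") would then return the
recognized text of all pages. Because the pages are OCRed as images, broken
character maps inside the PDF should not matter, though accuracy depends on
the render resolution and the quality of the page images.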

I have tried the Xpdf command-line tool pdftotext for this task, but even 
the latest version (4.02) fails to process some apparently invalid 
character maps (both Latin1 and UTF-8) in some of the PDFs I need 
converted to text.

The PDFs are generated by a third party, and I have no influence over them 
to get the PDF mistakes corrected.

I was hoping tesseract might do a better job for my PDF-to-text need.

TIA for any info or suggestions you can provide.

Peter

