Re: OCR on PDFs

Nick Burch Thu, 31 Dec 2020 07:53:23 -0800

On Thu, 31 Dec 2020, Peter Kronenberg wrote:

I've got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice.

Is this a PDF where some other tool has already done the OCR and stored the text it found behind the image?

If you highlight the image in Acrobat Reader, does it manage to select some text? If you copy and paste, do you get text out?

Does this PDF have a mixture of "normal" text and images containing text, or is it all just "image text"?



Answers to these will affect how much Tika can help / be configured!

Thanks
Nick
