Sent: Monday, January 4, 2021 11:11 AM
To: user@tika.apache.org
Subject: Re: OCR on PDFs
Sorry for not responding sooner. The file that you attached helps me
understand this question quite a bit.
The basic answer is: no, not yet, not generally. The correct way to do OCR on
PDFs might
On Thu, 31 Dec 2020, Peter Kronenberg wrote:
I've got Tika working with Tesseract on PDF files, but it seems that if
I give it a PDF file that has both searchable text and images, the text
is OCRed twice.
Is this a PDF where some other tool has already done the OCR and stored
the text it
I've got Tika working with Tesseract on PDF files, but it seems that if I give
it a PDF file that has both searchable text and images, the text is OCRed
twice. Is there a way to avoid this? Even if it has to make two passes, one
for the straight text and then another for just the images.