On Thu, 31 Dec 2020, Peter Kronenberg wrote:
> I've got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice.
Is this a PDF where some other tool has already done the OCR and stored the text it found behind the image?
If you highlight the image in Acrobat Reader, does it manage to select some text? If you copy and paste, do you get text out?
Does this PDF have a mixture of "normal" text and images containing text, or is it all just "image text"?
The answers to these will affect how much Tika can help and how it can be configured!

Thanks
Nick
