RE: OCR on PDFs

2021-01-04 Thread Peter Kronenberg
Sent: Monday, January 4, 2021 11:11 AM
To: user@tika.apache.org
Subject: Re: OCR on PDFs
Sorry for not responding sooner. The file that you attached helps me understand this question quite a bit. The basic answer is: no, not yet, not generally. The correct way to do OCR on PDFs might

Re: OCR on PDFs

2020-12-31 Thread Nick Burch
On Thu, 31 Dec 2020, Peter Kronenberg wrote:
> I've got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice.
Is this a PDF where some other tool has already done the OCR and stored the text it

OCR on PDFs

2020-12-31 Thread Peter Kronenberg
I've got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice. Is there a way to avoid this? Even if it has to make two passes, one for the straight text and then another for just the images