Thanks,
I wasn't aware of tesseract; I definitely don't have it on my classpath.
I'm just testing with the standalone Tika jar file. My Java skills are
close to zero (apart from copy/paste and recompiling things). Could you
tell me how to configure this for the standalone jar file, please?
In the end, I'll be using Tika embedded in another app (the eXist native
XML database), which uses two jars: tika-core and tika-parsers. How would
I go about disabling tesseract there?
Apologies for the low-level questions; any help is much appreciated!
Best,
Ron
On 13/04/2016 12:56, Nick Burch wrote:
On Wed, 13 Apr 2016, ron.vandenbranden wrote:
Is it possible to disable text extraction from images inside a PDF
file? I'm testing with the CLI tika app, which has
"extractInlineImages" set to false by default, if I'm not mistaken.
Yet the text of the images is still present in the generated HTML
output. Am I missing something obvious?
Yup, see "Disable Tika OCR" in https://wiki.apache.org/tika/TikaOCR
(or remove tesseract from your path!)
Nick
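
For reference, a minimal sketch of the kind of config file that the "Disable Tika OCR" wiki section describes. It assumes a Tika 1.x release whose tika-parsers jar ships org.apache.tika.parser.ocr.TesseractOCRParser. The file name tika-config.xml is only an illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- tika-config.xml (illustrative name): keep the default parsers,
     but exclude the Tesseract OCR parser so images are not OCR'd -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

With the standalone jar, the file can be passed on the command line, e.g. java -jar tika-app.jar --config=tika-config.xml some.pdf (jar and PDF names are placeholders). When Tika is embedded, the same file can be loaded through one of the TikaConfig constructors, or should be picked up via the tika.config system property. Whether eXist offers a hook for either is an assumption to check against its own documentation.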