Re: disable extraction of images

ron.vandenbranden Wed, 13 Apr 2016 04:18:12 -0700

Thanks,

I wasn't aware of tesseract; I definitely don't have it on my classpath. I'm just testing with the stand-alone tika jar file. My Java skills are close to zero (apart from copy/paste and recompiling things). Could you tell me how to configure this for the standalone jar file, please?

In the end, I'll be using Tika embedded in another app (the eXist native XML database), which uses 2 jars: tika-core and tika-parsers. How would I go about disabling tesseract there?


Apologies for the low-level questions, any help much appreciated!

Best,

Ron


On 13/04/2016 12:56, Nick Burch wrote:

On Wed, 13 Apr 2016, ron.vandenbranden wrote:
Is it possible to disable text extraction from images inside a PDF file? I'm testing with the CLI tika app, which has "extractInlineImages" set to false by default, if I'm not mistaken. Yet, the text of the images still is present in the generated HTML output. Am I missing something obvious?
Yup, see "Disable Tika OCR" in https://wiki.apache.org/tika/TikaOCR (or remove tesseract from your path!)
Nick
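
For readers following the thread, one way to act on the pointer above might look like the sketch below. It only uses the Java standard library to write a Tika config file that keeps the default parsers but excludes the Tesseract OCR parser. The class name `TikaNoOcrConfig` and the output file name `tika-config.xml` are made up for illustration, and whether the `parser-exclude` element is honoured depends on the Tika version in use, so treat this as an assumption to check against the wiki page rather than a confirmed recipe.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Tika): writes a config file that
// excludes the Tesseract OCR parser from Tika's default parser set.
public class TikaNoOcrConfig {

    // Builds the config XML. The exclude entry names the OCR parser class;
    // all other default parsers stay enabled.
    static String configXml() {
        return String.join("\n",
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
            "<properties>",
            "  <parsers>",
            "    <parser class=\"org.apache.tika.parser.DefaultParser\">",
            "      <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>",
            "    </parser>",
            "  </parsers>",
            "</properties>",
            "");
    }

    // Writes the config to the given path (UTF-8) and returns that path.
    static Path writeConfig(Path target) throws IOException {
        Files.write(target, configXml().getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Output file name is an assumption; pass another path as the first argument.
        Path out = writeConfig(Paths.get(args.length > 0 ? args[0] : "tika-config.xml"));
        System.out.println("Wrote " + out.toAbsolutePath());
    }
}
```

With the stand-alone jar, the file could then be passed on the command line, for example `java -jar tika-app.jar --config=tika-config.xml --html file.pdf` (the jar and PDF names are placeholders). When Tika is embedded, the same file can be loaded with `new TikaConfig("tika-config.xml")` and handed to `new AutoDetectParser(config)`; how eXist constructs its parser is not covered in this thread, so that part would need checking on the eXist side.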
