Hi,

Some scanning software includes OCR features and embeds hidden text behind the scanned images to make the resulting PDF searchable. I suspect this may be happening in your case.
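To illustrate what "hidden text" usually means at the PDF level: OCR layers are commonly drawn with text rendering mode 3 (the `3 Tr` operator), which neither fills nor strokes the glyphs. Below is a minimal Python sketch of how one might look for that. It assumes you already have a decoded (uncompressed) page content stream as bytes. The helper name `has_invisible_text` is hypothetical and not part of Tika or PDFBox. The tokenizer is deliberately crude: it ignores `q`/`Q` graphics-state save/restore and nested parentheses in strings.

```python
import re

# Hypothetical helper for illustration only; not a Tika or PDFBox API.
def has_invisible_text(content_stream: bytes) -> bool:
    """Return True if a text-showing operator runs while rendering mode 3 is set."""
    # Crude tokenizer: a literal string "(...)" is one token, everything else
    # is split on whitespace. Assumes no nested parentheses inside strings.
    tokens = re.findall(rb"\((?:\\.|[^\\)])*\)|[^\s()]+", content_stream)
    mode = 0      # the default text rendering mode is 0 (fill)
    prev = None
    for tok in tokens:
        if tok == b"Tr" and prev is not None and prev.isdigit():
            mode = int(prev)          # "<n> Tr" sets the rendering mode
        elif tok in (b"Tj", b"TJ", b"'", b'"') and mode == 3:
            return True               # text shown while invisible
        prev = tok
    return False
```

Real content streams are usually compressed (for example with Flate), so they would need decoding first. A proper implementation would sit on top of a full PDF parser rather than a regular expression.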
It would be technically possible to detect such hidden text and have an option for excluding it from the output, but IIRC such a feature doesn't currently exist in Tika or the underlying PDFBox library.

Best,

Jukka Zitting

On Wed, Apr 13, 2016 at 8:52 AM ron.vandenbranden <[email protected]> wrote:

> Hi again,
>
> On 13/04/2016 13:18, ron.vandenbranden wrote:
>
> > I wasn't aware of tesseract; I definitely don't have it on my classpath.
> > I'm just testing with the stand-alone tika jar file. My Java skills are
> > close to zero (apart from copy/paste and recompiling things). Could you
> > tell me how to configure this for the standalone jar file, please?
>
> Ok, answering my own question: per the documentation at
> https://tika.apache.org/1.12/gettingstarted.html, I got the CLI app
> working with a configuration file with following command line arguments:
>
>     java -jar tika-app-1.12.jar --gui --config=tika-config.xml
>
> I'm using the example configuration file from
> https://wiki.apache.org/tika/TikaOCR#Disable_Tika_OCR, excluding the
> TesseractOCRParser.
>
> Yet, this does not seem to change anything: the image content is still
> extracted. Any idea what could be wrong?
>
> Best,
>
> Ron
> <http://www.facebook.com/KANTL.be>
