Hi,
Some scanning software includes OCR features and embeds hidden text behind
the scanned images to make the resulting PDF searchable. I suspect this may
be happening in your case.
It would be technically possible to detect such hidden text and have an
option for excluding it from the output, bu
Hi again,
On 13/04/2016 13:18, ron.vandenbranden wrote:
I wasn't aware of tesseract; I definitely don't have it on my
classpath. I'm just testing with the stand-alone tika jar file. My
Java skills are close to zero (apart from copy/paste and recompiling
things). Could you tell me how to conf
Thanks,
I wasn't aware of tesseract; I definitely don't have it on my classpath.
I'm just testing with the stand-alone tika jar file. My Java skills are
close to zero (apart from copy/paste and recompiling things). Could you
tell me how to configure this for the standalone jar file, please?
On Wed, 13 Apr 2016, ron.vandenbranden wrote:
Is it possible to disable text extraction from images inside a PDF file?
I'm testing with the CLI tika app, which has "extractInlineImages" set
to false by default, if I'm not mistaken. Yet the text of the images
is still present in the generated H
Hi,
I've just happily discovered Tika and am sorting out how well it fits
our needs.
I'm trying to create a searchable index for PDF files that contain typed
pages and pages with scanned text facsimiles. Some of those facsimiles
are scans from print source materials, in which case Tika see