Hi,
I've just happily discovered Tika and am sorting out how well it fits
our needs.
I'm trying to create a searchable index for PDF files that contain both
typed pages and pages with scanned text facsimiles. Some of those
facsimiles are scans of printed source material, and Tika seems to be
able to index their text content as well. Impressive though that is,
we're currently only interested in the actual text content of the PDF,
not the text in the images inside the PDF.
Is it possible to disable text extraction from images inside a PDF file?
I'm testing with the Tika CLI app, which has "extractInlineImages" set
to false by default, if I'm not mistaken. Yet the text from the images
is still present in the generated HTML output. Am I missing something
obvious?
Kind regards,
Ron