Hi,

I've just happily discovered Tika and am sorting out how well it fits our needs.

I'm trying to create a searchable index for PDF files that contain both typed pages and pages with scanned text facsimiles. Some of those facsimiles are scans of printed source materials, in which case Tika seems to be able to index their text contents as well. Impressive though that is, we're currently only interested in the actual text content of the PDF, not the text in the images embedded in the PDF.

Is it possible to disable text extraction from images inside a PDF file? I'm testing with the Tika app CLI, which has "extractInlineImages" set to false by default, if I'm not mistaken. Yet the text from the images is still present in the generated HTML output. Am I missing something obvious?

Kind regards,

Ron
