disable extraction of images

ron.vandenbranden Wed, 13 Apr 2016 03:52:46 -0700

Hi,

I've just happily discovered Tika and am sorting out how well it fits our needs.

I'm trying to create a searchable index for PDF files that contain typed pages and pages with scanned text facsimiles. Some of those facsimiles are scans of print source materials, in which case Tika seems to be able to index their text contents as well. Impressive though that is, we're currently only interested in the actual text content of the PDF, not the text in the images inside the PDF.

Is it possible to disable text extraction from images inside a PDF file? I'm testing with the CLI tika app, which has "extractInlineImages" set to false by default, if I'm not mistaken. Yet the text of the images is still present in the generated HTML output. Am I missing something obvious?


Kind regards,

Ron
