To clarify: (1) Does tika-app, as compiled, give no indication that an image exists within a PDF? My main interest is full-page images in PDFs that were scanned, and my first priority is simply detecting whether embedded images exist. (2) Is the -z option effectively disabled for PDFs? (3) Is there a way to enable detection and/or extraction from the command line, as opposed to editing the source?
On Wed, May 13, 2015 at 12:18 PM, Allison, Timothy B. <[email protected]> wrote:

> By default, Tika is configured not to extract embedded images from PDFs
> because in some edge cases, there can be thousands of images in some small
> PDF files (see https://issues.apache.org/jira/browse/TIKA-1294). Our
> choice to have the default be “don’t extract” was based on the concern that
> if we made the change, devops folks in large document processing pipelines
> might be surprised by memory consumption and far slower parsing.
>
> To configure Tika to extract embedded images, you can configure a
> PDFParserConfig (setExtractInlineImages(true)) and attach that to a
> ParseContext before the parse, or (if you are just using tika-app) you can
> set that value manually in the app jar in
> o.a.t.parser.pdf.PDFParser.properties.
>
> I haven’t tested whether our OCR parser will process those embedded
> images, but it should.
>
> Let me know if this helps.
>
> *From:* Stefan Alder [mailto:[email protected]]
> *Sent:* Wednesday, May 13, 2015 3:04 PM
> *To:* [email protected]
> *Subject:* Embedded images in PDF - detect, extract and/or OCR
>
> Ultimately I'm trying to (1) determine whether images, particularly
> full-page images, are embedded in a PDF, and (2) extract the images and/or
> (3) OCR the text.
>
> Does tika-app support this? When I run java -jar tika-app-1.8.jar
> test.pdf, I get all of the metadata, and see <page></page> tags but no
> images.
>
> Running with -z doesn't output any images.
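
For the tika-app route mentioned in the quoted reply (setting the value by hand inside the app jar), here is a minimal sketch that uses only the JDK's zip filesystem. It rests on assumptions not confirmed in this thread: that "o.a.t." expands to the entry path org/apache/tika/parser/pdf/PDFParser.properties, that the key is named extractInlineImages, and that entries are written as "key value". The class and method names are hypothetical. Check the file in your own jar first, and work on a copy.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EnableInlineImages {
    // Assumed entry path inside the tika-app jar (expansion of "o.a.t.").
    static final String ENTRY = "/org/apache/tika/parser/pdf/PDFParser.properties";
    // Assumed property name; verify it against the file in your jar.
    static final String KEY = "extractInlineImages";

    // Returns a copy of the lines with KEY set to true, appending it if absent.
    static List<String> withInlineImagesEnabled(List<String> lines) {
        List<String> out = new ArrayList<>();
        boolean found = false;
        for (String line : lines) {
            String t = line.trim();
            // Match the exact key followed by a separator, not a longer key name.
            boolean isKey = t.startsWith(KEY)
                    && (t.length() == KEY.length()
                        || " \t=:".indexOf(t.charAt(KEY.length())) >= 0);
            if (isKey) {
                out.add(KEY + " true");
                found = true;
            } else {
                out.add(line);
            }
        }
        if (!found) {
            out.add(KEY + " true");
        }
        return out;
    }

    // Rewrites the properties entry in place inside the given jar.
    static void patchJar(Path jar) throws IOException {
        URI uri = URI.create("jar:" + jar.toUri());
        try (FileSystem fs = FileSystems.newFileSystem(uri, Collections.<String, Object>emptyMap())) {
            Path entry = fs.getPath(ENTRY);
            // Properties files are ISO-8859-1 by convention.
            List<String> lines = Files.readAllLines(entry, StandardCharsets.ISO_8859_1);
            Files.write(entry, withInlineImagesEnabled(lines), StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java EnableInlineImages <path-to-copy-of-tika-app.jar>");
            return;
        }
        patchJar(Paths.get(args[0]));
    }
}
```

After patching a copy of the jar, rerunning it with -z on the same PDF should show whether inline images are now extracted. This only changes the packaged default; it does not add a command-line switch.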
