By default, Tika does not extract embedded images from PDFs because, in some edge cases, a small PDF file can contain thousands of images (see https://issues.apache.org/jira/browse/TIKA-1294). We kept “don’t extract” as the default out of concern that, if we changed it, devops folks running large document processing pipelines might be surprised by higher memory consumption and far slower parsing.
To configure Tika to extract embedded images, you can configure a PDFParserConfig (setExtractInlineImages(true)) and attach it to a ParseContext before the parse, or (if you are just using tika-app) you can set that value manually in the app jar in o.a.t.parser.pdf.PDFParser.properties. I haven’t tested whether our OCR parser will process those embedded images, but it should.

Let me know if this helps.

From: Stefan Alder [mailto:[email protected]]
Sent: Wednesday, May 13, 2015 3:04 PM
To: [email protected]
Subject: Embedded images in PDF - detect, extract and/or OCR

Ultimately I'm trying to (1) determine whether images, particularly full-page images, are embedded in a PDF, and (2) extract the images and/or (3) OCR the text. Does tika-app support this? When I run java -jar tika-app-1.8.jar test.pdf, I get all of the metadata and see <page></page> tags but no images. Running with -z doesn't output any images.
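
For the PDFParserConfig route mentioned in the reply above, a minimal sketch might look like the following. The class name ExtractInlineImagesSketch and the helper buildContext are made up for illustration, and the sketch assumes tika-parsers is on the classpath and that the PDF path is passed as the first argument. Registering the parser in the ParseContext is an assumption about what is needed for the extracted images to be parsed recursively; it has not been verified against the OCR parser.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractInlineImagesSketch {

    // Hypothetical helper: builds a ParseContext that asks the PDF parser
    // to extract inline images.
    static ParseContext buildContext(Parser parser) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        // Assumption: registering the parser lets embedded documents
        // (here, the extracted images) be handed back to Tika for parsing.
        context.set(Parser.class, parser);
        return context;
    }

    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = buildContext(parser);

        // -1 disables the default write limit on extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```

Keep in mind the caveat from TIKA-1294: with this setting on, a PDF with very many inline images can make parsing much slower and more memory-hungry.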
