RE: Embedded images in PDF - detect, extract and/or OCR

Allison, Timothy B. Wed, 13 May 2015 12:19:48 -0700

By default, Tika does not extract embedded images from PDFs because, in some 
edge cases, even a small PDF file can contain thousands of images 
(see https://issues.apache.org/jira/browse/TIKA-1294).  We chose “don’t 
extract” as the default out of concern that, if we made the change, devops 
folks running large document processing pipelines might be surprised by 
higher memory consumption and far slower parsing.


To configure Tika to extract embedded images, you can create a 
PDFParserConfig, call setExtractInlineImages(true) on it, and attach it to a 
ParseContext before the parse, or (if you are just using tika-app) you can set 
that value manually in the app jar in o.a.t.parser.pdf.PDFParser.properties.
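
A minimal sketch of the ParseContext route, assuming tika-parsers is on the 
classpath. The class name InlineImageSketch and the buildContext helper are 
made up for illustration, and the input path is taken from the first 
command-line argument; treat it as a starting point rather than a tested 
recipe.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name, used only for this illustration.
public class InlineImageSketch {

    // Build a ParseContext that asks the PDF parser to extract inline images.
    static ParseContext buildContext(Parser parser) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        // Registering the parser lets Tika hand embedded documents
        // (including the extracted images) back to it for parsing.
        context.set(Parser.class, parser);
        return context;
    }

    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = buildContext(parser);

        BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```

For the tika-app route, I would expect the matching entry in 
PDFParser.properties to be named extractInlineImages, mirroring the setter, 
but please check the file in your own jar before relying on that.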

I haven’t tested whether our OCR parser will process those embedded images, 
but it should.

Let me know if this helps.

From: Stefan Alder [mailto:[email protected]]
Sent: Wednesday, May 13, 2015 3:04 PM
To: [email protected]
Subject: Embedded images in PDF - detect, extract and/or OCR

Ultimately I'm trying to (1) determine whether images, particularly full-page 
images, are embedded in a PDF, and (2) extract the images and/or (3) OCR the 
text.

Does tika-app support this?  When I run java -jar tika-app-1.8.jar test.pdf, I 
get all of the metadata, and see <page></page> tags but no images.

Running with -z doesn't output any images.
