RE: Embedded images in PDF - detect, extract and/or OCR

Allison, Timothy B. Wed, 13 May 2015 12:50:55 -0700

Hi Stefan,


1) Right, out of the box, tika-app does not report whether an embedded/inline 
image exists.  It will handle “attached” images as all other parsers do out of 
the box, but not embedded/inline images.

2) Disabled for inline images, but not for regular attachments.

3) At this point, no.  One hack is to unzip the app jar, change the values in 
the properties file, and rezip the jar.  On the horizon, I’d like to make a 
common interface for parser configuration so that you can set parser config 
parameters via the regular tika config file, and then you’d be able to specify 
that file at the command line.


If you do change the properties file, you’ll probably also want to change 
extractUniqueInlineImagesOnly to “false”.

Cheers,

          Tim

From: Stefan Alder [mailto:[email protected]]
Sent: Wednesday, May 13, 2015 3:30 PM
To: [email protected]
Subject: Re: Embedded images in PDF - detect, extract and/or OCR

To clarify,
(1) tika-app, as compiled, does not provide any indication that an image exists 
within a PDF? (My main interest is entire-page images in PDFs that were 
scanned.) Again, my first interest is detecting whether embedded images exist.
(2) the -z option is effectively disabled for PDFs?
(3) is there a way to enable detection and/or extraction from the command line, 
as opposed to editing the source?



On Wed, May 13, 2015 at 12:18 PM, Allison, Timothy B. <[email protected]> wrote:
By default, Tika is configured not to extract embedded images from PDFs because, 
in some edge cases, a small PDF file can contain thousands of images 
(see https://issues.apache.org/jira/browse/TIKA-1294).  Our choice to have the 
default be “don’t extract” was based on the concern that, if we extracted by 
default, devops folks in large document processing pipelines might be surprised 
by memory consumption and far slower parsing.

To configure Tika to extract embedded images, you can configure a 
PDFParserConfig (setExtractInlineImages(true)) and attach that to a 
ParseContext before the parse, or (if you are just using tika-app) you can set 
that value manually in the app jar in o.a.t.parser.pdf.PDFParser.properties.
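
A minimal sketch of the programmatic route described above, assuming the Tika parsers jar is on the classpath. The class name InlineImageConfigSketch and the helper buildContext are made up for illustration; the setExtractUniqueInlineImagesOnly(false) call reflects the suggestion made later in this thread, and registering the parser in the context is an assumption about how embedded images get handed on for further parsing. This has not been run against tika-app 1.8 specifically.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name, for illustration only.
public class InlineImageConfigSketch {

    // Builds a ParseContext that asks the PDF parser to extract inline images.
    static ParseContext buildContext(Parser parser) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        // Per the thread: also extract repeated images, not only unique ones.
        pdfConfig.setExtractUniqueInlineImagesOnly(false);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        // Assumption: registering the parser lets embedded images be parsed
        // recursively rather than skipped.
        context.set(Parser.class, parser);
        return context;
    }

    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = buildContext(parser);
        // -1 disables the handler's default write limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        // args[0] is the path to the PDF to parse.
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```

Run it with the path to a PDF as the only argument. Whether the extracted images then reach the OCR parser is not something this sketch verifies.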

I haven’t tested whether our OCR parser will process those embedded images, 
but it should.

Let me know if this helps.

From: Stefan Alder [mailto:[email protected]]
Sent: Wednesday, May 13, 2015 3:04 PM
To: [email protected]
Subject: Embedded images in PDF - detect, extract and/or OCR

Ultimately I'm trying to (1) determine whether images, particularly full-page 
images, are embedded in a PDF, and (2) extract the images and/or (3) OCR the 
text.

Does tika-app support this?  When I run java -jar tika-app-1.8.jar test.pdf, I 
get all of the metadata and see <page></page> tags, but no images.

Running with -z doesn't output any images.
