Hi Stefan,
1) Right, out of the box, tika-app does not provide information about
whether an embedded/inline image exists. It will handle “attached” images as
all other parsers do out of the box, but not embedded/inline images.
2) Yes, -z is effectively disabled for inline images, but not for regular attachments.
3) At this point, no. One hack is to unzip the app jar and just change
the values in the properties file and rezip the jar. On the horizon, I’d like
to make a common interface for parser configuration so that you can set parser
config parameters via the regular tika config file, and then you’d be able to
specify that on the command line.
If you do change the properties file, you’ll probably also want to change
extractUniqueInlineImagesOnly to “false”.
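For anyone who would rather script the unzip/edit/rezip hack, here is a minimal sketch that uses only the JDK's zip file system. It makes several assumptions: the entry path is "o.a.t." spelled out as org/apache/tika, the property key for inline images is extractInlineImages (inferred from the setter name, so check it against the file in your jar), and the class and method names are made up for illustration. Run it against a copy of the app jar, not the original.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Tika.
final class PdfParserPropertiesPatcher {

    // Assumed location of the properties file inside the app jar.
    static final String ENTRY = "/org/apache/tika/parser/pdf/PDFParser.properties";

    // Rewrites the lines whose key appears in overrides as "key value";
    // comments, blank lines, and all other keys are left untouched.
    static String patch(String text, Map<String, String> overrides) {
        List<String> out = new ArrayList<>();
        for (String line : text.split("\\R", -1)) {
            // A properties key ends at the first whitespace, '=' or ':'.
            String key = line.trim().split("[\\s=:]", 2)[0];
            out.add(overrides.containsKey(key) ? key + " " + overrides.get(key) : line);
        }
        return String.join("\n", out);
    }

    // Opens the jar as a zip file system and overwrites the entry in place.
    static void patchJar(Path jarCopy, Map<String, String> overrides) throws IOException {
        try (FileSystem fs = FileSystems.newFileSystem(jarCopy, (ClassLoader) null)) {
            Path entry = fs.getPath(ENTRY);
            String text = new String(Files.readAllBytes(entry), StandardCharsets.ISO_8859_1);
            Files.write(entry, patch(text, overrides).getBytes(StandardCharsets.ISO_8859_1));
        }
    }
}
```

Calling patchJar on the copied jar with a map of extractInlineImages to "true" and extractUniqueInlineImagesOnly to "false" should have the same effect as editing the file by hand.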
Cheers,
Tim
From: Stefan Alder [mailto:[email protected]]
Sent: Wednesday, May 13, 2015 3:30 PM
To: [email protected]
Subject: Re: Embedded images in PDF - detect, extract and/or OCR
To clarify,
(1) tika-app, as compiled, does not provide any indication that an image exists
within a PDF? (My main interest is full-page images in PDFs that were
scanned.) Again, my first interest is detecting whether embedded images exist.
(2) the -z option is effectively disabled for PDFs?
(3) is there a way to enable detection and/or extraction from the command line,
as opposed to editing the source?
On Wed, May 13, 2015 at 12:18 PM, Allison, Timothy B. <[email protected]> wrote:
By default, Tika is configured not to extract embedded images from PDFs because
in some edge cases, there can be thousands of images in some small PDF files
(see https://issues.apache.org/jira/browse/TIKA-1294). Our choice to have the
default be “don’t extract” was based on the concern that if we made extraction
the default, devops folks in large document processing pipelines might be
surprised by memory consumption and far slower parsing.
To configure Tika to extract embedded images, you can configure a
PDFParserConfig (setExtractInlineImages(true)) and attach that to a
ParseContext before the parse, or (if you are just using tika-app) you can set
that value manually in the app jar in o.a.t.parser.pdf.PDFParser.properties.
I haven’t tested whether our OCR parser will process those embedded images,
but it should.
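A rough sketch of the PDFParserConfig route, aimed at the "does this PDF contain images?" question. It assumes tika-parsers is on the classpath. The class name InlineImageProbe, the counting extractor, and the "image/" content-type filter are illustrative assumptions, not something I have verified against every PDF.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

// Hypothetical helper; not part of Tika.
final class InlineImageProbe {

    // Returns how many embedded resources reported an image/* content type.
    static int countInlineImages(Path file) throws Exception {
        PDFParserConfig config = new PDFParserConfig();
        config.setExtractInlineImages(true);            // off by default (TIKA-1294)
        config.setExtractUniqueInlineImagesOnly(false); // count repeated images too

        AtomicInteger images = new AtomicInteger();
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);
        // Count embedded resources instead of parsing them. Note that this
        // replaces the default extractor, so nothing is handed to OCR here.
        context.set(EmbeddedDocumentExtractor.class, new EmbeddedDocumentExtractor() {
            @Override
            public boolean shouldParseEmbedded(Metadata metadata) {
                return true;
            }

            @Override
            public void parseEmbedded(InputStream stream, ContentHandler handler,
                                      Metadata metadata, boolean outputHtml) {
                // Assumption: the PDF parser sets an image/* content type on images.
                String type = metadata.get(Metadata.CONTENT_TYPE);
                if (type != null && type.startsWith("image/")) {
                    images.incrementAndGet();
                }
            }
        });

        try (InputStream in = Files.newInputStream(file)) {
            // -1 disables the handler's default character limit.
            new AutoDetectParser().parse(in, new BodyContentHandler(-1), new Metadata(), context);
        }
        return images.get();
    }
}
```

To actually parse or OCR the images rather than count them, you would leave the default extractor in place and set the parser itself on the context (context.set(Parser.class, parser)) so embedded documents are handled recursively.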
Let me know if this helps.
From: Stefan Alder [mailto:[email protected]]
Sent: Wednesday, May 13, 2015 3:04 PM
To: [email protected]
Subject: Embedded images in PDF - detect, extract and/or OCR
Ultimately I'm trying to (1) determine whether images, particularly full-page
images, are embedded in a PDF, and (2) extract the images and/or (3) OCR the
text.
Does tika-app support this? When I run java -jar tika-app-1.8.jar test.pdf, I
get all of the metadata, and see <page></page> tags but no images.
Running with -z doesn't output any images.