Re: Embedded images in PDF - detect, extract and/or OCR

Stefan Alder Wed, 13 May 2015 12:31:37 -0700

To clarify,
(1) tika-app, as compiled, does not provide any indication that an image
exists within a PDF? (My main interest is entire-page images in PDFs that
were scanned.) Again, my first interest is detecting whether embedded
images exist.
(2) the -z option is effectively disabled for PDFs?
(3) is there a way to enable detection and/or extraction from the command
line, as opposed to editing the source?




On Wed, May 13, 2015 at 12:18 PM, Allison, Timothy B. <[email protected]>
wrote:

>  By default, Tika is configured not to extract embedded images from PDFs
> because in some edge cases, there can be thousands of images in some small
> PDF files (see https://issues.apache.org/jira/browse/TIKA-1294).  Our
> choice to have the default be “don’t extract” was based on the concern that
> if we made the change, devops folks in large document processing pipelines
> might be surprised by memory consumption and far slower parsing.
>
>
>
> To configure Tika to extract embedded images, you can configure a
> PDFParserConfig (setExtractInlineImages(true)) and attach that to a
> ParseContext before the parse, or (if you are just using tika-app) you can
> set that value manually in the app jar in
> o.a.t.parser.pdf.PDFParser.properties.
>
>
>
> I haven’t tested whether our OCR parser will process those embedded
> images, but it should.
>
>
>
> Let me know if this helps.
>
>
>
> *From:* Stefan Alder [mailto:[email protected]]
> *Sent:* Wednesday, May 13, 2015 3:04 PM
> *To:* [email protected]
> *Subject:* Embedded images in PDF - detect, extract and/or OCR
>
>
>
> Ultimately I'm trying to (1) determine whether images, particularly
> full-page images, are embedded in a PDF, and (2) extract the images
> and/or (3) OCR the text.
>
>
>
> Does tika-app support this?  When I run java -jar tika-app-1.8.jar
> test.pdf, I get all of the metadata, and see <page></page> tags but no
> images.
>
>
>
> Running with -z doesn't output any images.
>
>
>
>
>
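
For reference, the programmatic route described in the quoted reply (a
PDFParserConfig with setExtractInlineImages(true), attached to a
ParseContext before the parse) might look roughly like the sketch below.
The class name PdfInlineImages and the helper imageExtractingContext are
made up for illustration, and registering the parser in the context so
that embedded images are handed back to Tika is an assumption about the
intended setup, not something stated in the thread. Whether the OCR
parser then picks up those images is, as noted above, untested.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class PdfInlineImages {

    // Hypothetical helper: builds a ParseContext that asks the PDF parser
    // to extract inline images (off by default, see TIKA-1294).
    static ParseContext imageExtractingContext(Parser parser) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        // Assumption: registering the parser lets embedded images be
        // passed back to Tika for recursive parsing.
        context.set(Parser.class, parser);
        return context;
    }

    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = imageExtractingContext(parser);
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();

        // args[0] is the path to the PDF, e.g. test.pdf
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```

Expect the memory and speed costs mentioned in the reply when a PDF
contains many images, so this is probably best enabled per document
rather than globally in a large pipeline.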
