Re: Extract PDF inline images

Andrea Asta Tue, 07 Jul 2015 03:22:30 -0700

Hi Tim,
thanks for your response, but I still can't get a complete solution working.

I've created a class based on the FileEmbeddedDocumentExtractor from
TikaCLI, and now I'm trying to write a sample main program that parses a
PDF containing some images.
This is my code, but no images are stored, and when I step through with a
debugger the methods of DocumentExtractor are never called.
Thanks
Andrea


RecursiveParserWrapper parser = new RecursiveParserWrapper(
      new AutoDetectParser(),
      new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();

FileEmbeddedDocumentExtractor extractor = new FileEmbeddedDocumentExtractor();
context.set(FileEmbeddedDocumentExtractor.class, extractor);

PDFParserConfig config = new PDFParserConfig();
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(true);
context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);

context.set(org.apache.tika.parser.Parser.class, new AutoDetectParser());

InputStream is = PdfRecursiveExample.class.getResourceAsStream("/my.PDF");
ToXMLContentHandler handler = new ToXMLContentHandler(
      new FileOutputStream(new File("out.txt")), "UTF-8");
parser.parse(is, handler, metadata, context);
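
One possible cause, offered as an assumption rather than a confirmed diagnosis: ParseContext appears to store each entry under the exact class passed to set(), while a parser would look the extractor up under the EmbeddedDocumentExtractor interface. An extractor registered under FileEmbeddedDocumentExtractor.class would then never be found. The sketch below models that behaviour with a hypothetical class-keyed map; Context and both extractor types are stand-ins written for illustration, not Tika's own classes.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextKeyDemo {
    // Hypothetical stand-ins for the Tika types discussed in this thread.
    interface EmbeddedDocumentExtractor {}
    static class FileEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {}

    // Simplified model of a context keyed by class name
    // (an assumption about how ParseContext behaves).
    static class Context {
        private final Map<String, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key.getName(), value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key.getName()));
        }
    }

    public static void main(String[] args) {
        FileEmbeddedDocumentExtractor extractor = new FileEmbeddedDocumentExtractor();

        // Registered under the concrete class: a lookup by the interface finds nothing.
        Context wrongKey = new Context();
        wrongKey.set(FileEmbeddedDocumentExtractor.class, extractor);
        System.out.println(wrongKey.get(EmbeddedDocumentExtractor.class));

        // Registered under the interface: the same lookup returns the extractor.
        Context rightKey = new Context();
        rightKey.set(EmbeddedDocumentExtractor.class, extractor);
        System.out.println(rightKey.get(EmbeddedDocumentExtractor.class) == extractor);
    }
}
```

If that is what is happening here, the change to try in the snippet above would be context.set(EmbeddedDocumentExtractor.class, extractor).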

2015-07-06 12:59 GMT+02:00 Allison, Timothy B. <[email protected]>:

>  Hi Andrea,
>
> The RecursiveParserWrapper, as you found, is only for extracted content and
> metadata. It was designed to cache metadata and content from embedded
> documents so that you can easily keep those two things together for each
> embedded document.
>
> To extract the raw bytes from embedded files, try implementing an
> EmbeddedDocumentExtractor and passing that into the ParseContext. Take a
> look at
> http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
> and specifically the inner class MyEmbeddedDocumentExtractor for an
> example. As another example, look at
> http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
> and specifically the inner class FileEmbeddedDocumentExtractor.
>
> Basically, in parseEmbedded, just copy the InputStream to a
> FileOutputStream, and you should be good to go.
>
>
>
> public boolean shouldParseEmbedded(Metadata metadata) {
>     return true;
> }
>
> public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler,
>         Metadata metadata, boolean b) throws SAXException, IOException {
>
> Best,
>
> Tim
>
> *From:* Andrea Asta [mailto:[email protected]]
> *Sent:* Monday, July 06, 2015 6:11 AM
> *To:* [email protected]
> *Subject:* Extract PDF inline images
>
>
>
> Hello,
>
> I'm trying to store the inline images from a PDF to a local folder, but
> can't find any valid example. I can only use the RecursiveParserWrapper to
> get all the available metadata, but not the binary image content.
>
> This is my code:
>
> RecursiveParserWrapper parser = new RecursiveParserWrapper(
>       new AutoDetectParser(),
>       new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
> );
> Metadata metadata = new Metadata();
> ParseContext context = new ParseContext();
> PDFParserConfig config = new PDFParserConfig();
> PDFParser p;
> config.setExtractInlineImages(true);
> config.setExtractUniqueInlineImagesOnly(false);
> context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
> context.set(org.apache.tika.parser.Parser.class, parser);
>
> InputStream is =
> PdfRecursiveExample.class.getResourceAsStream("/BA200PDE.PDF");
> //parsing the file
> ToXMLContentHandler handler = new ToXMLContentHandler(new
> FileOutputStream(new File("out.txt")), "UTF-8");
> parser.parse(is, handler, metadata, context);
>
> How can I store each image file to a folder?
>
> Thanks
>
> Andrea
>
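
The advice quoted above, copying the embedded InputStream to a file inside parseEmbedded, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the TikaCLI implementation: the class EmbeddedStreamSaver, its save method, and the fallback naming scheme are all hypothetical. In a real EmbeddedDocumentExtractor the suggested name would typically come from the Metadata passed to parseEmbedded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: writes each embedded stream to a file in one folder.
public class EmbeddedStreamSaver {
    private final Path outputDir;
    private int count = 0;

    public EmbeddedStreamSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    // Copies the stream to outputDir and returns the file that was written.
    public Path save(InputStream stream, String suggestedName) throws IOException {
        Files.createDirectories(outputDir);

        // Keep only the last path segment so a name cannot escape outputDir.
        Path fileName = (suggestedName == null || suggestedName.isEmpty())
                ? null
                : Paths.get(suggestedName).getFileName();

        // Fall back to a generated name when none is usable (assumed scheme).
        String name = (fileName == null) ? "file" + count + ".bin" : fileName.toString();
        count++;

        Path target = outputDir.resolve(name);
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        EmbeddedStreamSaver saver = new EmbeddedStreamSaver(Files.createTempDirectory("embedded"));
        byte[] fakeImage = {(byte) 0x89, 'P', 'N', 'G'};
        Path written = saver.save(new ByteArrayInputStream(fakeImage), "image0.png");
        System.out.println("Wrote " + Files.size(written) + " bytes to " + written);
    }
}
```

Inside an extractor, parseEmbedded would simply delegate to something like saver.save(inputStream, name), with shouldParseEmbedded returning true as in the quoted snippet.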
