RE: Extract PDF inline images

Allison, Timothy B. Tue, 07 Jul 2015 07:15:46 -0700

Andrea,
  I’m about to commit an example (see TIKA-1674).  In about 10 minutes, look 
for org.apache.tika.example.ExtractEmbeddedFiles in the tika-examples module.
  I’m still a bit stumped though on why my example isn’t working recursively.  
It is only pulling out the children of the input document.  Stay tuned to 
TIKA-1674 for follow up on that.

           Best,

                          Tim

From: Andrea Asta [mailto:[email protected]]
Sent: Tuesday, July 07, 2015 6:22 AM
To: [email protected]
Subject: Re: Extract PDF inline images

Hi Tim,
thanks for your response, but I can't find a complete solution.
I've created a class using the same FileEmbeddedDocumentExtractor from TikaCLI, 
and now I'm trying to do a sample main program with a PDF containing some 
images.
This is my code, but no images are stored, and when I step through with the 
debugger the methods of the extractor are never called.
Thanks
Andrea

RecursiveParserWrapper parser = new RecursiveParserWrapper(
      new AutoDetectParser(),
      new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();

FileEmbeddedDocumentExtractor extractor = new FileEmbeddedDocumentExtractor();
context.set(FileEmbeddedDocumentExtractor.class, extractor);

PDFParserConfig config = new PDFParserConfig();
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(true);
context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);

context.set(org.apache.tika.parser.Parser.class, new AutoDetectParser());

InputStream is = PdfRecursiveExample.class.getResourceAsStream("/my.PDF");
ToXMLContentHandler handler = new ToXMLContentHandler(
      new FileOutputStream(new File("out.txt")), "UTF-8");
parser.parse(is, handler, metadata, context);
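
One thing worth checking in the snippet above is the key passed to context.set(...). As far as I know, ParseContext stores each object under the class used as the key, and parsers look up the embedded-document handler under the EmbeddedDocumentExtractor interface rather than under a concrete class. If that holds, an extractor registered under FileEmbeddedDocumentExtractor.class would never be found. The sketch below is a simplified, JDK-only stand-in for that lookup behaviour; Context, Extractor, and FileExtractor are hypothetical names, not Tika classes.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of a class-keyed context. This is NOT Tika's ParseContext;
// it only illustrates why the registration key matters.
class ContextKeyDemo {
    interface Extractor {}                              // stands in for the interface
    static class FileExtractor implements Extractor {}  // stands in for the concrete class

    static class Context {
        private final Map<String, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key.getName(), value);
        }

        <T> T get(Class<T> key) {
            // Returns null when nothing was stored under exactly this key.
            return key.cast(entries.get(key.getName()));
        }
    }

    public static void main(String[] args) {
        Context context = new Context();

        // Registered under the concrete class: a lookup by interface misses it.
        context.set(FileExtractor.class, new FileExtractor());
        System.out.println("found by interface: " + (context.get(Extractor.class) != null));

        // Registered under the interface: the same lookup now succeeds.
        context.set(Extractor.class, new FileExtractor());
        System.out.println("found by interface: " + (context.get(Extractor.class) != null));
    }
}
```

Under that assumption, the corresponding change in the snippet would be to register the extractor with EmbeddedDocumentExtractor.class as the key.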

2015-07-06 12:59 GMT+02:00 Allison, Timothy B. <[email protected]>:
Hi Andrea,

  The RecursiveParserWrapper, as you found, is only for extracted content and 
metadata.   It was designed to cache metadata and content from embedded 
documents so that you can easily keep those two things together for each 
embedded document.

  To extract the raw bytes from embedded files, try implementing an 
EmbeddedDocumentExtractor and passing that into the ParseContext.  Take a look at 
http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
and specifically the inner class MyEmbeddedDocumentExtractor for an example.  
As another example, look at 
http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
and specifically the inner class FileEmbeddedDocumentExtractor.

Basically, in parseEmbedded, just copy the InputStream to a FileOutputStream, 
and you should be good to go.

public boolean shouldParseEmbedded(Metadata metadata) {
    return true;
}

public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler,
        Metadata metadata, boolean b) throws SAXException, IOException {
    // Copy inputStream to a FileOutputStream here.
}
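
The copy step described above can be sketched with the JDK alone. This is a minimal illustration, not the code from TikaCLI or UnpackerResource: EmbeddedFileSaver, the numbered file names, and the output directory are all hypothetical. In a real extractor, the body of save(...) would sit inside parseEmbedded(...), and the file name or extension might be derived from the Metadata instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika: it models only the file-copy step
// that an EmbeddedDocumentExtractor would perform for each embedded document.
class EmbeddedFileSaver {
    private final Path outputDir;
    private int count = 0;

    EmbeddedFileSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    // Copies the stream's remaining bytes to a new numbered file and returns its path.
    // The stream is left open; the caller that supplied it stays responsible for it.
    Path save(InputStream inputStream, String extension) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve("embedded-" + (count++) + extension);
        Files.copy(inputStream, target); // throws if the target already exists
        return target;
    }

    public static void main(String[] args) throws IOException {
        EmbeddedFileSaver saver = new EmbeddedFileSaver(Files.createTempDirectory("embedded-out"));
        Path saved = saver.save(new ByteArrayInputStream(new byte[] {1, 2, 3}), ".bin");
        System.out.println(saved + " (" + Files.size(saved) + " bytes)");
    }
}
```

A counter is used for the names here only because embedded images often have no usable file name of their own; any scheme that yields unique names would do.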

      Best,

                   Tim

From: Andrea Asta [mailto:[email protected]]
Sent: Monday, July 06, 2015 6:11 AM
To: [email protected]
Subject: Extract PDF inline images

Hello,
I'm trying to store the inline images from a PDF to a local folder, but can't 
find any valid example. I can only use the RecursiveParserWrapper to get all 
the available metadata, but not the binary image content.
This is my code:

RecursiveParserWrapper parser = new RecursiveParserWrapper(
      new AutoDetectParser(),
      new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
PDFParser p;
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(false);
context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
context.set(org.apache.tika.parser.Parser.class, parser);

InputStream is = PdfRecursiveExample.class.getResourceAsStream("/BA200PDE.PDF");
//parsing the file
ToXMLContentHandler handler = new ToXMLContentHandler(
      new FileOutputStream(new File("out.txt")), "UTF-8");
parser.parse(is, handler, metadata, context);

How can I store each image file to a folder?
Thanks
Andrea
