Hi Tim,
thanks for your response, but I still can't get a complete solution working.
I've created a class based on the same FileEmbeddedDocumentExtractor from
TikaCLI, and now I'm trying to write a sample main program that parses a PDF
containing some images.
This is my code (below), but no image gets stored, and when I step through
with the debugger the methods of my extractor are never called.
Thanks
Andrea

RecursiveParserWrapper parser = new RecursiveParserWrapper(
        new AutoDetectParser(),
        new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
FileEmbeddedDocumentExtractor extractor = new FileEmbeddedDocumentExtractor();
context.set(FileEmbeddedDocumentExtractor.class, extractor);
PDFParserConfig config = new PDFParserConfig();
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(true);
context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
context.set(org.apache.tika.parser.Parser.class, new AutoDetectParser());
InputStream is = PdfRecursiveExample.class.getResourceAsStream("/my.PDF");
ToXMLContentHandler handler = new ToXMLContentHandler(
        new FileOutputStream(new File("out.txt")), "UTF-8");
parser.parse(is, handler, metadata, context);

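
One likely reason the extractor is never called: ParseContext stores and looks
up entries by the class passed as the key, and as far as I can tell the parsers
ask for EmbeddedDocumentExtractor.class. An entry registered under
FileEmbeddedDocumentExtractor.class would then never be found, so registering
with context.set(EmbeddedDocumentExtractor.class, extractor) is the thing to
try. The sketch below only illustrates that lookup behaviour with a simplified
stand-in; TinyContext, Extractor and FileExtractor are hypothetical names made
up for the illustration, not Tika classes.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextKeyDemo {
    /** Hypothetical stand-in for the extractor interface. */
    interface Extractor {}

    /** Hypothetical stand-in for a concrete extractor. */
    static class FileExtractor implements Extractor {}

    /** Simplified class-keyed context: an entry is found only under its exact key class. */
    static class TinyContext {
        private final Map<String, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key.getName(), value);
        }

        <T> T get(Class<T> key) {
            // cast(null) returns null, so a missing entry simply yields null
            return key.cast(entries.get(key.getName()));
        }
    }

    public static void main(String[] args) {
        TinyContext context = new TinyContext();
        FileExtractor extractor = new FileExtractor();

        // Registered under the concrete class: a lookup by the interface misses it.
        context.set(FileExtractor.class, extractor);
        System.out.println(context.get(Extractor.class) == null);      // prints true

        // Registered under the interface: the same lookup now finds it.
        context.set(Extractor.class, extractor);
        System.out.println(context.get(Extractor.class) == extractor); // prints true
    }
}
```
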
2015-07-06 12:59 GMT+02:00 Allison, Timothy B. <[email protected]>:
> Hi Andrea,
>
> The RecursiveParserWrapper, as you found, is only for extracted content and
> metadata. It was designed to cache metadata and content from embedded
> documents so that you can easily keep those two things together for each
> embedded document.
>
> To extract the raw bytes from embedded files, try implementing an
> EmbeddedDocumentExtractor and passing that into the ParseContext. Take a
> look at
> http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
> and specifically the inner class MyEmbeddedDocumentExtractor for an
> example. As another example, look at
> http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java,
> and specifically the inner class FileEmbeddedDocumentExtractor.
>
> Basically, in parseEmbedded, just copy the InputStream to a
> FileOutputStream, and you should be good to go.
>
> public boolean shouldParseEmbedded(Metadata metadata) {
>     return true;
> }
>
> public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler,
>         Metadata metadata, boolean b) throws SAXException, IOException {
>     // copy inputStream to a FileOutputStream here
> }
>
>
>
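
The copy step described above can be sketched with plain JDK calls. This is
only an illustration of what the body of parseEmbedded might do, under stated
assumptions: the class name EmbeddedFileWriter, the method writeEmbedded and
the counter-based fallback file name are hypothetical, and the Tika interface
itself is left out so that the example runs without Tika on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class EmbeddedFileWriter {
    private final Path outputDir;
    private int count = 0;

    public EmbeddedFileWriter(Path outputDir) {
        this.outputDir = outputDir;
    }

    /** Copies one embedded stream into outputDir and returns the file written. */
    public Path writeEmbedded(InputStream inputStream, String name) throws IOException {
        Files.createDirectories(outputDir);
        // Keep only the last path segment of the supplied name, if there is one.
        Path leaf = (name == null || name.isEmpty()) ? null : Paths.get(name).getFileName();
        // Hypothetical naming scheme: fall back to "file<N>" when no usable name is given.
        String fileName = (leaf != null) ? leaf.toString() : "file" + count;
        count++;
        Path target = outputDir.resolve(fileName);
        // Stream the bytes to disk, overwriting any earlier file with the same name.
        Files.copy(inputStream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("embedded");
        EmbeddedFileWriter writer = new EmbeddedFileWriter(dir);
        // Placeholder bytes standing in for an embedded image stream.
        InputStream in = new ByteArrayInputStream(
                "fake image bytes".getBytes(StandardCharsets.UTF_8));
        Path written = writer.writeEmbedded(in, "image0.png");
        System.out.println("wrote " + written);
    }
}
```

In an actual extractor the name would presumably be taken from the Metadata
object passed to parseEmbedded rather than supplied by hand.
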
> Best,
>
>
>
> Tim
>
>
>
> *From:* Andrea Asta [mailto:[email protected]]
> *Sent:* Monday, July 06, 2015 6:11 AM
> *To:* [email protected]
> *Subject:* Extract PDF inline images
>
>
>
> Hello,
>
> I'm trying to store the inline images from a PDF in a local folder, but I
> can't find a working example. I can use the RecursiveParserWrapper to
> get all the available metadata, but not the binary image content.
>
> This is my code:
>
> RecursiveParserWrapper parser = new RecursiveParserWrapper(
>         new AutoDetectParser(),
>         new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
> );
> Metadata metadata = new Metadata();
> ParseContext context = new ParseContext();
> PDFParserConfig config = new PDFParserConfig();
> PDFParser p;
> config.setExtractInlineImages(true);
> config.setExtractUniqueInlineImagesOnly(false);
> context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
> context.set(org.apache.tika.parser.Parser.class, parser);
>
> InputStream is = PdfRecursiveExample.class.getResourceAsStream("/BA200PDE.PDF");
> // parsing the file
> ToXMLContentHandler handler = new ToXMLContentHandler(
>         new FileOutputStream(new File("out.txt")), "UTF-8");
> parser.parse(is, handler, metadata, context);
>
> How can I store each image file to a folder?
>
> Thanks
>
> Andrea
>