AarKay,
We have a unit test for an MSG embedded within an MSG in
POIContainerExtractionTest. I also just tried a newly created msg within an
msg file, and I can extract the embedded content with
TikaTest.RecursiveMetaParser. This suggests that the issue is not within the
OutlookParser.
If you want the bytes of the embedded file, have you tried (or are you using)
the Unpacker Resource? IIRC, this gets the attachments (non-recursively!!!)
out of each doc you send it and sends you back a zip (or tar). You should be
able to step through the ZipEntr(ies) and get the original attachment bytes.
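
Stepping through the entries could look roughly like the sketch below. It
uses only java.util.zip and assumes you already have the zip response as an
InputStream; how you fetch it from the server is not shown. The class and
method names (UnpackedZipReader, readEntries) are made up for illustration
and are not part of Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not a Tika class.
public class UnpackedZipReader {

    /** Reads every file entry of a zip stream into memory, keyed by entry name. */
    public static Map<String, byte[]> readEntries(InputStream zipStream) throws IOException {
        Map<String, byte[]> attachments = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // only file entries carry attachment bytes
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int read;
                // read() returns -1 at the end of the current entry
                while ((read = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                attachments.put(entry.getName(), out.toByteArray());
            }
        }
        return attachments;
    }
}
```

Each value in the returned map can then be written to disk under its entry
name. For large attachments you would probably stream each entry straight to
a file instead of buffering it in memory.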
Best,
Tim
-----Original Message-----
From: AarKay [mailto:[email protected]]
Sent: Thursday, July 31, 2014 12:30 AM
To: [email protected]
Subject: Tika - Outlook msg file with another Outlook msg as an attachment -
OutlookExtractor passes empty stream
I am using Tika Server (TikaJaxRs) for text extraction needs.
I also need to extract the attachments in the file and save them to disk in
their native format.
I was able to do this with a CustomParser that writes the file to disk using
the 'stream' passed to its parse method.
Here is the post I used as a reference for building CustomParser.
http://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika
I was able to get it to work fine if the attachment is anything but an
Outlook msg file.
I am running into an issue when the attachment is an Outlook msg file.
When the CustomParser.parse method gets invoked, the stream passed to it is
empty, so the file that is written to disk is always 0 KB.
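
For reference, the save step can be sketched with the JDK alone. This is a
minimal illustration, not the actual CustomParser: AttachmentSaver and save
are hypothetical names, and the stream is assumed to be whatever gets handed
to the parse method. Returning the byte count makes the empty-stream case
easy to spot.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not a Tika class.
public class AttachmentSaver {

    /** Copies the stream to the target path and returns the number of bytes written. */
    public static long save(InputStream stream, Path target) throws IOException {
        long written = Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        if (written == 0) {
            // An empty stream produces the 0 KB file described above.
            System.err.println("Warning: empty stream, wrote 0 bytes to " + target);
        }
        return written;
    }
}
```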
Digging through the code, I noticed that in the OutlookExtractor.java class
the attachment is handled by OfficeParser, because attachment.attachData is
always null when the attachment is an Outlook msg, and that is where it
always sends an empty stream to CustomParser.
Here is the snippet of code from OutlookExtractor where it iterates through
the attachment files and calls the handleEmbeddedResource method only when
attachment.attachData is not null.
But attachData is always null if the attachment is an Outlook msg, so the
stream is always empty when the request is delegated to the
CustomParser.parse method.
Can someone please tell me how I can access the msg attachment and save it
to disk in its native format?
for (AttachmentChunks attachment : msg.getAttachmentFiles()) {
    xhtml.startElement("div", "class", "attachment-entry");
    String filename = null;
    if (attachment.attachLongFileName != null) {
        filename = attachment.attachLongFileName.getValue();
    } else if (attachment.attachFileName != null) {
        filename = attachment.attachFileName.getValue();
    }
    if (filename != null && filename.length() > 0) {
        xhtml.element("h1", filename);
    }
    if(attachment.attachData != null) {
        handleEmbeddedResource(
            TikaInputStream.get(attachment.attachData.getValue()),
            filename,
            null, xhtml, true
        );
    }
    if(attachment.attachmentDirectory != null) {
        handleEmbededOfficeDoc(
            attachment.attachmentDirectory.getDirectory(),
            xhtml
        );
    }
    xhtml.endElement("div");
}
Thanks
-AarKay