AarKay,

  We have a unit test for an MSG embedded within an MSG in 
POIContainerExtractionTest.  I also just tried a newly created msg within an 
msg file, and I can extract the embedded content with 
TikaTest.RecursiveMetaParser.  This suggests that the issue is not within the 
OutlookParser.

  If you want the bytes of the embedded file, have you tried (or are you using) 
the Unpacker Resource?  IIRC, this gets the attachments (non-recursively!!!) 
out of each doc you send it and sends you back a zip (or tar).  You should be 
able to step through the ZipEntr(ies) and get the original attachment bytes.
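
  To illustrate that last step, here is a minimal Java sketch of stepping
through the entries of a zip stream and collecting each entry's bytes. It
uses only java.util.zip. The class name, the sampleZip() helper and the entry
names are made up for illustration: the in-memory zip stands in for whatever
the Unpacker Resource sends back, and fetching the real response from the
server is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UnpackZipSketch {

    // Reads every file entry of a zip stream into memory, keyed by entry name.
    public static Map<String, byte[]> readEntries(InputStream zipStream) throws IOException {
        Map<String, byte[]> attachments = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                attachments.put(entry.getName(), out.toByteArray());
            }
        }
        return attachments;
    }

    // Hypothetical stand-in for the server response: a small zip built in memory.
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("inner.msg"));
            zip.write("fake msg bytes".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("notes.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> attachments =
                readEntries(new ByteArrayInputStream(sampleZip()));
        for (Map.Entry<String, byte[]> e : attachments.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().length + " bytes");
        }
    }
}
```

  In real use you would pass the response body's InputStream to readEntries()
and write each byte[] to disk (for example with java.nio.file.Files.write)
instead of holding everything in memory.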

       Best,

                 Tim
  

-----Original Message-----
From: AarKay [mailto:ksu.wildc...@gmail.com] 
Sent: Thursday, July 31, 2014 12:30 AM
To: user@tika.apache.org
Subject: Tika - Outlook msg file with another Outlook msg as an attachment - 
OutlookExtractor passes empty stream

I am using Tika Server (TikaJaxRs) for text extraction.
I also need to extract the attachments in a file and save them to disk in 
their native format.
I was able to do this with a CustomParser that writes the file to disk using 
the 'stream' passed to its parse method.

Here is the post I used as a reference for building CustomParser.
http://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika

I was able to get it to work fine if the attachment is anything but an 
Outlook msg file.

I run into an issue when the attachment is an Outlook msg file.
When the CustomParser.parse method is invoked, the stream passed to it is 
empty, so the file written to disk is always 0 KB.

Digging through the code, I noticed that in the OutlookExtractor.java class 
the attachment is handled by OfficeParser, because attachment.attachData is 
always null when the attachment is an Outlook msg, and that is where it 
always sends an empty stream to CustomParser.

Here is the snippet of code from OutlookExtractor where it iterates through 
the attachment files and calls the handleEmbeddedResource method only when 
attachment.attachData is not null.
But attachment.attachData is always null if the attachment is an Outlook msg, 
so the stream is always empty when the request is delegated to the 
CustomParser.parse method.

Can someone please tell me how I can access the msg attachment and save it 
to disk in its native format?

for (AttachmentChunks attachment : msg.getAttachmentFiles()) {
    xhtml.startElement("div", "class", "attachment-entry");

    String filename = null;
    if (attachment.attachLongFileName != null) {
        filename = attachment.attachLongFileName.getValue();
    } else if (attachment.attachFileName != null) {
        filename = attachment.attachFileName.getValue();
    }
    if (filename != null && filename.length() > 0) {
        xhtml.element("h1", filename);
    }
    if (attachment.attachData != null) {
        handleEmbeddedResource(
            TikaInputStream.get(attachment.attachData.getValue()),
            filename,
            null, xhtml, true
        );
    }
    if (attachment.attachmentDirectory != null) {
        handleEmbededOfficeDoc(
            attachment.attachmentDirectory.getDirectory(),
            xhtml
        );
    }
    xhtml.endElement("div");
}


Thanks
-AarKay
