I am using Tika Server (TikaJaxRs) for text extraction.
I also need to extract the attachments embedded in a file and save them to 
disk in their native format.
I was able to do this with a CustomParser that writes each attachment to 
disk from the 'stream' passed to its parse method.
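
A minimal sketch of that disk-writing step, using only the JDK. The class 
name AttachmentSaver and its save helper are hypothetical and not part of 
Tika; in a custom parser, the 'stream' argument of parse would be passed in 
as the first parameter. The returned byte count makes an empty stream easy 
to spot.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not a Tika class: copies an attachment stream to disk.
class AttachmentSaver {

    /** Copies the stream to outputDir and returns the number of bytes written. */
    static long save(InputStream stream, Path outputDir, String fileName) throws IOException {
        // Use only the last path segment of the attachment name, with a
        // fallback when the name is missing (the fallback name is an assumption).
        Path last = (fileName == null) ? null : Paths.get(fileName).getFileName();
        String safeName = (last == null || last.toString().isEmpty())
                ? "attachment.bin"
                : last.toString();

        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(safeName);

        // Files.copy reads the stream to its end and reports the bytes copied.
        return Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("attachments");

        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        long written = save(new ByteArrayInputStream(content), dir, "note.txt");
        System.out.println("note.txt: " + written + " bytes");

        // An empty stream still creates the file, but with zero bytes in it.
        long empty = save(new ByteArrayInputStream(new byte[0]), dir, "mail.msg");
        if (empty == 0) {
            System.out.println("mail.msg: stream was empty, file is 0 bytes");
        }
    }
}
```

Checking the return value of save in the parser is one way to log which 
attachments arrive with an empty stream instead of finding 0 KB files later.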

Here is the post I used as a reference for building the CustomParser:
http://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika

This works fine as long as the attachment is anything but an Outlook msg 
file.

I am running into an issue when the attachment is an Outlook msg file.
When CustomParser.parse is invoked, the stream passed to it is empty, so 
the file that gets written to disk is always 0 KB.

Digging through the code, I noticed that in OutlookExtractor.java the 
attachment is handled by OfficeParser, because attachment.attachData is 
always null when the attachment is an Outlook msg. That is the path on 
which an empty stream is always sent to CustomParser.

The snippet from OutlookExtractor at the end of this message shows where it 
iterates through the attachment files and calls handleEmbeddedResource only 
when attachment.attachData is not null.
Since attachData is always null for an Outlook msg attachment, that call is 
skipped and the stream is empty by the time the request is delegated to 
CustomParser.parse.

Can someone please tell me how I can access the msg attachment and save it 
to disk in its native format?

for (AttachmentChunks attachment : msg.getAttachmentFiles()) {
    xhtml.startElement("div", "class", "attachment-entry");

    String filename = null;
    if (attachment.attachLongFileName != null) {
        filename = attachment.attachLongFileName.getValue();
    } else if (attachment.attachFileName != null) {
        filename = attachment.attachFileName.getValue();
    }
    if (filename != null && filename.length() > 0) {
        xhtml.element("h1", filename);
    }
    if (attachment.attachData != null) {
        handleEmbeddedResource(
            TikaInputStream.get(attachment.attachData.getValue()),
            filename,
            null, xhtml, true
        );
    }
    if (attachment.attachmentDirectory != null) {
        handleEmbededOfficeDoc(
            attachment.attachmentDirectory.getDirectory(),
            xhtml
        );
    }
    xhtml.endElement("div");
}


Thanks
-AarKay
