Please open a ticket on our JIRA and share an example file.  We'll want to 
update our package detector to handle this format.  As for parsing, XML is 
doable, and I'd be happy to try my hand at it...if we can find enough 
examples...  Please no protobufs, please no protobufs... :)

-----Original Message-----
From: Tucker Barbour [mailto:barb...@gmail.com] 
Sent: Thursday, August 10, 2017 5:56 AM
To: user@tika.apache.org
Subject: Outlook For Mac (OLM) Parser?

I have recently encountered a case where I need to parse an Outlook For Mac 
email archive (OLM). I have not found an officially published specification for 
the file format, but after a bit of inspection it appears to be similar to the 
OOXML format. It's a ZIP file containing emails in an XML format and references 
to binary attachments. I was curious if anyone has explored writing a Parser 
for OLM. As expected, the AutoDetectParser detects the Content-Type as 
application/zip and the PackageParser is invoked. This "works", but ideally I 
could parse an OLM like other email archives such as PST or MBOX, where 
embedded content is handled as emails rather than XML. Since the file format is 
similar to OOXML, it might not be too hard to write a parser, but I was curious if 
anyone else might have already done some work in this area.
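
For illustration, here is a minimal sketch of the first step such a parser 
might take, assuming only what is described above: a ZIP container holding XML 
message files plus binary attachments. It uses plain java.util.zip rather than 
any Tika API. The class name OlmEntryLister and the ".xml" extension check are 
assumptions made for this example, not part of Tika or of any published OLM 
specification.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/**
 * Hypothetical helper: walks a ZIP-based archive and separates entries that
 * look like XML (candidate emails) from everything else (candidate attachments).
 */
public class OlmEntryLister {

    /** Entry names grouped by whether they end in ".xml". */
    public static final class Listing {
        public final List<String> xmlEntries = new ArrayList<>();
        public final List<String> otherEntries = new ArrayList<>();
    }

    public static Listing list(InputStream in) throws IOException {
        Listing listing = new Listing();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() skips any unread data of the previous entry.
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                String name = entry.getName();
                // Assumption: message files can be recognised by extension alone.
                if (name.toLowerCase(Locale.ROOT).endsWith(".xml")) {
                    listing.xmlEntries.add(name);
                } else {
                    listing.otherEntries.add(name);
                }
            }
        }
        return listing;
    }
}
```

Calling OlmEntryLister.list(...) on a stream opened over an .olm file would 
give a quick inventory of what the archive contains. A real parser would then 
need to read each XML entry and resolve its attachment references, which 
depends on the actual schema and is not attempted here.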

-Tucker
