On Thu, 28 Jun 2012, Joe Wicentowski wrote:
I have an update regarding my report about Tika not recognizing the date in Outlook .msg files [1]. I tried using a different tool, ruby-msg (http://code.google.com/p/ruby-msg/), to process the same message as in my earlier email, and ruby-msg did pull out the date [2]. This experiment shows that the date *is* in the .msg file, and that Tika is failing to pick it up.
That suggests that it's stored in a different bit of the file (a different stream) from the ones we're expecting to find it in. The file format is documented, so you can look up what each different bit means, but there are a lot of duplicate fields for historical reasons. What we lack is a guide saying "Outlook 200x stores the sent date as MAPI_???_DATE, while 200y uses OUTLOOK_DATE_MAPI_???_V3".
What'd be great is if you could use org.apache.poi.hsmf.dev.HSMFDump (contained within the poi-scratchpad jar, which depends on the main poi jar but I don't think on anything else) to try to track down which chunk contains the date. You might need to combine that with a little bit of hacking of your ruby script, to have it print some debug logging of which fields it's reading the date from.
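When comparing raw chunk bytes against the date ruby-msg reported, it may help to decode candidate values by hand. The sketch below assumes the date is stored as a MAPI PT_SYSTIME value, i.e. a 64-bit little-endian Windows FILETIME (100 ns ticks since 1601-01-01 UTC). The class and method names are made up for illustration and are not part of POI or Tika.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical helper, not part of POI: decodes a Windows FILETIME value.
public class FiletimeDecoder {
    // Seconds between the FILETIME epoch (1601-01-01) and the Unix epoch (1970-01-01).
    private static final long EPOCH_DIFF_SECONDS = 11_644_473_600L;
    // Number of 100 ns ticks in one second.
    private static final long TICKS_PER_SECOND = 10_000_000L;

    /** Converts a FILETIME tick count to an Instant. */
    public static Instant toInstant(long filetime) {
        long seconds = Math.floorDiv(filetime, TICKS_PER_SECOND) - EPOCH_DIFF_SECONDS;
        long nanos = Math.floorMod(filetime, TICKS_PER_SECOND) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    /** Reads 8 little-endian bytes starting at offset and decodes them as a FILETIME. */
    public static Instant fromBytes(byte[] data, int offset) {
        long ticks = ByteBuffer.wrap(data, offset, 8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
        return toInstant(ticks);
    }

    public static void main(String[] args) {
        // 116444736000000000 ticks is the Unix epoch expressed as a FILETIME.
        System.out.println(toInstant(116_444_736_000_000_000L));
    }
}
```

If an 8-byte value in the dump decodes to the timestamp ruby-msg printed, that is a strong hint you have found the right field.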
Once we know the field, we can look up the details on how it's stored, then add a fallback check of that field/chunk too.
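To make the fallback idea concrete, here's a rough sketch of trying several date-carrying properties in turn. It is not how POI or Tika currently do it: the Map is a hypothetical stand-in for whatever the parser has already extracted, and the candidate list and its order are assumptions. The real entry to add is whichever field the dump turns up.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a fallback lookup; not the POI/Tika implementation.
public class DateFallback {
    // MAPI property IDs that can hold a message timestamp, tried in this order.
    // The order is an assumption made for illustration only.
    static final int[] CANDIDATE_IDS = {
        0x0039, // PR_CLIENT_SUBMIT_TIME
        0x0E06, // PR_MESSAGE_DELIVERY_TIME
        0x3007, // PR_CREATION_TIME
        0x3008  // PR_LAST_MODIFICATION_TIME
    };

    /** Returns the first candidate property present in props, if any. */
    static Optional<Instant> findDate(Map<Integer, Instant> props) {
        for (int id : CANDIDATE_IDS) {
            Instant value = props.get(id);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulate a message where only the creation time was found.
        Map<Integer, Instant> props = new HashMap<>();
        props.put(0x3007, Instant.parse("2012-06-28T09:00:00Z"));
        System.out.println(findDate(props));
    }
}
```

Keeping the candidates in one ordered list means supporting another Outlook version's field is a one-line change.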
Nick
