* Nick Burch <[email protected]> [2018-09-05 07:36:46 +0100]:
On Tue, 4 Sep 2018, Tucker Barbour wrote:
I've exported a GMail archive in MBOX format using
takeout.google.com. The MBOX archive also includes GChat messages.
However, the GChat messages do not include a Date header. Instead
the date sent is included in what appears to be a non-conforming
RFC822 header which the tika mbox parser does not recognize.
As a user of Tika, were you expecting these to show up as additional
emails in the mbox, or something else?
For my use-case I care about the metadata and body content. Ultimately, the metadata and
body content end up in a search engine. So whether they are actually treated as emails or
not doesn't really matter that much to me. Ideally, I should be able to determine the
difference between a gchat message and an email. Maybe the presence of the X-GM-THRID
header? In the case of the exported gchat messages, the metadata that's relevant to my
use-case is the thread id, From, and Date headers. Tika gets most of the metadata I care
about except for the sent time. The additional From line seems to be the issue:
"From 1558692903658457318@xxx Tue Feb 07 16:36:29 +0000 2017". Body content is
properly sent to the ContentHandler.
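
To illustrate what I mean, here is a minimal sketch (in Python rather than Tika's Java, purely for brevity) of pulling the sent time out of that separator line. The function name parse_from_line is hypothetical, and I am assuming every gchat separator uses the same "sender, then day/month/date/time/offset/year" layout as the one example above; I have not checked that against a full export.

```python
from datetime import datetime

# Assumed layout, taken from the single example line in this thread:
#   From <sender> <Day> <Mon> <DD> <HH:MM:SS> <+ZZZZ> <YYYY>
FROM_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_from_line(line):
    """Split an mbox "From " separator line into (sender, sent_time).

    Hypothetical helper: raises ValueError if the line is not a separator
    or if the date part does not match the assumed layout.
    """
    if not line.startswith("From "):
        raise ValueError("not an mbox separator line")
    # The sender token contains no spaces, so everything after it is the date.
    _, sender, date_text = line.rstrip("\r\n").split(" ", 2)
    sent_time = datetime.strptime(date_text.strip(), FROM_DATE_FORMAT)
    return sender, sent_time


if __name__ == "__main__":
    sender, sent_time = parse_from_line(
        "From 1558692903658457318@xxx Tue Feb 07 16:36:29 +0000 2017"
    )
    print(sender, sent_time.isoformat())
```

Something along these lines could serve as a fallback date when a message has no Date header, though whether that belongs in the mbox parser or in a separate one is an open question.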
(The underlying library may not give us a choice, I haven't dug in
enough recently to remember, but in case it does, user expectations
are of interest!)
I'm wondering if anyone has any experience extracting metadata from
Gmail exports, specifically gchat messages. Any help or guidance
would be appreciated.
Any chance you could share / produce a small mbox file, with a handful
of both real emails and these gchat messages in, so we can take a
look? If you could open a bug in jira, and attach the small mbox file,
that'd be great.
I can spend some time cleaning up a data set for testing and will submit a JIRA ticket. In the meantime, I might explore an additional parser and a custom-mimetypes.xml.
Nick