Hey Devin,

Are you perhaps looking for http://james.apache.org/mime4j/? You may have to adapt it for MapReduce, but I don't imagine that would be too difficult to do.
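
To make the "one email per record" idea concrete, here is a minimal plain-Java sketch of the grouping step, with no Hadoop or mime4j dependency. It rests on an assumption that may not hold for your dump: that the file is mbox-like, so every message begins with a separator line starting with "From " (with a trailing space). MessageGrouper, isMessageStart and groupMessages are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop or mime4j.
class MessageGrouper {

    // ASSUMPTION: the dump is mbox-like, so a new message starts with a line
    // beginning with "From " (trailing space), which is distinct from the
    // "From:" header line inside a message.
    static boolean isMessageStart(String line) {
        return line.startsWith("From ");
    }

    // Groups the lines of a text dump into one string per message.
    // Lines that appear before the first separator are ignored.
    static List<String> groupMessages(String dump) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : dump.split("\\r?\\n")) {
            if (isMessageStart(line)) {
                // Close the previous message, if any, and start a new one.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        // Flush the last message once the input ends.
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        String dump =
            "From alice@example.com Mon Dec 30 10:00:00 2013\n"
            + "Subject: widget order\n\nFirst body line\nSecond body line\n"
            + "From bob@example.com Mon Dec 30 11:00:00 2013\n"
            + "Subject: widget order again\n\nAnother body\n";
        List<String> messages = groupMessages(dump);
        System.out.println("Messages found: " + messages.size());
    }
}
```

In a MapReduce job the same boundary test would presumably live in a custom RecordReader, so that each record handed to the mapper is a whole message; that reader would also need to deal with a message that straddles a split boundary. Each grouped message could then be passed to a MIME parser such as mime4j to get at the headers.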

On Mon, Dec 30, 2013 at 11:59 PM, Devin Suiter RDX <[email protected]> wrote:
> Hi,
>
> I am trying to puzzle this out, and am hoping for some insight - I have an
> IMAP inbox dump that I am analyzing - I need to track how many times a
> given item is referred to in the inbox, i.e. how many emails came in about
> that thing and over what time. I can load it into MapReduce as
> TextInputFormat and parse it properly, and have managed to crudely
> concatenate lines that represent an email together as my final output, so,
> basically, it is working now, but my program is seeing each line as an
> InputSplit, and so it is only working reliably with one InputFileSplit.
> If I had a bigger file, with multiple InputFileSplits presenting
> line-by-line InputSplits, I have no way to be sure that the lines that make
> one email will not end up in two different splits - does that make sense?
>
> Someone I work with suggested that I attempt to read each email as a
> record, since they have their MIME encoding intact in the text dump, rather
> than each line as a record.
>
> Does anyone know of a MIME MapReduce input type? I can't be sure this will
> help anyway, since the file is already text-encoded - I may have to get the
> email from the original inbox as individual messages somehow to utilize the
> MIME header information.
>
> Googling this has been challenging, mainly because the words you have to
> use are somewhat overloaded - but I am finding some good clown schools in
> my research...so, any help is appreciated.
>
> Thanks!
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>

--
Harsh J
