Thanks Harsh! Looks like something that might be useful! I appreciate it!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Tue, Dec 31, 2013 at 1:08 AM, Harsh J <[email protected]> wrote:

> Hey Devin,
>
> Are you perhaps looking for http://james.apache.org/mime4j/? You may have
> to adapt it for MR but I don't imagine that would be too difficult to do.
>
> On Mon, Dec 30, 2013 at 11:59 PM, Devin Suiter RDX <[email protected]> wrote:
>
>> Hi,
>>
>> I am trying to puzzle this out, and am hoping for some insight - I have
>> an IMAP inbox dump that I am analyzing - I need to track how many times a
>> given item is referred to in the inbox, i.e. how many emails came in about
>> that thing and over what time. I can load it into MapReduce as
>> TextInputFormat and parse it properly, and have managed to crudely
>> concatenate lines that represent an email together as my final output, so,
>> basically, it is working now, but my program is seeing each line as an
>> InputSplit, and so it is only working reliably with one InputFileSplit.
>> If I had a bigger file, with multiple InputFileSplits presenting
>> line-by-line InputSplits, I have no way to be sure that the lines that make
>> one email will not end up in two different splits - does that make sense?
>>
>> Someone I work with suggested that I attempt to read each email as a
>> record, since they have their MIME encoding intact in the text dump, rather
>> than each line as a record.
>>
>> Does anyone know of a MIME MapReduce input type? I can't be sure this
>> will help anyway, since the file is already text-encoded - I may have to
>> get the email from the original inbox as individual messages somehow to
>> utilize the MIME header information.
>>
>> Googling this has been challenging, mainly because the words you have to
>> use are somewhat overloaded - but I am finding some good clown schools in
>> my research...so, any help is appreciated.
>>
>> Thanks!
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>
> --
> Harsh J
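
For anyone following along, the "one email per record" idea from the quoted question could be sketched roughly as below. This is plain Java with no Hadoop dependency, and it is only an illustration: the class name `MessageGrouper` and its methods are hypothetical, and it assumes every message in the dump begins with an mbox-style `From ` separator line, which the thread does not confirm. A real dump may need a different boundary test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: groups the lines of a text mail dump into whole messages. */
public class MessageGrouper {

    // Assumption: each message starts with an mbox-style "From " separator line.
    public static boolean isMessageStart(String line) {
        return line.startsWith("From ");
    }

    /** Returns one string per message; lines before the first separator are skipped. */
    public static List<String> group(List<String> lines) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null; // null until the first separator is seen
        for (String line : lines) {
            if (isMessageStart(line)) {
                if (current != null) {
                    messages.add(current.toString()); // close the previous message
                }
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString()); // close the final message
        }
        return messages;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the actual inbox dump.
        List<String> lines = Arrays.asList(
            "From alice@example.com Mon Dec 30 23:59:00 2013",
            "Subject: item 42",
            "",
            "First body line",
            "From bob@example.com Tue Dec 31 01:08:00 2013",
            "Subject: Re: item 42",
            "",
            "Second body line");
        List<String> messages = group(lines);
        System.out.println(messages.size() + " messages");
    }
}
```

Skipping lines that precede the first separator mirrors how Hadoop's line reader treats a split that begins mid-line: the reader for the previous split reads past its end to finish its last record, and the next reader starts at the first full record. A custom record reader built on this idea would presumably need the same convention, and each grouped message could then be handed to a MIME parser such as mime4j.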
