Hi, I am trying to puzzle this out and am hoping for some insight. I have an IMAP inbox dump that I am analyzing, and I need to track how many times a given item is referred to in the inbox, i.e. how many emails came in about that thing and over what period of time. I can load the dump into MapReduce with TextInputFormat and parse it properly, and I have managed to crudely concatenate the lines that make up each email in my final output. So, basically, it is working now, but my program treats each line as a separate record, so it only works reliably when the whole file is a single input split. With a bigger file divided into multiple splits, each presenting its records line by line, I have no way to be sure that the lines that make up one email will not end up in two different splits. Does that make sense?
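To make the goal concrete, here is a minimal sketch in plain Java (no Hadoop classes) of the grouping I am after. It assumes every message begins with an mbox-style separator line starting with "From ", which may not match my dump. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: groups lines of a text dump into whole messages.
class EmailRecordSketch {

    // ASSUMPTION: a message starts at an mbox-style separator line beginning with "From ".
    static boolean startsMessage(String line) {
        return line.startsWith("From ");
    }

    // Returns one string per message, each containing its separator line,
    // headers, and body, with lines terminated by '\n'.
    static List<String> groupIntoMessages(List<String> lines) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null; // null until the first separator is seen

        for (String line : lines) {
            if (startsMessage(line)) {
                // A new message begins: flush the previous one, if any.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            }
            // Lines before the first separator belong to no message and are dropped.
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        // Flush the last message.
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }
}
```

Over a whole file this grouping is easy. What I cannot see is how to get the same result when the file is cut into several splits and a message straddles the boundary between two of them.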
Someone I work with suggested that I try reading each email as a record, rather than each line, since the messages still have their MIME encoding intact in the text dump. Does anyone know of a MIME input format for MapReduce? I can't be sure this will help anyway, since the file is already text-encoded. I may have to pull the emails from the original inbox as individual messages somehow in order to use the MIME header information. Googling this has been challenging, mainly because the words you have to use are somewhat overloaded, but I am finding some good clown schools in my research... so, any help is appreciated.

Thanks!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
