AiIIIII. That's why Zimbra keeps killing me. When you download from IMAP you get IDs, but I now see that the mail behind them may not carry correct ones.
Subject lines plus text-to-text block comparisons should work pretty well. There will be misfires from copy-and-pastes of parts, and from attribution lines written in different languages (the 'el cabron escribe "date string"' kind).

On Wed, Aug 24, 2011 at 2:34 PM, Jeff Eastman <[email protected]> wrote:

> IIRC (circa 2008) Outlook had its own proprietary notion of message ids and
> did not follow the SMTP standard. That made threading with ids pretty much a
> non-starter in many enterprises. We had some success using Subject
> comparisons, as most email clients add Re: prefixes to the subject line in a
> reply message. Threading using a combination of these two approaches proved
> to be reasonably good, though full of exceptions. Good luck.
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Wednesday, August 24, 2011 2:15 PM
> To: [email protected]
> Subject: Re: Mail thread detection [was Email and Collab. Filtering]
>
> In the olden days, it was possible to thread together message id's in email
> threads.
>
> In the modern world of many mailing list portals that don't really do email
> in the official ways, this is more difficult than it should be.
>
> Have you tried and failed with message id's?
>
> On Wed, Aug 24, 2011 at 1:06 PM, Lukáš Vlček <[email protected]> wrote:
>
>> Hi,
>>
>> I would love to hear more about how exactly you detect (or define) threads
>> for emails (for example for Lucene or Solr public mail lists).
>>
>> As far as I can tell this is quite a complex problem, and based on my
>> experience with many search web tools for mail lists it is still not
>> solved. Speaking about thread-based recommendations, important information
>> can be missed if the thread is not detected correctly.
>> If this has already been solved then please do not hesitate to point me to
>> any references.
>>
>> Regards,
>> Lukas
>>
>> On Mon, Aug 22, 2011 at 4:48 PM, Grant Ingersoll <[email protected]> wrote:
>>
>> > I'm working on an example (well, examples) of using Mahout with the ASF
>> > Public Data Set up on Amazon
>> > (http://aws.amazon.com/datasets/7791434387204566) and I wanted to show how
>> > to use the 3 "C's" (collab filtering, clustering, classification) with the
>> > data set. Clustering and classification are pretty straightforward, but
>> > I'm wondering about the setup around collaborative filtering.
>> >
>> > The motivation for recommendations is pretty straightforward: provide
>> > people recs on emails that they might find useful based on what other
>> > people have interacted with. The tricky part is I am not totally sure on a
>> > valid setup of the problem. My current thinking is that I build up the rec.
>> > matrix based on whether someone has interacted with (initiated/replied) a
>> > thread or not. Thus, the columns are the thread ids and the rows are the
>> > users. Each cell contains the count of the number of times user X has
>> > interacted with thread Y. This feels to me like it is a stand-in for that
>> > user's preference in that if they are replying multiple times, they have an
>> > interest in that topic. I have no idea if this will be effective or not,
>> > but it seems like it could be interesting. Does it sound reasonable? I
>> > worry that even in a really large data set as above it will simply be too
>> > sparse.
>> >
>> > Is there a better way to think about this from a strict collaborative
>> > filtering context? In other words, I know I could do content-based
>> > recommendations but that is not what I am after here.
>> >
>> > -Grant
>> >
>> > --------------------------------------------
>> > Grant Ingersoll
>> > http://www.lucidimagination.com
>> >
>> >

--
Lance Norskog
[email protected]
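
For what it's worth, the combination discussed in this thread (trust message ids where they link up, fall back to Subject comparison where they don't) might be sketched roughly like this in Python. This is only an illustration under assumptions: the names normalize_subject and assign_threads are made up here, messages are assumed to arrive in chronological order, and the text-block comparison step is left out entirely.

```python
import re
from email import message_from_string

# Leading reply/forward prefixes such as "Re:", "RE: Fwd:" (assumed list).
_PREFIX = re.compile(r"^\s*((re|fwd?|aw)\s*:\s*)+", re.IGNORECASE)
# Trailing "[was ...]" note left when someone renames a thread.
_WAS = re.compile(r"\s*\[was[^\]]*\]\s*$", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply prefixes and a trailing [was ...], lowercase, squeeze spaces."""
    subject = _PREFIX.sub("", str(subject or ""))
    subject = _WAS.sub("", subject)
    return " ".join(subject.lower().split())


def assign_threads(messages):
    """Return one integer thread key per message, in input order.

    Message ids are tried first (References, then In-Reply-To); the
    normalized Subject is used only when no referenced id is known.
    """
    by_id = {}       # Message-ID -> thread key
    by_subject = {}  # normalized subject -> thread key
    threads = []
    next_thread = 0
    for msg in messages:
        msg_id = str(msg.get("Message-ID") or "").strip()
        refs = str(msg.get("References") or "").split()
        parent = str(msg.get("In-Reply-To") or "").strip()
        if parent:
            refs.append(parent)
        subject = normalize_subject(msg.get("Subject"))

        # 1. Any referenced id we have already seen decides the thread.
        thread = next((by_id[r] for r in refs if r in by_id), None)
        # 2. Otherwise fall back to the normalized subject line.
        if thread is None and subject:
            thread = by_subject.get(subject)
        # 3. Otherwise this message starts a new thread.
        if thread is None:
            thread = next_thread
            next_thread += 1

        if msg_id:
            by_id[msg_id] = thread
        if subject:
            by_subject.setdefault(subject, thread)
        threads.append(thread)
    return threads


# Tiny made-up mailbox: one proper reply, one reply with no usable ids.
raw = [
    "Message-ID: <a@x>\nSubject: Mail thread detection\n\nfirst",
    "Message-ID: <b@x>\nIn-Reply-To: <a@x>\nSubject: Something else\n\nreply",
    "Message-ID: <c@y>\nSubject: RE: Re: Mail thread detection\n\nno ids",
    "Message-ID: <d@y>\nSubject: Unrelated\n\nnew topic",
]
threads = assign_threads([message_from_string(r) for r in raw])
```

The subject fallback is deliberately greedy, so two unrelated threads that happen to share a subject will be merged; a date window or the text-block comparison would be the obvious place to tighten that.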
