IIRC (circa 2008), Outlook had its own proprietary notion of message IDs and did
not follow the Internet message format standard. That made threading by ID
pretty much a non-starter in many enterprises. We had some success using Subject
comparisons, since most email clients add a Re: prefix to the subject line of a
reply. Threading with a combination of these two approaches proved to be
reasonably good, though full of exceptions. Good luck.
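
For what it's worth, here is a rough Python sketch of combining the two approaches: try the ID headers first, then fall back to comparing normalized subjects. The function names (normalize_subject, parent_ids, assign_threads) are made up for illustration, the prefix list covers only Re:/Fw:/Fwd:, and the messages are assumed to arrive sorted oldest first. It is not what we actually ran, just the general shape.

```python
import re
from email import message_from_string

# Leading reply/forward markers such as "Re:", "RE:", "Fw:", "Fwd:" (possibly repeated).
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)
_MSG_ID = re.compile(r"<[^<>\s]+>")


def normalize_subject(subject):
    """Drop leading Re:/Fwd: prefixes, collapse whitespace, lower-case."""
    stripped = _PREFIX.sub("", subject or "")
    return " ".join(stripped.split()).lower()


def parent_ids(msg):
    """Ids from In-Reply-To first, then References from newest to oldest."""
    direct = _MSG_ID.findall(msg.get("In-Reply-To", "") or "")
    refs = _MSG_ID.findall(msg.get("References", "") or "")
    return direct + list(reversed(refs))


def assign_threads(messages):
    """Return one thread key per message (assumes oldest-first input order)."""
    thread_of_id = {}       # Message-ID -> thread key
    thread_of_subject = {}  # normalized subject -> thread key
    keys = []
    for index, msg in enumerate(messages):
        key = None
        # 1) Prefer the id headers when they point at a message we have seen.
        for pid in parent_ids(msg):
            if pid in thread_of_id:
                key = thread_of_id[pid]
                break
        # 2) Otherwise fall back to the subject comparison.
        subject = normalize_subject(msg.get("Subject", ""))
        if key is None and subject:
            key = thread_of_subject.get(subject)
        # 3) Otherwise this message starts a new thread.
        if key is None:
            key = index
        message_id = (msg.get("Message-ID") or "").strip()
        if message_id:
            thread_of_id[message_id] = key
        if subject:
            thread_of_subject.setdefault(subject, key)
        keys.append(key)
    return keys


if __name__ == "__main__":
    raw = [
        "Message-ID: <a@x>\nSubject: Mail thread detection\n\nbody",
        "Message-ID: <c@x>\nSubject: RE: Mail thread detection\n\nbody",
    ]
    print(assign_threads([message_from_string(r) for r in raw]))
```

The subject fallback will happily merge unrelated threads that share a generic subject, which is one source of the exceptions mentioned above; a time window on the subject match is an obvious refinement.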

-----Original Message-----
From: Ted Dunning [mailto:[email protected]] 
Sent: Wednesday, August 24, 2011 2:15 PM
To: [email protected]
Subject: Re: Mail thread detection [was Email and Collab. Filtering]

In the olden days, it was possible to thread emails together using message
IDs.

In the modern world of many mailing list portals that don't really do email
in the official ways, this is more difficult than it should be.

Have you tried and failed with message IDs?

On Wed, Aug 24, 2011 at 1:06 PM, Lukáš Vlček <[email protected]> wrote:

> Hi,
>
> I would love to hear more about how exactly you detect (or define) threads
> for emails (for example for Lucene or Solr public mail lists).
>
> As far as I can tell this is quite a complex problem and, based on my
> experience with many web search tools for mailing lists, it is still not
> solved. For thread-based recommendations, important information can be
> missed if a thread is not detected correctly.
> If this has already been solved, please do not hesitate to point me to
> any references.
>
> Regards,
> Lukas
>
> On Mon, Aug 22, 2011 at 4:48 PM, Grant Ingersoll <[email protected]> wrote:
>
> > I'm working on an example (well, examples) of using Mahout with the ASF
> > Public Data Set up on Amazon (
> > http://aws.amazon.com/datasets/7791434387204566) and I wanted to show how
> > to use the 3 "C's" (collab filtering, clustering, classification) with the
> > data set.  Clustering and classification are pretty straightforward, but
> > I'm wondering about the setup around collaborative filtering.
> >
> > The motivation for recommendations is pretty straightforward:  provide
> > people recs on emails that they might find useful based on what other
> > people have interacted with.  The tricky part is I am not totally sure on
> > a valid setup of the problem.  My current thinking is that I build up the
> > rec. matrix based on whether someone has interacted with
> > (initiated/replied) a thread or not.  Thus, the columns are the thread ids
> > and the rows are the users.  Each cell contains the count of the number of
> > times user X has interacted with thread Y.  This feels to me like it is a
> > stand in for that user's preference in that if they are replying multiple
> > times, they have an interest in that topic.  I have no idea if this will
> > be effective or not, but it seems like it could be interesting.  Does it
> > sound reasonable?  I worry that even in a really large data set as above
> > it will simply be too sparse.
> >
> > Is there a better way to think about this from a strict collaborative
> > filtering context?  In other words, I know I could do content-based
> > recommendations but that is not what I am after here.
> >
> > -Grant
> >
> > --------------------------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com
> >
> >
>

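The user-by-thread setup Grant describes in the quoted message (rows are users, columns are thread ids, each cell counts how often a user posted in a thread) could be sketched roughly as below. This is only an illustration in plain Python, not Mahout code: the function names and the (user, thread_id) input format are assumptions, and the thread ids are taken as already assigned by whatever threading step is used. The density helper gives a quick way to check the sparsity concern on real data.

```python
from collections import Counter, defaultdict


def interaction_counts(interactions):
    """Build {user: Counter({thread_id: count})} from (user, thread_id) pairs.

    Each pair stands for one message a user sent in a thread
    (either the initial post or a reply).
    """
    counts = defaultdict(Counter)
    for user, thread_id in interactions:
        counts[user][thread_id] += 1
    return counts


def to_triples(counts):
    """Flatten the non-zero cells into sorted (user, thread_id, count) triples."""
    return sorted(
        (user, thread_id, count)
        for user, row in counts.items()
        for thread_id, count in row.items()
    )


def density(counts):
    """Fraction of user/thread cells that are non-zero."""
    users = len(counts)
    threads = len({t for row in counts.values() for t in row})
    filled = sum(len(row) for row in counts.values())
    return filled / (users * threads) if users and threads else 0.0


if __name__ == "__main__":
    # Hypothetical sample data: alice posted twice in t1, bob once in t1 and t2.
    sample = [("alice", "t1"), ("alice", "t1"), ("bob", "t1"), ("bob", "t2")]
    matrix = interaction_counts(sample)
    for triple in to_triples(matrix):
        print(triple)
    print("density:", density(matrix))
```

Storing only the non-zero cells keeps the representation small even when the full matrix would be very sparse, and (user, item, value) triples are a common interchange format for collaborative filtering tools.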