Re: Mail thread detection [was Email and Collab. Filtering]

Grant Ingersoll Fri, 26 Aug 2011 17:41:58 -0700

Which is why Hoss always sends http://people.apache.org/~hossman/#threadhijack 
when that happens.
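
For illustration only, here is a minimal Python sketch of the "message id plus subject" idea from the quoted messages below. It is not how Lucid's mail site or any existing archive works. The Message class, the in_reply_to field and the normalize_subject/assign_threads helpers are hypothetical names made up for this example. A reply stays in its parent's thread only when the normalized subjects also match, so a hijacked reply starts a new thread.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Message:
    # Hypothetical record; real mail would be parsed from RFC 5322 headers.
    message_id: str
    subject: str
    in_reply_to: Optional[str] = None


# Leading reply/forward markers such as "Re:", "RE: Fwd:", "AW:".
_PREFIX = re.compile(r"^\s*((re|fwd?|aw)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip reply/forward prefixes, collapse whitespace, lowercase."""
    stripped = _PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()


def assign_threads(messages: Iterable[Message]) -> Dict[str, str]:
    """Map each message_id to a thread id (the message_id of the thread root).

    Assumes messages arrive in order, so a parent is seen before its replies.
    A reply whose normalized subject differs from its parent's is treated as
    a hijack and becomes the root of a new thread.
    """
    by_id: Dict[str, Message] = {}
    thread_of: Dict[str, str] = {}
    for msg in messages:
        parent = by_id.get(msg.in_reply_to) if msg.in_reply_to else None
        same_subject = (
            parent is not None
            and normalize_subject(parent.subject) == normalize_subject(msg.subject)
        )
        if same_subject:
            thread_of[msg.message_id] = thread_of[parent.message_id]
        else:
            thread_of[msg.message_id] = msg.message_id
        by_id[msg.message_id] = msg
    return thread_of


if __name__ == "__main__":
    mails = [
        Message("<1@x>", "Email and Collab. Filtering"),
        Message("<2@x>", "Re: Email and Collab. Filtering", "<1@x>"),
        Message("<3@x>", "Mail thread detection [was Email and Collab. Filtering]", "<2@x>"),
        Message("<4@x>", "Re: Mail thread detection [was Email and Collab. Filtering]", "<3@x>"),
    ]
    for mid, tid in assign_threads(mails).items():
        print(mid, "->", tid)
```

This deliberately ignores the harder cases mentioned in the thread (cross-posted branches, threads reopened years later, unrelated mails that share a subject), which would need further heuristics.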



On Aug 25, 2011, at 8:48 AM, Jeff Eastman wrote:

> And, people (incl. me) often hit <reply[-all]> to a message to get the to: 
> and cc: fields, add an entirely new subject: and begin an entirely new thread 
> of discussion. Of course, that gets the old messageId and so the new thread 
> will be buried inside some other thread unless subjects are considered too. 
> 
> -----Original Message-----
> From: Lukáš Vlček [mailto:[email protected]] 
> Sent: Thursday, August 25, 2011 5:47 AM
> To: [email protected]
> Subject: Re: Mail thread detection [was Email and Collab. Filtering]
> 
> On Thu, Aug 25, 2011 at 2:40 PM, Grant Ingersoll <[email protected]> wrote:
> 
>> I think on Lucid's mail site, we use a combination of message id, subject
>> and a few other heuristics.  The whole problem gets even more fun when you
>> think about the fact that people can essentially reopen a thread at any
>> point in the future (even years later).
>> 
>> Ironically, this very thread will likely cause problems since it has the
>> same message id, even though the subject line was partially changed.
>> 
> 
> Exactly, done on purpose :-)
> 
> 
>> 
>> On Aug 24, 2011, at 3:15 PM, Lukáš Vlček wrote:
>> 
>>> It is, but since other developers have obviously already been dealing with
>>> this mess (especially thread identification in mail lists), I was hoping that
>>> there would be some knowledge gathered ... maybe it would be worth the
>>> effort to put something together, because this is an important piece of
>>> knowledge that can influence search results, but people (users of search
>>> interfaces) do not usually think about it in detail.
>>> 
>>> On Wed, Aug 24, 2011 at 11:57 PM, Ted Dunning <[email protected]> wrote:
>>> 
>>>> The short conclusion is "people and language are involved, therefore it is
>>>> a bit of a mess".
>>>> 
>>>> 
>>>> 
>>>> On Wed, Aug 24, 2011 at 2:49 PM, Lukáš Vlček <[email protected]> wrote:
>>>> 
>>>>> Yes, it is not always reliable (especially if people reply to the email from
>>>>> desktop email clients and not from the web forum page). But there are more
>>>>> complex problems than this. The two most common problems are thread
>>>>> hijacking and something I call a non-linear mail thread, that is, a case
>>>>> when the email is also sent to a different mail list. For example, the
>>>>> thread starts in Lucene but at some point someone adds the Solr mail
>>>>> list to the To or Cc as well. From this point the thread has two parallel
>>>>> branches (and this is still the simple case).
>>>>> 
>>>>> Experimenting with the mail Subject text is another option, but again one
>>>>> would not believe what kind of cases and exceptions can be found until one
>>>>> tries it. I have seen mails with the same subject, in the same mail list, in
>>>>> about the same time window, involving the same author and the same
>>>>> reply-from person, and they were not in the same thread.
>>>>> 
>>>>> IMHO there is no perfect solution to this problem. Doing a lot of
>>>>> experiments is probably a good way to catch the most common
>>>>> exceptions, but in general it is very hard to avoid these problems. And once
>>>>> you (as a user of a search interface) experience these issues, it can be
>>>>> quite challenging to build trust that things like thread grouping or
>>>>> recommendation work well enough.
>>>>> 
>>>>> On Wed, Aug 24, 2011 at 11:15 PM, Ted Dunning <[email protected]> wrote:
>>>>> 
>>>>>> In the olden days, it was possible to thread together message ids in email
>>>>>> threads.
>>>>>>
>>>>>> In the modern world of many mailing list portals that don't really do email
>>>>>> in the official ways, this is more difficult than it should be.
>>>>>>
>>>>>> Have you tried and failed with message ids?
>>>>>> 
>>>>>> On Wed, Aug 24, 2011 at 1:06 PM, Lukáš Vlček <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I would love to hear more about how exactly you detect (or define) threads
>>>>>>> for emails (for example for Lucene or Solr public mail lists).
>>>>>>>
>>>>>>> As far as I can tell this is quite a complex problem, and based on my
>>>>>>> experience with many search web tools for mail lists it is still not
>>>>>>> solved. Speaking of thread-based recommendations, important information
>>>>>>> can be missed if the thread is not detected correctly.
>>>>>>> If this has already been solved then please do not hesitate to point me to
>>>>>>> any references.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Lukas
>>>>>>> 
>>>>>>> On Mon, Aug 22, 2011 at 4:48 PM, Grant Ingersoll <[email protected]> wrote:
>>>>>>> 
>>>>>>>> I'm working on an example (well, examples) of using Mahout with the ASF
>>>>>>>> Public Data Set up on Amazon (
>>>>>>>> http://aws.amazon.com/datasets/7791434387204566) and I wanted to show how
>>>>>>>> to use the 3 "C's" (collab filtering, clustering, classification) with the
>>>>>>>> data set.  Clustering and classification are pretty straightforward, but
>>>>>>>> I'm wondering about the setup around collaborative filtering.
>>>>>>>> 
>>>>>>>> The motivation for recommendations is pretty straightforward: provide
>>>>>>>> people recs on emails that they might find useful based on what other
>>>>>>>> people have interacted with.  The tricky part is I am not totally sure of a
>>>>>>>> valid setup of the problem.  My current thinking is that I build up the rec.
>>>>>>>> matrix based on whether someone has interacted with (initiated/replied) a
>>>>>>>> thread or not.  Thus, the columns are the thread ids and the rows are the
>>>>>>>> users.  Each cell contains the count of the number of times user X has
>>>>>>>> interacted with thread Y.  This feels to me like it is a stand-in for that
>>>>>>>> user's preference in that if they are replying multiple times, they have an
>>>>>>>> interest in that topic.  I have no idea if this will be effective or not,
>>>>>>>> but it seems like it could be interesting.  Does it sound reasonable?  I
>>>>>>>> worry that even in a really large data set as above it will simply be too
>>>>>>>> sparse.
>>>>>>>> 
>>>>>>>> Is there a better way to think about this from a strict collaborative
>>>>>>>> filtering context?  In other words, I know I could do content-based
>>>>>>>> recommendations but that is not what I am after here.
>>>>>>>> 
>>>>>>>> -Grant
>>>>>>>> 
>>>>>>>> --------------------------------------------
>>>>>>>> Grant Ingersoll
>>>>>>>> http://www.lucidimagination.com
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> 
>> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
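
As a rough sketch of the user-by-thread matrix described in the original message quoted above (rows are users, columns are thread ids, each cell is the number of times a user initiated or replied in a thread), the counts can be kept sparsely in plain Python. The interaction_counts and density helpers and the sample events are hypothetical and only illustrate the setup. Nothing here uses Mahout's API or the real data set.

```python
from collections import Counter
from typing import Iterable, Tuple


def interaction_counts(events: Iterable[Tuple[str, str]]) -> Counter:
    """Count messages per (user, thread_id) pair.

    events holds one (user, thread_id) pair per message sent, whether the
    message started the thread or replied to it. Only non-zero cells of the
    user x thread matrix are stored.
    """
    return Counter(events)


def density(counts: Counter) -> float:
    """Fraction of user x thread cells that are non-zero."""
    users = {user for user, _ in counts}
    threads = {thread for _, thread in counts}
    if not users or not threads:
        return 0.0
    return len(counts) / (len(users) * len(threads))


if __name__ == "__main__":
    # Made-up events, not taken from the ASF data set.
    events = [
        ("grant", "t1"),
        ("lukas", "t2"),
        ("ted", "t2"),
        ("lukas", "t2"),
        ("grant", "t2"),
        ("jeff", "t2"),
    ]
    counts = interaction_counts(events)
    for (user, thread), n in sorted(counts.items()):
        print(user, thread, n)
    print("density:", density(counts))
```

Computing the density on real data is one cheap way to check the sparsity worry before building a recommender on top of the counts.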
