Which is why Hoss always sends http://people.apache.org/~hossman/#threadhijack when that happens.

On Aug 25, 2011, at 8:48 AM, Jeff Eastman wrote:

> And, people (incl. me) often hit <reply[-all]> to a message to get the to:
> and cc: fields, add an entirely new subject: and begin an entirely new
> thread of discussion. Of course, that gets the old messageId and so the new
> thread will be buried inside some other thread unless subjects are
> considered too.
>
> -----Original Message-----
> From: Lukáš Vlček [mailto:[email protected]]
> Sent: Thursday, August 25, 2011 5:47 AM
> To: [email protected]
> Subject: Re: Mail thread detection [was Email and Collab. Filtering]
>
> On Thu, Aug 25, 2011 at 2:40 PM, Grant Ingersoll <[email protected]> wrote:
>
>> I think on Lucid's mail site, we use a combination of message id, subject
>> and a few other heuristics. The whole problem gets even more fun when you
>> think about the fact that people can essentially reopen a thread at any
>> point in the future (even years later).
>>
>> Ironically, this very thread will likely cause problems since it has the
>> same message id, even though the subject line was partially changed.
>
> Exactly, done on purpose :-)
>
>> On Aug 24, 2011, at 3:15 PM, Lukáš Vlček wrote:
>>
>>> It is, but since obviously other developers have already been dealing
>>> with this mess (especially thread identification in mail lists) I was
>>> hoping that there would be some knowledge gathered ... maybe it would be
>>> worth the effort to put something together because this is an important
>>> piece of knowledge that can influence search results but people (users
>>> of search interfaces) do not usually think about it in detail.
>>>
>>> On Wed, Aug 24, 2011 at 11:57 PM, Ted Dunning <[email protected]> wrote:
>>>
>>>> The short conclusion is "people and language are involved, therefore it
>>>> is a bit of a mess".
>>>>
>>>> On Wed, Aug 24, 2011 at 2:49 PM, Lukáš Vlček <[email protected]> wrote:
>>>>
>>>>> Yes, it is not always reliable (especially if ppl reply to the email
>>>>> from desktop email clients and not from the web forum page). But there
>>>>> are more complex problems than this. The two most common problems are
>>>>> thread hijacking and something I call a non-linear mail thread, that
>>>>> is, a case when the email is also resent to a different mail list. For
>>>>> example, the thread starts in Lucene but at some point in time someone
>>>>> adds the Solr mail list to the To or Cc as well. From this point the
>>>>> thread has two parallel branches (and still this is the simple case).
>>>>>
>>>>> Experimenting with mail Subject text is another option but again one
>>>>> would not believe what kind of cases or exceptions can be found until
>>>>> he tries it. I have seen mails with the same subject, in the same mail
>>>>> list, in about the same time window, involving the same author and the
>>>>> same reply-from person and they were not in the same thread.
>>>>>
>>>>> IMHO I do not think there is any perfect solution to this problem.
>>>>> Doing a lot of experiments is probably a good way to catch the most
>>>>> common exceptions but in general it is very hard to avoid these
>>>>> problems. And once you (as a user of a search interface) experience
>>>>> these issues it can be quite challenging to build trust that things
>>>>> like thread grouping or recommendation work well enough.
>>>>>
>>>>> On Wed, Aug 24, 2011 at 11:15 PM, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> In the olden days, it was possible to thread together message id's in
>>>>>> email threads.
>>>>>>
>>>>>> In the modern world of many mailing list portals that don't really do
>>>>>> email in the official ways, this is more difficult than it should be.
>>>>>>
>>>>>> Have you tried and failed with message id's?
>>>>>>
>>>>>> On Wed, Aug 24, 2011 at 1:06 PM, Lukáš Vlček <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I would love to hear more about how exactly you detect (or define)
>>>>>>> threads for emails (for example for Lucene or Solr public mail
>>>>>>> lists).
>>>>>>>
>>>>>>> As far as I can tell this is quite a complex problem and based on my
>>>>>>> experience with many search web tools for mail lists this is still
>>>>>>> not solved. Speaking about thread-based recommendations, important
>>>>>>> information can be missed if the thread is not detected correctly.
>>>>>>> If this has already been solved then please do not hesitate to point
>>>>>>> me to any references.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Lukas
>>>>>>>
>>>>>>> On Mon, Aug 22, 2011 at 4:48 PM, Grant Ingersoll <[email protected]> wrote:
>>>>>>>
>>>>>>>> I'm working on an example (well, examples) of using Mahout with the
>>>>>>>> ASF Public Data Set up on Amazon
>>>>>>>> (http://aws.amazon.com/datasets/7791434387204566) and I wanted to
>>>>>>>> show how to use the 3 "C's" (collab filtering, clustering,
>>>>>>>> classification) with the data set. Clustering and classification
>>>>>>>> are pretty straightforward, but I'm wondering about the setup
>>>>>>>> around collaborative filtering.
>>>>>>>>
>>>>>>>> The motivation for recommendations is pretty straightforward:
>>>>>>>> provide people recs on emails that they might find useful based on
>>>>>>>> what other people have interacted with. The tricky part is I am not
>>>>>>>> totally sure on a valid setup of the problem. My current thinking
>>>>>>>> is that I build up the rec. matrix based on whether someone has
>>>>>>>> interacted with (initiated/replied) a thread or not. Thus, the
>>>>>>>> columns are the thread ids and the rows are the users. Each cell
>>>>>>>> contains the count of the number of times user X has interacted
>>>>>>>> with thread Y. This feels to me like it is a stand-in for that
>>>>>>>> user's preference in that if they are replying multiple times, they
>>>>>>>> have an interest in that topic. I have no idea if this will be
>>>>>>>> effective or not, but it seems like it could be interesting. Does
>>>>>>>> it sound reasonable? I worry that even in a really large data set
>>>>>>>> as above it will simply be too sparse.
>>>>>>>>
>>>>>>>> Is there a better way to think about this from a strict
>>>>>>>> collaborative filtering context? In other words, I know I could do
>>>>>>>> content-based recommendations but that is not what I am after here.
>>>>>>>>
>>>>>>>> -Grant
>>>>>>>>
>>>>>>>> --------------------------------------------
>>>>>>>> Grant Ingersoll
>>>>>>>> http://www.lucidimagination.com
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
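
For reference, the user-by-thread count matrix described in the original message of this thread could be sketched roughly as below. This is only an illustration in plain Python, not Mahout code: the function names, the `(user, thread_id)` input format, and the sample users are all assumptions made for the example. The `density` helper is there only to put a number on the sparsity concern.

```python
from collections import Counter, defaultdict

def build_interaction_counts(messages):
    """Count messages per (user, thread) pair.

    `messages` is an iterable of (user, thread_id) tuples, one per message
    sent (hypothetical input format, assumed for this sketch).
    """
    counts = Counter()
    for user, thread_id in messages:
        counts[(user, thread_id)] += 1
    return counts

def to_sparse_rows(counts):
    """Rows are users, columns are thread ids; only non-zero cells are stored."""
    rows = defaultdict(dict)
    for (user, thread_id), n in counts.items():
        rows[user][thread_id] = n
    return dict(rows)

def density(rows, num_threads):
    """Fraction of user/thread cells that are non-zero."""
    if not rows or num_threads == 0:
        return 0.0
    filled = sum(len(cols) for cols in rows.values())
    return filled / (len(rows) * num_threads)

# Made-up sample data: each tuple is one message posted to a thread.
sample = [
    ("grant", "t1"), ("lukas", "t1"), ("ted", "t1"),
    ("lukas", "t1"), ("jeff", "t2"),
]
rows = to_sparse_rows(build_interaction_counts(sample))
print(rows["lukas"])      # cell values for one user
print(density(rows, 2))   # how full the matrix is
```

Storing only the non-zero cells keeps memory proportional to the number of interactions, which matters if the matrix is as sparse as feared.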

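
A minimal sketch of the "message id plus subject" idea discussed in this thread, to make the hijack case concrete. This is not Lucid's actual heuristic. It assumes each message is a dict with hypothetical keys `id`, `in_reply_to` and `subject`, that messages arrive sorted by date, and that a reply whose normalized subject differs from the subject of the thread it replies to should start a new thread.

```python
import re

# Leading reply/forward markers such as "Re:", "Fwd:", "AW:" (possibly repeated).
_PREFIX = re.compile(r"^\s*((re|fwd?|aw)\s*:\s*)+", re.IGNORECASE)
# Trailing rename notes such as "[was Old subject]" or "(was: Old subject)".
_WAS = re.compile(r"\s*[\[(]\s*was\b.*$", re.IGNORECASE)

def normalize_subject(subject):
    """Strip reply prefixes and '[was ...]' notes, lowercase, collapse spaces."""
    s = _WAS.sub("", subject)
    s = _PREFIX.sub("", s)
    return " ".join(s.lower().split())

def assign_threads(messages):
    """Map each message id to a thread id (the id of the thread's first message).

    A message joins its parent's thread only if the parent is known AND the
    normalized subject matches that thread's first subject. Otherwise it
    starts a new thread, which is how a hijacked reply gets split off.
    """
    thread_of = {}      # message id -> thread id
    root_subject = {}   # thread id -> normalized subject of its first message
    for msg in messages:
        subject = normalize_subject(msg["subject"])
        parent_thread = thread_of.get(msg.get("in_reply_to"))
        if parent_thread is not None and root_subject[parent_thread] == subject:
            thread = parent_thread
        else:
            thread = msg["id"]
            root_subject[thread] = subject
        thread_of[msg["id"]] = thread
    return thread_of

# Made-up message ids modelled on this thread's subject change.
mails = [
    {"id": "<1@x>", "in_reply_to": None, "subject": "Email and Collab. Filtering"},
    {"id": "<2@x>", "in_reply_to": "<1@x>", "subject": "Re: Email and Collab. Filtering"},
    {"id": "<3@x>", "in_reply_to": "<2@x>",
     "subject": "Mail thread detection [was Email and Collab. Filtering]"},
    {"id": "<4@x>", "in_reply_to": "<3@x>",
     "subject": "Re: Mail thread detection [was Email and Collab. Filtering]"},
]
for mid, tid in assign_threads(mails).items():
    print(mid, "->", tid)
```

This deliberately ignores the harder cases raised above (cross-posted branches, identical subjects in unrelated threads, threads reopened years later), so treat it as a starting point for experiments rather than a solution.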