On Thu, Aug 25, 2011 at 2:40 PM, Grant Ingersoll <[email protected]> wrote:
> I think on Lucid's mail site, we use a combination of message id, subject
> and a few other heuristics. The whole problem gets even more fun when you
> think about the fact that people can essentially reopen a thread at any
> point in the future (even years later).
>
> Ironically, this very thread will likely cause problems since it has the
> same message id, even though the subject line was partially changed.
>

Exactly, done on purpose :-)

>
> On Aug 24, 2011, at 3:15 PM, Lukáš Vlček wrote:
>
> > It is, but since other developers have obviously already been dealing
> > with this mess (especially thread identification in mail lists), I was
> > hoping that there would be some knowledge gathered ... maybe it would be
> > worth the effort to put something together, because this is an important
> > piece of knowledge that can influence search results, but people (users
> > of search interfaces) do not usually think about it in detail.
> >
> > On Wed, Aug 24, 2011 at 11:57 PM, Ted Dunning <[email protected]> wrote:
> >
> >> The short conclusion is "people and language are involved, therefore it
> >> is a bit of a mess".
> >>
> >> On Wed, Aug 24, 2011 at 2:49 PM, Lukáš Vlček <[email protected]> wrote:
> >>
> >>> Yes, it is not always reliable (especially if people reply to the email
> >>> from desktop email clients and not from the web forum page). But there
> >>> are more complex problems than this. The two most common problems are
> >>> thread hijacking and something I call a non-linear mail thread, that
> >>> is, a case when the email is also resent to a different mail list. For
> >>> example, the thread starts in Lucene but at some point someone adds
> >>> the Solr mail list to the To or Cc as well. From this point the thread
> >>> has two parallel branches (and this is still the simple case).
> >>>
> >>> Experimenting with the mail Subject text is another option, but again
> >>> one would not believe what kinds of cases or exceptions can be found
> >>> until one tries it. I have seen mails with the same subject, in the
> >>> same mail list, in about the same time window, involving the same
> >>> author and the same reply-from person, and they were not in the same
> >>> thread.
> >>>
> >>> IMHO there is no perfect solution to this problem. Doing a lot of
> >>> experiments is probably a good way to catch the most common
> >>> exceptions, but in general it is very hard to avoid these problems.
> >>> And once you (as a user of a search interface) experience these
> >>> issues, it can be quite challenging to build trust that things like
> >>> thread grouping or recommendation work well enough.
> >>>
> >>> On Wed, Aug 24, 2011 at 11:15 PM, Ted Dunning <[email protected]> wrote:
> >>>
> >>>> In the olden days, it was possible to thread together message id's in
> >>>> email threads.
> >>>>
> >>>> In the modern world of many mailing list portals that don't really do
> >>>> email in the official ways, this is more difficult than it should be.
> >>>>
> >>>> Have you tried and failed with message id's?
> >>>>
> >>>> On Wed, Aug 24, 2011 at 1:06 PM, Lukáš Vlček <[email protected]> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I would love to hear more about how exactly you detect (or define)
> >>>>> threads for emails (for example for the Lucene or Solr public mail
> >>>>> lists).
> >>>>>
> >>>>> As far as I can tell this is quite a complex problem, and based on my
> >>>>> experience with many search web tools for mail lists it is still not
> >>>>> solved. Speaking of thread-based recommendations, important
> >>>>> information can be missed if the thread is not detected correctly.
> >>>>> If this has already been solved, then please do not hesitate to point
> >>>>> me to any references.
> >>>>>
> >>>>> Regards,
> >>>>> Lukas
> >>>>>
> >>>>> On Mon, Aug 22, 2011 at 4:48 PM, Grant Ingersoll <[email protected]> wrote:
> >>>>>
> >>>>>> I'm working on an example (well, examples) of using Mahout with the
> >>>>>> ASF Public Data Set up on Amazon
> >>>>>> (http://aws.amazon.com/datasets/7791434387204566) and I wanted to
> >>>>>> show how to use the 3 "C's" (collab filtering, clustering,
> >>>>>> classification) with the data set. Clustering and classification are
> >>>>>> pretty straightforward, but I'm wondering about the setup around
> >>>>>> collaborative filtering.
> >>>>>>
> >>>>>> The motivation for recommendations is pretty straightforward: provide
> >>>>>> people recs on emails that they might find useful based on what other
> >>>>>> people have interacted with. The tricky part is I am not totally sure
> >>>>>> of a valid setup of the problem. My current thinking is that I build
> >>>>>> up the rec. matrix based on whether someone has interacted with
> >>>>>> (initiated/replied to) a thread or not. Thus, the columns are the
> >>>>>> thread ids and the rows are the users. Each cell contains the count
> >>>>>> of the number of times user X has interacted with thread Y. This
> >>>>>> feels to me like it is a stand-in for that user's preference, in that
> >>>>>> if they are replying multiple times, they have an interest in that
> >>>>>> topic. I have no idea if this will be effective or not, but it seems
> >>>>>> like it could be interesting. Does it sound reasonable? I worry that
> >>>>>> even in a really large data set as above it will simply be too sparse.
> >>>>>>
> >>>>>> Is there a better way to think about this from a strict collaborative
> >>>>>> filtering context? In other words, I know I could do content-based
> >>>>>> recommendations but that is not what I am after here.
> >>>>>>
> >>>>>> -Grant
> >>>>>>
> >>>>>> --------------------------------------------
> >>>>>> Grant Ingersoll
> >>>>>> http://www.lucidimagination.com
> >>>>>>
> >
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
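
To make the "message id plus subject heuristics" idea from this thread a bit more concrete, here is a rough Python sketch. It is only an illustration, not what Lucid's mail site actually does: the message dictionaries and their keys (id, in_reply_to, references, subject) are hypothetical, the grouping uses a plain union-find, and the subject fallback is deliberately crude. As noted above, subject matching can wrongly merge unrelated mails, and none of this handles hijacked or cross-posted threads.

```python
import re


def normalize_subject(subject):
    """Strip leading Re:/Fw:/Fwd: prefixes, lowercase, collapse whitespace."""
    s = subject.strip()
    while True:
        stripped = re.sub(r"^(re|fwd?)\s*:\s*", "", s, flags=re.IGNORECASE)
        if stripped == s:
            break
        s = stripped
    return " ".join(s.lower().split())


def group_threads(messages):
    """Map each message id to a thread root id.

    `messages` is an iterable of dicts with the hypothetical keys
    'id', 'in_reply_to', 'references' and 'subject'.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    subject_root = {}  # normalized subject -> first message id seen without refs
    for m in messages:
        find(m["id"])
        refs = list(m.get("references") or [])
        if m.get("in_reply_to"):
            refs.append(m["in_reply_to"])
        for ref in refs:
            # Primary signal: the message id chain.
            union(ref, m["id"])
        if not refs:
            # Fallback for messages without any reference headers:
            # join on the normalized subject (an unreliable heuristic).
            key = normalize_subject(m.get("subject", ""))
            if key in subject_root:
                union(subject_root[key], m["id"])
            else:
                subject_root[key] = m["id"]
    return {m["id"]: find(m["id"]) for m in messages}


# Tiny made-up example: "c" lost its reference headers but kept the subject.
messages = [
    {"id": "a", "subject": "Thread detection"},
    {"id": "b", "subject": "Re: Thread detection", "in_reply_to": "a"},
    {"id": "c", "subject": "RE: thread detection"},
    {"id": "d", "subject": "Other topic"},
]
threads = group_threads(messages)
```

In this toy data, a, b and c end up in one thread and d in its own. A real setup would probably also want a time window on the subject fallback, since a thread can be reopened years later.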

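The user-by-thread count matrix from the original question could be built along these lines. Again, this is a minimal sketch under assumptions: the input is taken to be a list of (user, thread id) pairs, one per message sent, the function names are made up for illustration, and the data is invented. It stores the matrix sparsely as nested counters and reports the density, which is a quick way to check the "too sparse" worry before feeding anything into a recommender.

```python
from collections import Counter, defaultdict


def build_interaction_counts(events):
    """Rows are users, columns are thread ids, cells are interaction counts.

    `events` is an iterable of (user, thread_id) pairs, one pair per
    message the user sent (initial post or reply) in that thread.
    """
    counts = defaultdict(Counter)
    for user, thread_id in events:
        counts[user][thread_id] += 1
    return counts


def density(counts):
    """Fraction of (user, thread) cells that are non-zero."""
    users = len(counts)
    threads = len({t for row in counts.values() for t in row})
    filled = sum(len(row) for row in counts.values())
    return filled / (users * threads) if users and threads else 0.0


# Invented sample data, only to show the shape of the result.
events = [
    ("grant", "t1"),
    ("ted", "t1"),
    ("lukas", "t1"),
    ("lukas", "t1"),
    ("ted", "t2"),
]
counts = build_interaction_counts(events)
matrix_density = density(counts)
```

Here counts["lukas"]["t1"] is 2, because that user wrote twice in thread t1. Note that the thread ids themselves depend on how well thread detection works, so errors there carry straight into the matrix.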