The short conclusion is "people and language are involved, therefore it is a bit of a mess".
On Wed, Aug 24, 2011 at 2:49 PM, Lukáš Vlček <[email protected]> wrote:

> Yes, it is not always reliable (especially if ppl reply to the email from desktop email clients and not from the web forum page). But there are more complex problems than this. The two most common problems are also thread hijacking and something what I call non-linear mail thread, that is a case when the email is resent also to a different mail list. For example the thread starts in Lucene but at some point in time someone adds Solr mail list to the To or Cc as well. From this point the thread has two parallel branches (and still this is the simple case).
>
> Experimenting with mail Subject text is another option but again one would not believe what kind of cases/or exceptions can be found until he tries it.
> I have seen mails with the same subject, in the same mail list, in about the same time window, involving the same author and the same reply-from person and they were not in the same thread.
>
> IMHO I do not think there is any perfect solution to this problem. Doing a lot of experiments is probably a good way how to catch the most common exceptions but in general it is very hard to avoid these problems. And once you (as a user of a search interface) experience these issues it can be quite challenging to build a trust that things like thread grouping or recommendation works well enough.
>
> On Wed, Aug 24, 2011 at 11:15 PM, Ted Dunning <[email protected]> wrote:
>
> > In the olden days, it was possible to thread together message id's in email threads.
> >
> > In the modern world of many mailing list portals that don't really do email in the official ways, this is more difficult than it should be.
> >
> > Have you tried and failed with message id's?
> >
> > On Wed, Aug 24, 2011 at 1:06 PM, Lukáš Vlček <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I would love to hear more about how exactly you detect (or define) threads for emails (for example for Lucene or Solr public mail lists).
> > >
> > > As far as I can tell this is quite complex problem and based on my experience with many search web tools for mail lists this is still not solved. Speaking about thread based recommendations there can be missed important information if the thread is not detected correctly.
> > > If this has been already solved then please do not hesitate to point me to any references.
> > >
> > > Regards,
> > > Lukas
> > >
> > > On Mon, Aug 22, 2011 at 4:48 PM, Grant Ingersoll <[email protected]> wrote:
> > >
> > > > I'm working on an example (well, examples) of using Mahout with the ASF Public Data Set up on Amazon (http://aws.amazon.com/datasets/7791434387204566) and I wanted to show how to use the 3 "C's" (collab filtering, clustering, classification) with the data set. Clustering and classification are pretty straightforward, but I'm wondering about the setup around collaborative filtering.
> > > >
> > > > The motivation for recommendations is pretty straightforward: provide people recs on emails that they might find useful based on what other people have interacted with. The tricky part is I am not totally sure on a valid setup of the problem. My current thinking is that I build up the rec. matrix based on whether someone has interacted with (initiated/replied) a thread or not. Thus, the columns are the thread ids and the rows are the users. Each cell contains the count of the number of times user X has interacted with thread Y. This feels to me like it is a stand in for that user's preference in that if they are replying multiple times, they have an interest in that topic. I have no idea if this will be effective or not, but it seems like it could be interesting. Does it sound reasonable? I worry that even in a really large data set as above it will simply be too sparse.
> > > >
> > > > Is there a better way to think about this from a strict collaborative filtering context? In other words, I know I could do content-based recommendations but that is not what I am after here.
> > > >
> > > > -Grant
> > > >
> > > > --------------------------------------------
> > > > Grant Ingersoll
> > > > http://www.lucidimagination.com
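
For illustration only, here is a minimal Python sketch of the two ideas discussed in the quoted messages: threading by message id (Ted's suggestion) and the user-by-thread count matrix (Grant's setup). It is not anyone's actual implementation. The dict keys (`id`, `sender`, `in_reply_to`, `references`) and both function names are hypothetical; the only real-world assumption is that mail carries the standard `In-Reply-To` and `References` headers. It does nothing about thread hijacking or cross-posted threads, which are exactly the cases described above as hard.

```python
from collections import Counter, defaultdict


def assign_threads(messages):
    """Map each message id to a thread id (the root id it can be traced to)."""
    parent = {}
    for m in messages:
        # Prefer In-Reply-To; fall back to the last entry of References.
        ref = m.get("in_reply_to") or (m.get("references") or [None])[-1]
        parent[m["id"]] = ref

    def root(mid):
        seen = set()  # guards against cycles in malformed headers
        while parent.get(mid) is not None and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    return {m["id"]: root(m["id"]) for m in messages}


def interaction_counts(messages, thread_of):
    """Rows are users, columns are thread ids, cells are interaction counts."""
    counts = defaultdict(Counter)
    for m in messages:
        counts[m["sender"]][thread_of[m["id"]]] += 1
    return counts


# Made-up sample data, not taken from the ASF data set.
messages = [
    {"id": "<1@a>", "sender": "grant", "in_reply_to": None, "references": []},
    {"id": "<2@a>", "sender": "lukas", "in_reply_to": "<1@a>", "references": ["<1@a>"]},
    {"id": "<3@a>", "sender": "ted", "in_reply_to": None, "references": ["<1@a>", "<2@a>"]},
    {"id": "<4@a>", "sender": "lukas", "in_reply_to": "<3@a>", "references": []},
    {"id": "<5@b>", "sender": "ted", "in_reply_to": None, "references": []},
]

thread_of = assign_threads(messages)
counts = interaction_counts(messages, thread_of)
```

With this sample, the two replies from `lukas` land in the thread rooted at `<1@a>`, so his cell for that thread is 2. Most cells in a real matrix would be empty, which is the sparsity worry raised in the original message.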
