I think that the FAQ clustering project used mostly subject lines and they got quite usable results.
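
A minimal sketch of grouping messages by subject line, in Python. It only illustrates the idea and is not what the FAQ clustering project actually did. The prefix pattern and the names `normalize_subject` and `group_by_subject` are assumptions made up for this example. As pointed out further down in the thread, mails with the same subject are not always in the same thread, so the groups are candidates at best.

```python
import re
from collections import defaultdict

# Hypothetical prefix pattern: strips leading "Re:", "Fw:", "Fwd:", "AW:"
# and bracketed list tags such as "[mahout-user]". Real archives will
# need more cases than this.
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?|aw)\s*:|\[[^\]]*\])\s*", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes and list tags, lowercase, squeeze spaces."""
    previous = None
    # Repeat until nothing changes, so stacked prefixes are all removed.
    while previous != subject:
        previous = subject
        subject = _PREFIX.sub("", subject)
    return " ".join(subject.lower().split())


def group_by_subject(messages):
    """Group (message_id, subject) pairs by normalized subject."""
    groups = defaultdict(list)
    for message_id, subject in messages:
        groups[normalize_subject(subject)].append(message_id)
    return dict(groups)
```

Calling `group_by_subject` on a list of `(message_id, subject)` pairs returns a dict from normalized subject to the message ids that share it.
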

On Wed, Aug 24, 2011 at 3:15 PM, Lukáš Vlček wrote:

> It is, but since other developers have obviously already been dealing
> with this mess (especially thread identification in mail lists), I was
> hoping that there would be some knowledge gathered ... maybe it would
> be worth the effort to put something together, because this is an
> important piece of knowledge that can influence search results, but
> people (users of search interfaces) do not usually think about it in
> detail.
>
> On Wed, Aug 24, 2011 at 11:57 PM, Ted Dunning wrote:
>
> > The short conclusion is "people and language are involved, therefore
> > it is a bit of a mess".
> >
> > On Wed, Aug 24, 2011 at 2:49 PM, Lukáš Vlček wrote:
> >
> > > Yes, it is not always reliable (especially if people reply to the
> > > email from desktop email clients and not from the web forum page).
> > > But there are more complex problems than this. The two most common
> > > problems are thread hijacking and something I call a non-linear
> > > mail thread, that is, a case when the email is also resent to a
> > > different mail list. For example, the thread starts in Lucene but
> > > at some point someone adds the Solr mail list to the To or Cc as
> > > well. From this point the thread has two parallel branches (and
> > > this is still the simple case).
> > >
> > > Experimenting with the mail Subject text is another option, but
> > > again one would not believe what kind of cases or exceptions can
> > > be found until one tries it. I have seen mails with the same
> > > subject, in the same mail list, in about the same time window,
> > > involving the same author and the same reply-from person, and
> > > they were not in the same thread.
> > >
> > > IMHO I do not think there is any perfect solution to this problem.
> > > Doing a lot of experiments is probably a good way to catch the
> > > most common exceptions, but in general it is very hard to avoid
> > > these problems. And once you (as a user of a search interface)
> > > experience these issues, it can be quite challenging to build
> > > trust that things like thread grouping or recommendation work
> > > well enough.
> > >
> > > On Wed, Aug 24, 2011 at 11:15 PM, Ted Dunning wrote:
> > >
> > > > In the olden days, it was possible to thread together message
> > > > id's in email threads.
> > > >
> > > > In the modern world of many mailing list portals that don't
> > > > really do email in the official ways, this is more difficult
> > > > than it should be.
> > > >
> > > > Have you tried and failed with message id's?
> > > >
> > > > On Wed, Aug 24, 2011 at 1:06 PM, Lukáš Vlček wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I would love to hear more about how exactly you detect (or
> > > > > define) threads for emails (for example for the Lucene or
> > > > > Solr public mail lists).
> > > > >
> > > > > As far as I can tell this is quite a complex problem, and
> > > > > based on my experience with many search web tools for mail
> > > > > lists it is still not solved. Speaking of thread-based
> > > > > recommendations, important information can be missed if the
> > > > > thread is not detected correctly.
> > > > > If this has already been solved, then please do not hesitate
> > > > > to point me to any references.
> > > > >
> > > > > Regards,
> > > > > Lukas
> > > > >
> > > > > On Mon, Aug 22, 2011 at 4:48 PM, Grant Ingersoll wrote:
> > > > >
> > > > > > I'm working on an example (well, examples) of using Mahout
> > > > > > with the ASF Public Data Set up on Amazon
> > > > > > (http://aws.amazon.com/datasets/7791434387204566) and I
> > > > > > wanted to show how to use the 3 "C's" (collab filtering,
> > > > > > clustering, classification) with the data set. Clustering
> > > > > > and classification are pretty straightforward, but I'm
> > > > > > wondering about the setup around collaborative filtering.
> > > > > >
> > > > > > The motivation for recommendations is pretty
> > > > > > straightforward: provide people recs on emails that they
> > > > > > might find useful based on what other people have
> > > > > > interacted with. The tricky part is I am not totally sure
> > > > > > on a valid setup of the problem. My current thinking is
> > > > > > that I build up the rec. matrix based on whether someone
> > > > > > has interacted with (initiated/replied to) a thread or not.
> > > > > > Thus, the columns are the thread ids and the rows are the
> > > > > > users. Each cell contains the count of the number of times
> > > > > > user X has interacted with thread Y. This feels to me like
> > > > > > it is a stand-in for that user's preference, in that if
> > > > > > they are replying multiple times, they have an interest in
> > > > > > that topic. I have no idea if this will be effective or
> > > > > > not, but it seems like it could be interesting. Does it
> > > > > > sound reasonable? I worry that even in a really large data
> > > > > > set as above it will simply be too sparse.
> > > > > >
> > > > > > Is there a better way to think about this from a strict
> > > > > > collaborative filtering context? In other words, I know I
> > > > > > could do content-based recommendations but that is not what
> > > > > > I am after here.
> > > > > >
> > > > > > -Grant
> > > > > >
> > > > > > --------------------------------------------
> > > > > > Grant Ingersoll
> > > > > > http://www.lucidimagination.com
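
The user-by-thread count matrix Grant describes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is taken to be a list of `(user, thread_id)` pairs, one per message, and the names `interaction_counts` and `density` are hypothetical. It does not use Mahout. Only nonzero cells are stored, which also makes it easy to check the sparsity worry directly.

```python
from collections import Counter


def interaction_counts(events):
    """Count how many times each user posted in each thread.

    `events` is an iterable of (user, thread_id) pairs, one per message
    the user initiated or replied with (an assumed input format).
    The result maps (user, thread_id) to the count; zero cells are
    simply absent.
    """
    return Counter(events)


def density(counts):
    """Fraction of cells in the user x thread matrix that are nonzero."""
    users = {user for user, _ in counts}
    threads = {thread for _, thread in counts}
    total_cells = len(users) * len(threads)
    return len(counts) / total_cells if total_cells else 0.0
```

A density close to zero on the real data would support the concern that the matrix is too sparse for count-based preferences alone.
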

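
For the message id approach Ted asks about in the thread above, a rough Python sketch follows. It assumes each message has already been reduced to a `(message_id, parent_id)` pair, where `parent_id` is the value of the In-Reply-To header or `None`. The function name `thread_roots` is hypothetical. It does not handle the harder cases Lukáš mentions (thread hijacking, cross-posted threads, clients that drop the header).

```python
def thread_roots(messages):
    """Map each Message-ID to the root Message-ID of its thread.

    `messages` is an iterable of (message_id, parent_id) pairs, where
    parent_id is the In-Reply-To value or None (an assumed input format).
    A parent that is missing from the archive is treated as the root.
    """
    parent = {mid: pid for mid, pid in messages}

    def root(mid):
        seen = set()
        # Walk up the reply chain; `seen` only guards against loops
        # caused by malformed headers.
        while parent.get(mid) is not None and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    return {mid: root(mid) for mid in parent}
```

Messages that share a root can then be given the same thread id before building any per-thread statistics.
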