I think that the FAQ clustering project used mostly subject lines and they
got quite usable results.
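
For illustration only, here is a rough sketch of what grouping messages by
subject line could look like. It is not the FAQ clustering project's code; the
function names (normalize_subject, group_by_subject), the list of reply
prefixes, and the (message id, subject) input shape are all assumptions made
for this example. As noted further down the thread, identical subjects do not
guarantee the same thread, so treat the groups as candidates only.

```python
import re
from collections import defaultdict

# Assumed reply/forward markers: "Re:", "Fw:", "Fwd:", "AW:" (case-insensitive).
_PREFIX = re.compile(r"^\s*(?:re|fwd?|aw)\s*:\s*", re.IGNORECASE)
# Assumed mailing-list tag at the start of a subject, e.g. "[mahout-user]".
_LIST_TAG = re.compile(r"^\s*\[[^\]]+\]\s*")


def normalize_subject(subject):
    """Strip leading reply markers and list tags, collapse whitespace, lowercase."""
    prev = None
    s = subject
    # Repeat until nothing more can be stripped from the front of the subject.
    while s != prev:
        prev = s
        s = _PREFIX.sub("", s)
        s = _LIST_TAG.sub("", s)
    return " ".join(s.split()).lower()


def group_by_subject(messages):
    """Group (message_id, subject) pairs by their normalized subject."""
    groups = defaultdict(list)
    for msg_id, subject in messages:
        groups[normalize_subject(subject)].append(msg_id)
    return dict(groups)
```

A time-window or sender check on top of this would probably be needed before
trusting any group as a real thread.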

On Wed, Aug 24, 2011 at 3:15 PM, Lukáš Vlček <[email protected]> wrote:

> It is, but since other developers have obviously already been dealing with
> this mess (especially thread identification in mail lists), I was hoping
> there would be some knowledge gathered. Maybe it would be worth the
> effort to put something together, because this is an important piece of
> knowledge that can influence search results, yet people (users of search
> interfaces) do not usually think about it in detail.
>
> On Wed, Aug 24, 2011 at 11:57 PM, Ted Dunning <[email protected]>
> wrote:
>
> > The short conclusion is "people and language are involved, therefore it is
> > a bit of a mess".
> >
> >
> >
> > On Wed, Aug 24, 2011 at 2:49 PM, Lukáš Vlček <[email protected]>
> > wrote:
> >
> > > Yes, it is not always reliable (especially if people reply to the email
> > > from desktop email clients and not from the web forum page). But there are
> > > more complex problems than this. The two most common problems are thread
> > > hijacking and something I call a non-linear mail thread, that is, a case
> > > where the email is also sent to a different mail list. For example, the
> > > thread starts in Lucene but at some point someone adds the Solr mail
> > > list to the To or Cc as well. From this point the thread has two parallel
> > > branches (and this is still the simple case).
> > >
> > > Experimenting with the mail Subject text is another option, but again one
> > > would not believe what kinds of cases and exceptions can be found until
> > > one tries it. I have seen mails with the same subject, in the same mail
> > > list, in about the same time window, involving the same author and the
> > > same reply-from person, and they were not in the same thread.
> > >
> > > IMHO there is no perfect solution to this problem. Doing a
> > > lot of experiments is probably a good way to catch the most common
> > > exceptions, but in general it is very hard to avoid these problems. And
> > > once you (as a user of a search interface) experience these issues, it can
> > > be quite challenging to build trust that things like thread grouping or
> > > recommendation work well enough.
> > >
> > > On Wed, Aug 24, 2011 at 11:15 PM, Ted Dunning <[email protected]>
> > > wrote:
> > >
> > > > In the olden days, it was possible to thread together message id's in
> > > > email threads.
> > > >
> > > > In the modern world of many mailing list portals that don't really do
> > > > email in the official ways, this is more difficult than it should be.
> > > >
> > > > Have you tried and failed with message id's?
> > > >
> > > > On Wed, Aug 24, 2011 at 1:06 PM, Lukáš Vlček <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I would love to hear more about how exactly you detect (or define)
> > > > > threads for emails (for example for Lucene or Solr public mail lists).
> > > > >
> > > > > As far as I can tell this is quite a complex problem and, based on my
> > > > > experience with many web search tools for mail lists, it is still not
> > > > > solved. For thread-based recommendations, important information can be
> > > > > missed if the thread is not detected correctly.
> > > > > If this has already been solved then please do not hesitate to point
> > > > > me to any references.
> > > > >
> > > > > Regards,
> > > > > Lukas
> > > > >
> > > > > > On Mon, Aug 22, 2011 at 4:48 PM, Grant Ingersoll <[email protected]>
> > > > > > wrote:
> > > > >
> > > > > > > I'm working on an example (well, examples) of using Mahout with
> > > > > > > the ASF Public Data Set up on Amazon (
> > > > > > > http://aws.amazon.com/datasets/7791434387204566) and I wanted to
> > > > > > > show how to use the 3 "C's" (collab filtering, clustering,
> > > > > > > classification) with the data set.  Clustering and classification
> > > > > > > are pretty straightforward, but I'm wondering about the setup
> > > > > > > around collaborative filtering.
> > > > > >
> > > > > > > The motivation for recommendations is pretty straightforward:
> > > > > > > provide people recs on emails that they might find useful based on
> > > > > > > what other people have interacted with.  The tricky part is I am
> > > > > > > not totally sure on a valid setup of the problem.  My current
> > > > > > > thinking is that I build up the rec. matrix based on whether
> > > > > > > someone has interacted with (initiated/replied) a thread or not.
> > > > > > > Thus, the columns are the thread ids and the rows are the users.
> > > > > > > Each cell contains the count of the number of times user X has
> > > > > > > interacted with thread Y.  This feels to me like it is a stand-in
> > > > > > > for that user's preference in that if they are replying multiple
> > > > > > > times, they have an interest in that topic.  I have no idea if
> > > > > > > this will be effective or not, but it seems like it could be
> > > > > > > interesting.  Does it sound reasonable?  I worry that even in a
> > > > > > > really large data set as above it will simply be too sparse.
> > > > > >
> > > > > > > Is there a better way to think about this from a strict
> > > > > > > collaborative filtering context?  In other words, I know I could
> > > > > > > do content-based recommendations but that is not what I am after
> > > > > > > here.
> > > > > >
> > > > > > -Grant
> > > > > >
> > > > > > --------------------------------------------
> > > > > > Grant Ingersoll
> > > > > > http://www.lucidimagination.com
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
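
The user-by-thread count matrix described in the original message at the
bottom of this thread could be sketched roughly as follows. This is a minimal
plain-Python illustration, not Mahout code; the function names
(interaction_counts, to_matrix, density) and the (user, thread id) event
format are assumptions made for this example. It presumes thread ids have
already been assigned somehow, which is the hard part discussed above.

```python
from collections import Counter


def interaction_counts(events):
    """Count how many times each user posted to each thread.

    events: iterable of (user, thread_id) pairs, one pair per message sent
    (either starting the thread or replying to it).
    """
    return Counter(events)


def to_matrix(counts):
    """Lay the counts out with users as rows and thread ids as columns."""
    users = sorted({u for u, _ in counts})
    threads = sorted({t for _, t in counts})
    matrix = [[counts.get((u, t), 0) for t in threads] for u in users]
    return users, threads, matrix


def density(matrix):
    """Fraction of non-zero cells, a quick check on the sparsity concern."""
    cells = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value)
    return nonzero / cells if cells else 0.0
```

On a real mailing list the dense list-of-lists would be far too large; the
Counter itself is already a sparse representation and is probably the form
worth feeding to a recommender.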
