Ah yes. I would love for us to have an implementation of that pairwise similarity code. It would be useful for lots of things in Mahout, yes!
-jake

On Tue, Jun 8, 2010 at 4:21 PM, Sebastian Schelter <[email protected]> wrote:

> I did not wanna say you can use the item-item-similarity code from CF for
> computing the document similarities, I just wanted to point out that these
> problems are closely related and that the paper which the CF code is based
> on is dealing with the computation of pairwise document similarities and
> could therefore be helpful.
>
> -sebastian
>
> 2010/6/9 Jake Mannix <[email protected]>
>
> > The code in mahout CF is doing that? I don't think that's right, we don't
> > do anything that fancy right now, do we Sean?
> >
> > -jake
> >
> > On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter
> > <[email protected]> wrote:
> >
> > > Hi Kris,
> > >
> > > actually the code to compute the item-to-item similarities in the
> > > collaborative filtering part of mahout (which at the first look seems
> > > to be a totally different problem than yours) is based on a paper that
> > > deals with computing the pairwise similarity of text documents in a
> > > very simple way. Maybe that could be helpful to you:
> > >
> > > Elsayed et al: Pairwise Document Similarity in Large Collections with
> > > MapReduce
> > >
> > > http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> > >
> > > -sebastian
> > >
> > > 2010/6/8 Kris Jack <[email protected]>
> > >
> > > > Hi everyone,
> > > >
> > > > I currently use lucene's moreLikeThis function through solr to find
> > > > documents that are related to one another. A single call, however,
> > > > takes around 4 seconds to complete and I would like to reduce this.
> > > > I got to thinking that I might be able to use Mahout to generate a
> > > > document similarity matrix offline that could then be looked-up in
> > > > real time for serving. Is this a reasonable use of Mahout? If so,
> > > > what functions will generate a document similarity matrix? Also, I
> > > > would like to be able to keep the text processing advantages provided
> > > > through lucene so it would help if I could still use my lucene index.
> > > > If not, then could you recommend any alternative solutions please?
> > > >
> > > > Many thanks,
> > > > Kris
