Hi Sebastian,

Thanks for the reference. I had a look through the paper and it's certainly very relevant to the problem that I'm trying to solve. Do you think the CF functionality could be co-opted to output such document similarities as it stands, or will it require modification? If it can be used as is, say to give the top 25 most related documents for each document, how would you suggest that I go about this?

Thanks,
Kris

2010/6/8 Sebastian Schelter <[email protected]>

> Hi Kris,
>
> actually the code to compute the item-to-item similarities in the
> collaborative filtering part of mahout (which at the first look seems to be
> a totally different problem than yours) is based on a paper that deals with
> computing the pairwise similarity of text documents in a very simple way.
> Maybe that could be helpful to you:
>
> Elsayed et al: Pairwise Document Similarity in Large Collections with
> MapReduce
>
> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
>
> -sebastian
>
>
> 2010/6/8 Kris Jack <[email protected]>
>
> > Hi everyone,
> >
> > I currently use lucene's moreLikeThis function through solr to find
> > documents that are related to one another. A single call, however, takes
> > around 4 seconds to complete and I would like to reduce this. I got to
> > thinking that I might be able to use Mahout to generate a document
> > similarity matrix offline that could then be looked-up in real time for
> > serving. Is this a reasonable use of Mahout? If so, what functions will
> > generate a document similarity matrix? Also, I would like to be able to
> > keep the text processing advantages provided through lucene so it would
> > help if I could still use my lucene index. If not, then could you
> > recommend any alternative solutions please?
> >
> > Many thanks,
> > Kris
> >

--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
