Hi Kris,

The code that computes the item-to-item similarities in the collaborative filtering part of Mahout (which at first glance looks like a totally different problem from yours) is actually based on a paper that deals with computing the pairwise similarity of text documents in a very simple way. Maybe that could be helpful to you:
Elsayed et al.: Pairwise Document Similarity in Large Collections with MapReduce
http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf

-sebastian

2010/6/8 Kris Jack <[email protected]>

> Hi everyone,
>
> I currently use lucene's moreLikeThis function through solr to find
> documents that are related to one another. A single call, however, takes
> around 4 seconds to complete and I would like to reduce this. I got to
> thinking that I might be able to use Mahout to generate a document
> similarity matrix offline that could then be looked-up in real time for
> serving. Is this a reasonable use of Mahout? If so, what functions will
> generate a document similarity matrix? Also, I would like to be able to
> keep the text processing advantages provided through lucene so it would
> help if I could still use my lucene index. If not, then could you
> recommend any alternative solutions please?
>
> Many thanks,
> Kris
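
In case a concrete picture helps, here is a minimal single-machine sketch of the idea in the Elsayed et al. paper: build an inverted index of term weights, then for every term add the product of the weights to the score of each pair of documents that share that term. This is not Mahout's API. The function name pairwise_similarity and the input layout are made up for illustration, and the weights are assumed to come from somewhere else (for example tf-idf values taken from an existing Lucene index).

```python
from collections import defaultdict
from itertools import combinations


def pairwise_similarity(doc_term_weights):
    """Hypothetical helper, not part of Mahout or Lucene.

    doc_term_weights: {doc_id: {term: weight}}
    Returns {(doc_a, doc_b): score} with doc_a < doc_b, where score is the
    sum over shared terms of weight_in_a * weight_in_b.
    """
    # Phase 1 (the "indexing" job in the paper): term -> [(doc_id, weight)]
    postings = defaultdict(list)
    for doc_id, weights in doc_term_weights.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))

    # Phase 2 (the "pairwise similarity" job): each term contributes a
    # partial product to every pair of documents in its postings list.
    scores = defaultdict(float)
    for plist in postings.values():
        for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(plist), 2):
            scores[(doc_a, doc_b)] += w_a * w_b
    return dict(scores)


if __name__ == "__main__":
    docs = {
        "d1": {"mahout": 1.0, "lucene": 2.0},
        "d2": {"lucene": 3.0, "solr": 1.0},
        "d3": {"mahout": 2.0, "solr": 4.0},
    }
    for pair, score in sorted(pairwise_similarity(docs).items()):
        print(pair, score)
```

Pairs of documents that share no term never appear in the result, which is what keeps the offline matrix sparse enough to store and look up at serving time. Very frequent terms produce long postings lists and dominate the cost, so dropping them before the second phase is a common shortcut.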
