Re: Generating a Document Similarity Matrix

Sebastian Schelter Tue, 08 Jun 2010 15:40:07 -0700

Hi Kris,

actually, the code that computes the item-to-item similarities in the
collaborative filtering part of Mahout (which at first glance looks like
a totally different problem from yours) is based on a paper that computes
the pairwise similarity of text documents in a very simple way.
Maybe that could be helpful to you:


Elsayed et al: Pairwise Document Similarity in Large Collections with
MapReduce
http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
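
To give a rough idea of the approach in that paper, here is a minimal
single-machine sketch in Python. It is not Mahout's actual code and it
does not use MapReduce; the function name pairwise_similarity and the
input layout (document id mapped to a dict of term weights) are
assumptions made purely for illustration. The idea is to build an
inverted index first, and then let every term contribute the product of
its weights to each pair of documents that share it:

```python
from collections import defaultdict
from itertools import combinations


def pairwise_similarity(doc_term_weights):
    """Sum of term-weight products for every pair of documents sharing a term.

    doc_term_weights: {doc_id: {term: weight}} (hypothetical input layout).
    Returns {(doc_a, doc_b): similarity} with doc_a < doc_b.
    """
    # Phase 1: inverted index, term -> list of (doc_id, weight).
    postings = defaultdict(list)
    for doc_id, weights in doc_term_weights.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))

    # Phase 2: each term adds weight_a * weight_b to every document pair
    # that appears together in its postings list.
    similarities = defaultdict(float)
    for plist in postings.values():
        for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(plist), 2):
            similarities[(doc_a, doc_b)] += w_a * w_b
    return dict(similarities)


if __name__ == "__main__":
    docs = {
        "a": {"x": 1.0, "y": 2.0},
        "b": {"x": 3.0},
        "c": {"x": 1.0, "y": 0.5},
    }
    for pair, score in sorted(pairwise_similarity(docs).items()):
        print(pair, score)
```

Documents that share no term never show up as a pair, which is what
keeps the computation manageable on sparse text. Very frequent terms
produce long postings lists and therefore many pairs, so in practice you
would probably want to drop or down-weight them before this step.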

-sebastian


2010/6/8 Kris Jack <[email protected]>

> Hi everyone,
>
> I currently use lucene's moreLikeThis function through solr to find
> documents that are related to one another.  A single call, however, takes
> around 4 seconds to complete and I would like to reduce this.  I got to
> thinking that I might be able to use Mahout to generate a document
> similarity matrix offline that could then be looked-up in real time for
> serving.  Is this a reasonable use of Mahout?  If so, what functions will
> generate a document similarity matrix?  Also, I would like to be able to
> keep the text processing advantages provided through lucene so it would
> help if I could still use my lucene index.  If not, then could you
> recommend any alternative solutions please?
>
> Many thanks,
> Kris
>
