Hi Jake,

Thanks for that. I'll give it a try with cosine similarity first and, as I get more experience, I'll try to implement some other similarity methods.
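In case it's useful, this is roughly what I have in mind for the unit-length normalization step that makes the multiplication come out as cosine similarity. It's a rough, untested sketch: it assumes the matrix rows on HDFS are a SequenceFile keyed by IntWritable, the paths are placeholders, and it runs as a single local pass purely for clarity.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Rewrites a SequenceFile of document vectors so that every row has unit
 * L2 norm; the subsequent matrix multiply then yields cosine similarities.
 */
public class UnitNormalize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);   // e.g. the raw document-term matrix
    Path out = new Path(args[1]);  // normalized copy to feed into the multiply

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, IntWritable.class, VectorWritable.class);

    IntWritable docId = new IntWritable();
    VectorWritable row = new VectorWritable();
    while (reader.next(docId, row)) {
      Vector normalized = row.get().normalize(2); // unit-length (L2) row
      writer.append(docId, new VectorWritable(normalized));
    }
    reader.close();
    writer.close();
  }
}

For 10 million documents this would obviously need to be folded into a map job rather than run locally, but the normalize(2) call on each row is the important part.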
Kris

2010/6/9 Jake Mannix <[email protected]>

> On Wed, Jun 9, 2010 at 5:11 AM, Kris Jack <[email protected]> wrote:
>
> > Hi,
> >
> > Thanks very much for the code. In implementing it, I got a little stuck on specifying the JobConf's similarity job -
> >
> > JobConf conf = new JobConf("similarity job");
> >
> > I assume that I should define here how I would like two vectors to be compared with one another? Please do correct me if that's wrong. If so, however, could you point me to any examples of what this code should look like (e.g. cosine similarity)? I'm sure that these kinds of similarity jobs must already exist in Mahout but, being new to both Mahout and MapReduce, I'm not sure where to find them.
>
> In the sample I mentioned (using sparse matrix multiplication), you don't get to choose the similarity - if the input vectors are unit-length normalized, then the computation is cosine similarity. You would have to write your own map-reduce job to do a different one.
>
>   -jake
>
> > Thanks,
> > Kris
> >
> > 2010/6/8 Jake Mannix <[email protected]>
> >
> > > Hi Kris,
> > >
> > > So you already know how to make a sparse feature matrix out of your Solr index, based on Grant's instructions? Once you have that matrix loaded into HDFS, then the following Java code should create a document-document similarity matrix:
> > >
> > > //
> > > String inputPath = "/path/to/matrix/on/hdfs";
> > > String tmpPath = "/tmp/matrixmultiplyspace";
> > > int numDocuments = // whatever your numDocuments is
> > > int numTerms = // total number of terms in the matrix
> > >
> > > DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
> > >     tmpPath, numDocuments, numTerms);
> > > JobConf conf = new JobConf("similarity job");
> > > text.configure(conf);
> > >
> > > DistributedRowMatrix transpose = text.transpose();
> > >
> > > DistributedRowMatrix similarity = transpose.times(transpose);
> > > System.out.println("Similarity matrix lives: " + similarity.getRowPath());
> > > //
> > >
> > > Now, the rows of this similarity matrix are going to be way too dense, so you'll want to write a small map-reduce job (well, no reduce is necessary) to run over this matrix and trim out all the unuseful entries of each row, but that shouldn't be too hard to do.
> > >
> > > Of course, to do it really efficiently, that functionality could be folded into the reducer of the matrix multiply job, and done in the same pass over the data as that one.
> > >
> > >   -jake
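For the trimming job Jake describes, a minimal map-only sketch might look like the following. It assumes the similarity rows come out as a SequenceFile of IntWritable / VectorWritable pairs (as DistributedRowMatrix stores its rows), uses the old org.apache.hadoop.mapred API to match the JobConf above, and MAX_SIMILAR is just a placeholder cutoff; exact class and package names may differ between Mahout versions, so treat it as untested.

import java.io.IOException;
import java.util.Comparator;
import java.util.Iterator;
import java.util.PriorityQueue;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Map-only job: keeps only the top MAX_SIMILAR entries of each row of the
 * similarity matrix, dropping everything else (including self-similarity).
 */
public class TrimSimilarityMapper extends MapReduceBase
    implements Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private static final int MAX_SIMILAR = 50; // tune to taste

  public void map(IntWritable docId, VectorWritable row,
                  OutputCollector<IntWritable, VectorWritable> out,
                  Reporter reporter) throws IOException {
    Vector similarities = row.get();
    // smallest-on-top queue of (value, index) pairs, capped at MAX_SIMILAR
    PriorityQueue<double[]> top = new PriorityQueue<double[]>(MAX_SIMILAR + 1,
        new Comparator<double[]>() {
          public int compare(double[] a, double[] b) {
            return Double.compare(a[0], b[0]);
          }
        });
    Iterator<Vector.Element> it = similarities.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      if (e.index() == docId.get()) {
        continue; // skip self-similarity
      }
      top.add(new double[] { e.get(), e.index() });
      if (top.size() > MAX_SIMILAR) {
        top.poll(); // drop the current smallest
      }
    }
    Vector trimmed = new RandomAccessSparseVector(similarities.size(), MAX_SIMILAR);
    for (double[] pair : top) {
      trimmed.setQuick((int) pair[1], pair[0]);
    }
    out.collect(docId, new VectorWritable(trimmed));
  }
}

Wiring it up should just be a normal JobConf with setNumReduceTasks(0), reading from similarity.getRowPath() and writing wherever the trimmed matrix should live.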
> > > On Tue, Jun 8, 2010 at 9:44 AM, Kris Jack <[email protected]> wrote:
> > >
> > > > Hi Jake,
> > > >
> > > > Thanks for that. The first solution that you suggest is more like what I was imagining.
> > > >
> > > > Please excuse me, I'm new to Mahout and don't know how to use it to generate the full document-document similarity matrix. I would rather not have to re-implement the moreLikeThis algorithm which, although rather straightforward, may take time for a newbie to MapReduce like me. Could you guide me a little in finding the relevant Mahout code for generating the matrix, or is it not really designed for that?
> > > >
> > > > For the moment, I would be happy to have an off-line batch version working.
> > > >
> > > > Also, it is desirable to take advantage of the text processing features that I have already configured using solr, so I would prefer to read in the feature vectors for the documents from a lucene index, as I am doing at present (e.g. http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/).
> > > >
> > > > Thanks,
> > > > Kris
> > > >
> > > > 2010/6/8 Jake Mannix <[email protected]>
> > > >
> > > > > Hi Kris,
> > > > >
> > > > > If you generate a full document-document similarity matrix offline, and then make sure to sparsify the rows (trim off all similarities below a threshold, or only take the top N for each row, etc.), then encoding these values directly in the index would indeed allow for *superfast* MoreLikeThis functionality, because you've already computed all of the similar results offline.
> > > > >
> > > > > The only downside is that it won't apply to newly indexed documents. If your indexing setup is such that you don't fold in new documents live, but do so in batch, then this should be fine.
> > > > >
> > > > > An alternative is to use something like a Locality Sensitive Hash (something one of my co-workers is writing up a nice implementation of now, and I'm going to get him to contribute it once it's fully tested), to reduce the search space (as a lucene Filter) and speed up the query.
> > > > >
> > > > >   -jake
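To make concrete what "encoding these values directly in the index" buys at serving time: if the offline batch job writes each document's top-N neighbours into a stored field, the whole moreLikeThis call collapses to one term lookup plus a stored-field read. A hedged sketch follows; the "id" and "similar_docs" field names are made up for the example and would have to match whatever the batch (re)indexing job actually writes.

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class PrecomputedSimilarLookup {

  /** Returns the ids of the precomputed neighbours of docKey, or an empty array. */
  public static String[] similarDocs(IndexSearcher searcher, String docKey) throws IOException {
    // one exact-match query on the unique key field...
    TopDocs hits = searcher.search(new TermQuery(new Term("id", docKey)), 1);
    if (hits.totalHits == 0) {
      return new String[0];
    }
    // ...then a stored-field read; "similar_docs" is assumed to have been
    // written by the batch job, e.g. as a comma-separated list "doc42,doc911,doc7"
    Document doc = searcher.doc(hits.scoreDocs[0].doc);
    String stored = doc.get("similar_docs");
    return stored == null ? new String[0] : stored.split(",");
  }
}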
> > > > > On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack <[email protected]> wrote:
> > > > >
> > > > > > Hi Olivier,
> > > > > >
> > > > > > Thanks for your suggestions. I have over 10 million documents and they have quite a lot of meta-data associated with them, including rather large text fields. It is possible to tweak the moreLikeThis function from solr. I have tried changing the parameters (http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get results in under 300ms without sacrificing the quality of the results too much.
> > > > > >
> > > > > > I suspect that there would be gains to be made from reducing the dimensionality of the feature vectors before indexing with lucene, so I may give that a try. I'll keep you posted if I come up with other solutions.
> > > > > >
> > > > > > Thanks,
> > > > > > Kris
> > > > > >
> > > > > > 2010/6/8 Olivier Grisel <[email protected]>
> > > > > >
> > > > > > > 2010/6/8 Kris Jack <[email protected]>:
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I currently use lucene's moreLikeThis function through solr to find documents that are related to one another. A single call, however, takes around 4 seconds to complete and I would like to reduce this. I got to thinking that I might be able to use Mahout to generate a document similarity matrix offline that could then be looked-up in real time for serving. Is this a reasonable use of Mahout? If so, what functions will generate a document similarity matrix?
> > > > > > > >
> > > > > > > > Also, I would like to be able to keep the text processing advantages provided through lucene, so it would help if I could still use my lucene index. If not, then could you recommend any alternative solutions please?
> > > > > > >
> > > > > > > How many documents do you have in your index? Have you tried to tweak the MoreLikeThis parameters? (I don't know if it's possible using the solr interface, I use it directly using the lucene java API.)
> > > > > > >
> > > > > > > For instance, you can trade off recall for speed by decreasing the number of terms to use in the query, and trade recall for precision and speed by increasing the percentage of terms that should match.
> > > > > > >
> > > > > > > You could also use Mahout's implementation of SVD to build low dimensional semantic vectors representing your documents (a.k.a. Latent Semantic Indexing) and then index those transformed frequency vectors in a dedicated lucene index (or document field, provided you name the resulting terms with something that does not match real terms present in the other fields). However, using standard SVD will probably result in dense (as opposed to sparse) low dimensional semantic vectors. I don't think lucene's lookup performance is good with dense frequency vectors, even though the number of terms is greatly reduced by SVD. Hence it would probably be better to either threshold the top 100 absolute values of each semantic vector before indexing (probably the simpler solution) or use a sparsifying, penalty-constrained variant of SVD / LSI. You should have a look at the literature on sparse coding or sparse dictionary learning, Sparse-PCA and, more generally, L1 penalty regression methods such as the Lasso and LARS. I don't know of any library for sparse semantic coding of documents that works automatically with lucene; probably some non-trivial coding is needed there.
> > > > > > >
> > > > > > > Another alternative is finding low dimensional (64 or 32 components) dense codes, binary thresholding them, storing the integer codes in the DB or the lucene index, and then building smart exact-match queries to find all documents lying in the Hamming ball of size 1 or 2 around the reference document's binary code. But I think this approach, while promising for web-scale document collections, is even more experimental and requires very good low dimensional encoders (I don't think linear models such as SVD are good enough for reducing sparse 10e6-component vectors to dense 64-component vectors; non-linear encoders such as Stacked Restricted Boltzmann Machines are probably a better choice).
> > > > > > >
> > > > > > > In any case let us know about your results, I am really interested in practical yet scalable solutions to this problem.
> > > > > > >
> > > > > > > --
> > > > > > > Olivier
> > > > > > > http://twitter.com/ogrisel - http://github.com/ogrisel
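As a concrete illustration of the MoreLikeThis knobs Olivier mentions at the lucene level, here is an untested sketch. The field names and numbers are made-up examples, and the minimum-should-match trick assumes the generated query is a BooleanQuery of term clauses, which is what the contrib MoreLikeThis produces in the lucene versions around this time.

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis;
import org.apache.lucene.store.FSDirectory;

public class TunedMoreLikeThis {
  public static void main(String[] args) throws IOException {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    IndexSearcher searcher = new IndexSearcher(reader);

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] { "title", "abstract" }); // restrict to cheaper fields
    mlt.setMinTermFreq(2);      // ignore terms occurring only once in the source doc
    mlt.setMinDocFreq(5);       // ignore very rare terms
    mlt.setMaxQueryTerms(15);   // fewer query terms => faster, lower recall
    // if the fields have no term vectors, an analyzer is also needed: mlt.setAnalyzer(...)

    int sourceDoc = 12345;      // internal lucene doc id of the reference document
    Query query = mlt.like(sourceDoc);

    // Optionally require a fraction of the generated terms to match,
    // trading recall for precision and speed.
    if (query instanceof BooleanQuery) {
      BooleanQuery bq = (BooleanQuery) query;
      bq.setMinimumNumberShouldMatch((int) (bq.clauses().size() * 0.3));
    }

    TopDocs related = searcher.search(query, 10);
    System.out.println("found " + related.totalHits + " related documents");
    searcher.close();
    reader.close();
  }
}

The idea is simply that fewer, more selective terms make the generated boolean query cheaper to execute, at the cost of some recall.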

--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
