Hi Jake,

Thanks for that. The first solution that you suggest is more like what
I was imagining.
Please excuse me, I'm new to Mahout and don't know how to use it to
generate the full document-document similarity matrix. I would rather
not have to re-implement the moreLikeThis algorithm which, although
rather straightforward, may take time for a newbie to MapReduce like
me. Could you guide me a little in finding the relevant Mahout code
for generating the matrix, or is it not really designed for that? For
the moment, I would be happy to have an off-line batch version
working.

Also, it is desirable to take advantage of the text processing
features that I have already configured using Solr, so I would prefer
to read in the feature vectors for the documents from a Lucene index,
as I am doing at present (e.g.
http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/).

Thanks,
Kris

2010/6/8 Jake Mannix <[email protected]>

> Hi Kris,
>
> If you generate a full document-document similarity matrix offline,
> and then make sure to sparsify the rows (trim off all similarities
> below a threshold, or only take the top N for each row, etc.), then
> encoding these values directly in the index would indeed allow for
> *superfast* MoreLikeThis functionality, because you've already
> computed all of the similar results offline.
>
> The only downside is that it won't apply to newly indexed documents.
> If your indexing setup is such that you don't fold in new documents
> live, but do so in batch, then this should be fine.
>
> An alternative is to use something like a Locality Sensitive Hash
> (something one of my co-workers is writing up a nice implementation
> of now, and I'm going to get him to contribute it once it's fully
> tested) to reduce the search space (as a Lucene Filter) and speed up
> the query.
>
> -jake
>
> On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack <[email protected]> wrote:
>
> > Hi Olivier,
> >
> > Thanks for your suggestions.
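[Editor's sketch: Jake's first suggestion above, an offline document-document
similarity matrix with sparsified rows, could look roughly like the following.
This is a toy illustration in plain Python over small term-weight dicts, not
Mahout code; the function names `cosine` and `sparsified_similarity` are made
up for the example, and a real 10M-document job would be done in MapReduce.]

```python
import math

def cosine(a, b):
    # a, b: sparse term -> weight dicts (e.g. tf-idf from a Lucene index)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sparsified_similarity(docs, top_n=50, threshold=0.1):
    # docs: doc_id -> sparse vector.
    # Returns doc_id -> [(other_id, sim)], keeping only the top_n
    # neighbours above the threshold, as Jake suggests, so the result
    # can be encoded compactly in the index for fast lookup.
    result = {}
    for i, vi in docs.items():
        sims = [(j, cosine(vi, vj)) for j, vj in docs.items() if j != i]
        sims = [(j, s) for j, s in sims if s >= threshold]
        sims.sort(key=lambda p: p[1], reverse=True)
        result[i] = sims[:top_n]
    return result
```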
> > I have over 10 million documents and they have quite a lot of
> > metadata associated with them, including rather large text fields.
> > It is possible to tweak the moreLikeThis function from Solr. I have
> > tried changing the parameters
> > (http://wiki.apache.org/solr/MoreLikeThis) but am not managing to
> > get results in under 300ms without sacrificing the quality of the
> > results too much.
> >
> > I suspect that there would be gains to be made from reducing the
> > dimensionality of the feature vectors before indexing with Lucene,
> > so I may give that a try. I'll keep you posted if I come up with
> > other solutions.
> >
> > Thanks,
> > Kris
> >
> > 2010/6/8 Olivier Grisel <[email protected]>
> >
> > > 2010/6/8 Kris Jack <[email protected]>:
> > > > Hi everyone,
> > > >
> > > > I currently use Lucene's moreLikeThis function through Solr to
> > > > find documents that are related to one another. A single call,
> > > > however, takes around 4 seconds to complete and I would like to
> > > > reduce this. I got to thinking that I might be able to use
> > > > Mahout to generate a document similarity matrix offline that
> > > > could then be looked up in real time for serving. Is this a
> > > > reasonable use of Mahout? If so, what functions will generate a
> > > > document similarity matrix? Also, I would like to be able to
> > > > keep the text processing advantages provided through Lucene, so
> > > > it would help if I could still use my Lucene index. If not,
> > > > then could you recommend any alternative solutions please?
> > >
> > > How many documents do you have in your index? Have you tried to
> > > tweak the MoreLikeThis parameters?
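[Editor's sketch: the Solr MoreLikeThis handler exposes its tuning knobs as
`mlt.*` request parameters documented on the wiki page linked above. A hedged
illustration of building such a request follows; the host, core, and field
names (`title`, `abstract`) are made up for the example and should be checked
against your own schema and Solr version.]

```python
from urllib.parse import urlencode

def mlt_url(doc_id, max_query_terms=10, min_term_freq=2, min_doc_freq=5):
    # Build a MoreLikeThis request URL. Lowering mlt.maxqt trades recall
    # for speed (fewer query terms); mlt.mintf / mlt.mindf drop terms that
    # are rare in the document or in the corpus.
    params = {
        "q": "id:%s" % doc_id,
        "mlt": "true",
        "mlt.fl": "title,abstract",    # fields to mine for "interesting" terms
        "mlt.maxqt": max_query_terms,  # max number of query terms generated
        "mlt.mintf": min_term_freq,    # min term frequency in the document
        "mlt.mindf": min_doc_freq,     # min document frequency in the corpus
        "rows": 10,
    }
    return "http://localhost:8983/solr/select?" + urlencode(params)
```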
> > > (I don't know if it's possible using the Solr interface; I use it
> > > directly via the Lucene Java API.)
> > >
> > > For instance, you can trade off recall for speed by decreasing
> > > the number of terms to use in the query, and trade recall for
> > > precision and speed by increasing the percentage of terms that
> > > should match.
> > >
> > > You could also use Mahout's implementation of SVD to build low
> > > dimensional semantic vectors representing your documents (a.k.a.
> > > Latent Semantic Indexing) and then index those transformed
> > > frequency vectors in a dedicated Lucene index (or document field,
> > > provided you name the resulting terms with something that does
> > > not match real-life terms present in other fields). However,
> > > using standard SVD will probably result in dense (as opposed to
> > > sparse) low dimensional semantic vectors. I don't think Lucene's
> > > lookup performance is good with dense frequency vectors, even
> > > though the number of terms is greatly reduced by SVD. Hence it
> > > would probably be better either to keep only the top 100 absolute
> > > values of each semantic vector before indexing (probably the
> > > simpler solution) or to use a sparsifying, penalty-constrained
> > > variant of SVD / LSI. You should have a look at the literature on
> > > sparse coding or sparse dictionary learning, Sparse-PCA and, more
> > > generally, L1 penalty regression methods such as the Lasso and
> > > LARS. I don't know of any library for sparse semantic coding of
> > > documents that works automatically with Lucene; probably some
> > > non-trivial coding is needed there.
> > >
> > > Another alternative is finding low dimensional (64 or 32
> > > components) dense codes, binary thresholding them, storing the
> > > integer codes in the DB or the Lucene index, and then building
> > > smart exact match queries to find all documents lying in the
> > > Hamming ball of size 1 or 2 of the reference document's binary
> > > code.
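[Editor's sketch: the simpler of Olivier's two sparsification options, keeping
only the top 100 absolute values of each dense semantic vector before
indexing, might look like this in plain Python. Illustrative only, not Mahout
code; `sparsify_top_k` is a made-up name.]

```python
def sparsify_top_k(vector, k=100):
    # vector: a dense semantic vector (list of floats) produced by SVD/LSI.
    # Keep the k components with the largest |value|; zero out the rest,
    # so the result indexes like a sparse term vector in Lucene.
    if k >= len(vector):
        return list(vector)
    ranked = sorted(range(len(vector)),
                    key=lambda i: abs(vector[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]
```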
> > > But I think this approach, while promising for web scale document
> > > collections, is even more experimental and requires very good low
> > > dimensional encoders (I don't think linear models such as SVD are
> > > good enough for reducing sparse 10e6-component vectors to dense
> > > 64-component vectors; non-linear encoders such as Stacked
> > > Restricted Boltzmann Machines are probably a better choice).
> > >
> > > In any case, let us know about your results; I am really
> > > interested in practical yet scalable solutions to this problem.
> > >
> > > --
> > > Olivier
> > > http://twitter.com/ogrisel - http://github.com/ogrisel
> >
> > --
> > Dr Kris Jack,
> > http://www.mendeley.com/profiles/kris-jack/

--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
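[Editor's sketch: the Hamming-ball lookup Olivier describes amounts to
enumerating every code within Hamming distance 1 or 2 of the reference
document's binary code and issuing an exact-match query per code. The
enumeration is small: for 64-bit codes and radius 2 it is 1 + 64 + 2016
codes. A hedged plain-Python illustration, with made-up names:]

```python
from itertools import combinations

def hamming_ball(code, n_bits=64, radius=2):
    # Return all integer codes within Hamming distance <= radius of `code`.
    # Each result could become one exact-match term query on the index
    # field holding the document's binary code.
    out = [code]
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = code
            for b in bits:
                flipped ^= (1 << b)  # flip this bit position
            out.append(flipped)
    return out
```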
