2010/6/8 Kris Jack <[email protected]>:
> Hi everyone,
>
> I currently use lucene's moreLikeThis function through solr to find
> documents that are related to one another.  A single call, however, takes
> around 4 seconds to complete and I would like to reduce this.  I got to
> thinking that I might be able to use Mahout to generate a document
> similarity matrix offline that could then be looked-up in real time for
> serving.  Is this a reasonable use of Mahout?  If so, what functions will
> generate a document similarity matrix?  Also, I would like to be able to
> keep the text processing advantages provided through lucene so it would help
> if I could still use my lucene index.  If not, then could you recommend any
> alternative solutions please?

How many documents do you have in your index? Have you tried tweaking
the MoreLikeThis parameters? (I don't know whether that's possible
through the Solr interface; I use it directly through the Lucene Java
API.)

For instance you can trade off recall for speed by decreasing the
number of terms to use in the query and trade recall for precision and
speed by increasing the percentage of terms that should match.
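If you go through Solr's HTTP interface, the MoreLikeThisHandler exposes these knobs as request parameters. A minimal sketch (the endpoint URL, field name and document id here are assumptions for illustration, not from your setup):

```python
from urllib.parse import urlencode

# Hypothetical Solr MLT endpoint; adjust host/core/handler to your deployment.
SOLR_MLT_URL = "http://localhost:8983/solr/mlt"

def build_mlt_params(doc_id, max_query_terms=10, min_term_freq=2, min_doc_freq=5):
    """Build request parameters for Solr's MoreLikeThisHandler.

    Lowering mlt.maxqt shortens the generated query (faster, lower recall);
    raising mlt.mintf / mlt.mindf prunes rare terms before they reach the query.
    """
    return {
        "q": "id:%s" % doc_id,      # assumes a unique-key field named "id"
        "mlt.fl": "text",           # field(s) to mine for interesting terms
        "mlt.maxqt": max_query_terms,
        "mlt.mintf": min_term_freq,
        "mlt.mindf": min_doc_freq,
        "rows": 10,
    }

params = build_mlt_params("doc42", max_query_terms=5)
url = SOLR_MLT_URL + "?" + urlencode(params)
```

Timing the same reference document at a few values of mlt.maxqt should tell you quickly how much of the 4 seconds is query length.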

You could also use Mahout's implementation of SVD to build
low-dimensional semantic vectors representing your documents (a.k.a.
Latent Semantic Indexing) and then index those transformed frequency
vectors in a dedicated Lucene index (or in a document field, provided
you name the resulting terms so that they cannot collide with real
terms present in other fields). However, standard SVD will probably
produce dense (as opposed to sparse) low-dimensional semantic vectors,
and I don't think Lucene's lookup performance is good on dense
frequency vectors, even though SVD greatly reduces the number of
terms. Hence it would probably be better either to threshold each
semantic vector, keeping only its top 100 absolute values before
indexing (probably the simpler solution), or to use a
sparsity-penalized variant of SVD / LSI. You should have a look at the
literature on sparse coding and sparse dictionary learning, Sparse
PCA, and more generally L1-penalized regression methods such as the
Lasso and LARS. I don't know of any library for sparse semantic coding
of documents that works out of the box with Lucene; some non-trivial
coding is probably needed there.
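To make the thresholding step concrete, here is a toy sketch using plain NumPy on a small random matrix (this stands in for Mahout's distributed SVD, and keeps only 3 components per vector instead of 100, purely for readability):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a sparse term-frequency matrix: 20 docs x 200 terms.
tf = rng.poisson(0.1, size=(20, 200)).astype(float)

# Rank-k truncated SVD: each row of doc_vecs is a dense semantic vector.
k = 10
U, s, Vt = np.linalg.svd(tf, full_matrices=False)
doc_vecs = U[:, :k] * s[:k]

def sparsify_top_n(vec, n):
    """Keep only the n largest-magnitude components, zero out the rest."""
    out = np.zeros_like(vec)
    idx = np.argsort(np.abs(vec))[-n:]
    out[idx] = vec[idx]
    return out

# Sparsified vectors: these could then be indexed as synthetic terms.
sparse_vecs = np.array([sparsify_top_n(v, 3) for v in doc_vecs])
```

The surviving component indices would become the synthetic term names (e.g. "lsi_7") and the component values their weights, so Lucene's sparse inverted index stays efficient.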

Another alternative is to find low-dimensional (64 or 32 components)
dense codes, binary-threshold them, store the resulting integer codes
in the DB or the Lucene index, and then build smart exact-match
queries to find all documents lying within the Hamming ball of radius
1 or 2 around the reference document's binary code. While promising
for web-scale document collections, I think this approach is even more
experimental and requires very good low-dimensional encoders (I don't
think linear models such as SVD are good enough for reducing sparse
vectors with ~10e6 components down to dense 64-component vectors;
non-linear encoders such as stacked Restricted Boltzmann Machines are
probably a better choice).
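A toy sketch of the Hamming-ball probing, assuming the dense codes already exist (random vectors stand in here for the output of a real trained encoder):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
# Pretend these are 64-component dense codes produced by some encoder.
codes = rng.normal(size=(100, 64))

def binarize(vec):
    """Threshold at 0 and pack the sign bits into a single integer code."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

# Exact-match "index": binary code -> list of doc ids (in practice a DB
# column or a Lucene field holding the integer code).
index = {}
for doc_id, vec in enumerate(codes):
    index.setdefault(binarize(vec), []).append(doc_id)

def hamming_ball_lookup(code, radius):
    """Probe every code within the given Hamming radius of `code`.

    At radius 1 this is 64 extra exact-match probes; at radius 2,
    64 + 64*63/2 = 2080 probes -- still cheap for an exact-match index.
    """
    hits = list(index.get(code, []))
    for r in range(1, radius + 1):
        for flips in itertools.combinations(range(64), r):
            probe = code
            for b in flips:
                probe ^= 1 << b
            hits.extend(index.get(probe, []))
    return hits

q = binarize(codes[0])
neighbors = hamming_ball_lookup(q, radius=1)
```

The whole similarity lookup then reduces to a handful of exact-match queries per request, which is what makes the approach attractive at web scale.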

In any case let us know about your results, I am really interested in
practical yet scalable solutions to this problem.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
