Hi Jake,

Thanks for that. The first solution that you suggest is more like what
I was imagining.
Please excuse me, I'm new to Mahout and don't know how to use it to
generate the full document-document similarity matrix. I would rather
not have to re-implement the moreLikeThis algorithm which, although
rather straightforward, may take time for a newbie to MapReduce like
me. Could you guide me a little in finding the relevant Mahout code
for generating the matrix, or is it not really designed for that? For
the moment, I would be happy to have an off-line batch version
working.

Also, it is desirable to take advantage of the text processing
features that I have already configured using Solr, so I would prefer
to read in the feature vectors for the documents from a Lucene index,
as I am doing at present (e.g.
http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/).

Thanks,
Kris

2010/6/8 Jake Mannix <[email protected]>

> Hi Kris,
>
> If you generate a full document-document similarity matrix offline,
> and then make sure to sparsify the rows (trim off all similarities
> below a threshold, or only take the top N for each row, etc.), then
> encoding these values directly in the index would indeed allow for
> *superfast* MoreLikeThis functionality, because you've already
> computed all of the similar results offline.
>
> The only downside is that it won't apply to newly indexed documents.
> If your indexing setup is such that you don't fold in new documents
> live, but do so in batch, then this should be fine.
>
> An alternative is to use something like a Locality Sensitive Hash
> (something one of my co-workers is writing up a nice implementation
> of now, and I'm going to get him to contribute it once it's fully
> tested) to reduce the search space (as a Lucene Filter) and speed up
> the query.
>
> -jake
>
> On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack <[email protected]> wrote:
>
> > Hi Olivier,
> >
> > Thanks for your suggestions.
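[Editor's sketch: Jake's first suggestion above, an offline document-document
similarity matrix with sparsified rows, could look roughly like the following.
This is a toy illustration in plain Python over small term-weight dicts, not
Mahout code; the function names `cosine` and `sparsified_similarity` are made
up for the example, and a real 10M-document job would be done in MapReduce.]

```python
import math

def cosine(a, b):
    # a, b: sparse term -> weight dicts (e.g. tf-idf from a Lucene index)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sparsified_similarity(docs, top_n=50, threshold=0.1):
    # docs: doc_id -> sparse vector.
    # Returns doc_id -> [(other_id, sim)], keeping only the top_n
    # neighbours above the threshold, as Jake suggests, so the result
    # can be encoded compactly in the index for fast lookup.
    result = {}
    for i, vi in docs.items():
        sims = [(j, cosine(vi, vj)) for j, vj in docs.items() if j != i]
        sims = [(j, s) for j, s in sims if s >= threshold]
        sims.sort(key=lambda p: p[1], reverse=True)
        result[i] = sims[:top_n]
    return result
```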
> > I have over 10 million documents and they have quite a lot of
> > metadata associated with them, including rather large text fields.
> > It is possible to tweak the moreLikeThis function from Solr. I have
> > tried changing the parameters
> > (http://wiki.apache.org/solr/MoreLikeThis) but am not managing to
> > get results in under 300ms without sacrificing the quality of the
> > results too much.
> >
> > I suspect that there would be gains to be made from reducing the
> > dimensionality of the feature vectors before indexing with Lucene,
> > so I may give that a try. I'll keep you posted if I come up with
> > other solutions.
> >
> > Thanks,
> > Kris
> >
> > 2010/6/8 Olivier Grisel <[email protected]>
> >
> > > 2010/6/8 Kris Jack <[email protected]>:
> > > > Hi everyone,
> > > >
> > > > I currently use Lucene's moreLikeThis function through Solr to
> > > > find documents that are related to one another. A single call,
> > > > however, takes around 4 seconds to complete and I would like to
> > > > reduce this. I got to thinking that I might be able to use
> > > > Mahout to generate a document similarity matrix offline that
> > > > could then be looked up in real time for serving. Is this a
> > > > reasonable use of Mahout? If so, what functions will generate a
> > > > document similarity matrix? Also, I would like to be able to
> > > > keep the text processing advantages provided through Lucene, so
> > > > it would help if I could still use my Lucene index. If not,
> > > > then could you recommend any alternative solutions please?
> > >
> > > How many documents do you have in your index? Have you tried to
> > > tweak the MoreLikeThis parameters?
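[Editor's sketch: the Solr MoreLikeThis handler exposes its tuning knobs as
`mlt.*` request parameters documented on the wiki page linked above. A hedged
illustration of building such a request follows; the host, core, and field
names (`title`, `abstract`) are made up for the example and should be checked
against your own schema and Solr version.]

```python
from urllib.parse import urlencode

def mlt_url(doc_id, max_query_terms=10, min_term_freq=2, min_doc_freq=5):
    # Build a MoreLikeThis request URL. Lowering mlt.maxqt trades recall
    # for speed (fewer query terms); mlt.mintf / mlt.mindf drop terms that
    # are rare in the document or in the corpus.
    params = {
        "q": "id:%s" % doc_id,
        "mlt": "true",
        "mlt.fl": "title,abstract",    # fields to mine for "interesting" terms
        "mlt.maxqt": max_query_terms,  # max number of query terms generated
        "mlt.mintf": min_term_freq,    # min term frequency in the document
        "mlt.mindf": min_doc_freq,     # min document frequency in the corpus
        "rows": 10,
    }
    return "http://localhost:8983/solr/select?" + urlencode(params)
```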
> > > (I don't know if it's possible using the Solr interface; I use it
> > > directly via the Lucene Java API.)
> > >
> > > For instance, you can trade off recall for speed by decreasing
> > > the number of terms to use in the query, and trade recall for
> > > precision and speed by increasing the percentage of terms that
> > > should match.
> > >
> > > You could also use Mahout's implementation of SVD to build low
> > > dimensional semantic vectors representing your documents (a.k.a.
> > > Latent Semantic Indexing) and then index those transformed
> > > frequency vectors in a dedicated Lucene index (or document field,
> > > provided you name the resulting terms with something that does
> > > not match real-life terms present in other fields). However,
> > > using standard SVD will probably result in dense (as opposed to
> > > sparse) low dimensional semantic vectors. I don't think Lucene's
> > > lookup performance is good with dense frequency vectors, even
> > > though the number of terms is greatly reduced by SVD. Hence it
> > > would probably be better either to keep only the top 100 absolute
> > > values of each semantic vector before indexing (probably the
> > > simpler solution) or to use a sparsifying, penalty-constrained
> > > variant of SVD / LSI. You should have a look at the literature on
> > > sparse coding or sparse dictionary learning, Sparse-PCA and, more
> > > generally, L1 penalty regression methods such as the Lasso and
> > > LARS. I don't know of any library for sparse semantic coding of
> > > documents that works automatically with Lucene; probably some
> > > non-trivial coding is needed there.
> > >
> > > Another alternative is finding low dimensional (64 or 32
> > > components) dense codes, binary thresholding them, storing the
> > > integer codes in the DB or the Lucene index, and then building
> > > smart exact match queries to find all documents lying in the
> > > Hamming ball of size 1 or 2 of the reference document's binary
> > > code.
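[Editor's sketch: the simpler of Olivier's two sparsification options, keeping
only the top 100 absolute values of each dense semantic vector before
indexing, might look like this in plain Python. Illustrative only, not Mahout
code; `sparsify_top_k` is a made-up name.]

```python
def sparsify_top_k(vector, k=100):
    # vector: a dense semantic vector (list of floats) produced by SVD/LSI.
    # Keep the k components with the largest |value|; zero out the rest,
    # so the result indexes like a sparse term vector in Lucene.
    if k >= len(vector):
        return list(vector)
    ranked = sorted(range(len(vector)),
                    key=lambda i: abs(vector[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]
```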
> > > But I think this approach, while promising for web scale document
> > > collections, is even more experimental and requires very good low
> > > dimensional encoders (I don't think linear models such as SVD are
> > > good enough for reducing sparse 10e6-component vectors to dense
> > > 64-component vectors; non-linear encoders such as Stacked
> > > Restricted Boltzmann Machines are probably a better choice).
> > >
> > > In any case, let us know about your results; I am really
> > > interested in practical yet scalable solutions to this problem.
> > >
> > > --
> > > Olivier
> > > http://twitter.com/ogrisel - http://github.com/ogrisel
> >
> > --
> > Dr Kris Jack,
> > http://www.mendeley.com/profiles/kris-jack/

--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
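[Editor's sketch: the Hamming-ball lookup Olivier describes amounts to
enumerating every code within Hamming distance 1 or 2 of the reference
document's binary code and issuing an exact-match query per code. The
enumeration is small: for 64-bit codes and radius 2 it is 1 + 64 + 2016
codes. A hedged plain-Python illustration, with made-up names:]

```python
from itertools import combinations

def hamming_ball(code, n_bits=64, radius=2):
    # Return all integer codes within Hamming distance <= radius of `code`.
    # Each result could become one exact-match term query on the index
    # field holding the document's binary code.
    out = [code]
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = code
            for b in bits:
                flipped ^= (1 << b)  # flip this bit position
            out.append(flipped)
    return out
```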
