Hi Olivier,

Thanks for your suggestions. I have over 10 million documents and they have quite a lot of metadata associated with them, including rather large text fields. It is possible to tweak the MoreLikeThis function from solr. I have tried changing the parameters (http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get results in under 300 ms without sacrificing the quality of the results too much.
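For reference, the kind of request I have been experimenting with looks roughly like this (assuming the standalone MoreLikeThisHandler mounted at /mlt; the field names are placeholders rather than our actual schema, and the numbers are just where I have ended up so far):

  http://localhost:8983/solr/mlt?q=id:12345
    &mlt.fl=title,abstract     (fields mined for "interesting" terms)
    &mlt.mintf=2               (ignore terms occurring fewer than 2 times in the source doc)
    &mlt.mindf=5               (ignore terms appearing in fewer than 5 documents)
    &mlt.maxqt=25              (cap on generated query terms; the main speed/recall knob)
    &mlt.minwl=3               (skip very short words)
    &rows=10&fl=id,score

Lowering mlt.maxqt and raising mlt.mintf/mlt.mindf does speed things up, but beyond a certain point the neighbours stop looking related, which is the wall I keep hitting.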
I suspect that there would be gains to be made from reducing the dimensionality of the feature vectors before indexing with lucene, so I may give that a try (I have put a rough sketch of what I mean in a P.S. at the bottom of this mail). I'll keep you posted if I come up with other solutions.

Thanks,
Kris

2010/6/8 Olivier Grisel <[email protected]>:
> 2010/6/8 Kris Jack <[email protected]>:
> > Hi everyone,
> >
> > I currently use lucene's moreLikeThis function through solr to find
> > documents that are related to one another. A single call, however, takes
> > around 4 seconds to complete and I would like to reduce this. I got to
> > thinking that I might be able to use Mahout to generate a document
> > similarity matrix offline that could then be looked up in real time for
> > serving. Is this a reasonable use of Mahout? If so, what functions will
> > generate a document similarity matrix? Also, I would like to be able to
> > keep the text processing advantages provided through lucene, so it would
> > help if I could still use my lucene index. If not, then could you
> > recommend any alternative solutions please?
>
> How many documents do you have in your index? Have you tried to tweak
> the MoreLikeThis parameters? (I don't know whether that is possible through
> the solr interface; I use it directly through the lucene java API.)
>
> For instance, you can trade recall for speed by decreasing the number of
> terms to use in the query, and trade recall for precision and speed by
> increasing the percentage of terms that should match.
>
> You could also use Mahout's implementation of SVD to build low dimensional
> semantic vectors representing your documents (a.k.a. Latent Semantic
> Indexing) and then index those transformed frequency vectors in a dedicated
> lucene index (or document field, provided you name the resulting terms with
> something that does not match real-life terms present in the other fields).
> However, standard SVD will probably result in dense (as opposed to sparse)
> low dimensional semantic vectors, and I don't think lucene's lookup
> performance is good with dense frequency vectors, even though the number of
> terms is greatly reduced by SVD. Hence it would probably be better either
> to threshold the top 100 absolute values of each semantic vector before
> indexing (probably the simpler solution) or to use a sparsifying,
> penalty-constrained variant of SVD / LSI. You should have a look at the
> literature on sparse coding or sparse dictionary learning, Sparse PCA and,
> more generally, L1-penalty regression methods such as the Lasso and LARS. I
> don't know of any library for sparse semantic coding of documents that
> works out of the box with lucene; some non-trivial coding is probably
> needed there.
>
> Another alternative is finding low dimensional (64 or 32 component) dense
> codes, binary thresholding them, storing the integer codes in the DB or the
> lucene index, and then building smart exact-match queries to find all
> documents lying in the Hamming ball of radius 1 or 2 around the reference
> document's binary code. But I think this approach, while promising for
> web-scale document collections, is even more experimental and requires very
> good low dimensional encoders (I don't think linear models such as SVD are
> good enough for reducing sparse 10e6-component vectors to dense
> 64-component vectors; non-linear encoders such as stacked Restricted
> Boltzmann Machines are probably a better choice).
>
> In any case, let us know about your results; I am really interested in
> practical yet scalable solutions to this problem.
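Re: tweaking the parameters directly via the lucene java API: just to check I am reading you right, I imagine the direct route would look roughly like the sketch below. The field names and numbers are placeholders, and the minimum-should-match cast at the end is my guess at what you mean by the percentage of terms that should match:

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.search.similar.MoreLikeThis;

  public class RelatedDocs {
      /** Rough sketch: shrink the generated query, then require a fraction of its terms to match. */
      public static TopDocs findRelated(IndexReader reader, int docId) throws IOException {
          MoreLikeThis mlt = new MoreLikeThis(reader);
          mlt.setFieldNames(new String[] {"title", "abstract"}); // placeholder field names
          mlt.setMinTermFreq(2);    // drop terms that are rare in the source document
          mlt.setMinDocFreq(5);     // drop terms that are rare in the whole collection
          mlt.setMaxQueryTerms(25); // fewer query terms: faster, but lower recall

          Query query = mlt.like(docId);
          if (query instanceof BooleanQuery) {
              BooleanQuery bq = (BooleanQuery) query;
              // require roughly 30% of the generated terms to match (precision and speed)
              bq.setMinimumNumberShouldMatch((int) (bq.clauses().size() * 0.3));
          }
          return new IndexSearcher(reader).search(query, 10);
      }
  }

Solr does seem to expose the same knobs as mlt.* parameters (that is what I was tuning above), so I suspect the bottleneck is the size of the generated query rather than the interface.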
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel

--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
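P.S. To make the dimensionality-reduction idea concrete, here is roughly what I had in mind for your "threshold the top 100 absolute values" suggestion: sparsify each dense SVD/LSI vector by keeping only its k largest components by absolute value, and index those as synthetic terms in a dedicated field. The token naming scheme and the choice of k are made up for illustration, and none of this is tested:

  import java.util.Arrays;
  import java.util.Comparator;

  public class SparsifySvdVector {

      /** Indices of the k components with the largest absolute values. */
      static Integer[] topKByAbsValue(final double[] vector, int k) {
          Integer[] indices = new Integer[vector.length];
          for (int i = 0; i < vector.length; i++) indices[i] = i;
          Arrays.sort(indices, new Comparator<Integer>() {
              public int compare(Integer a, Integer b) {
                  return Double.compare(Math.abs(vector[b]), Math.abs(vector[a]));
              }
          });
          return Arrays.copyOf(indices, Math.min(k, indices.length));
      }

      /** Synthetic "terms" (e.g. "lsa_17") for a dedicated lucene field, one per kept component. */
      static String toSyntheticTokens(double[] vector, int k) {
          StringBuilder tokens = new StringBuilder();
          for (int idx : topKByAbsValue(vector, k)) {
              tokens.append("lsa_").append(idx).append(' '); // placeholder naming scheme
          }
          return tokens.toString().trim();
      }
  }

If something like 100 components per document survive, the resulting field should stay sparse enough for lucene's posting lists; whether plain term matching on these tokens preserves enough of the similarity is what I would have to experiment with.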
