Sorry, copied the previous link from the wrong tab :/ Meant to be https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html for reading in lucene vectors.
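For anyone following along, here is a rough, self-contained sketch of what reading document vectors out of a lucene index can look like with the plain Lucene 3.x java API. The field name "text", the naive in-memory dictionary, and the class name are only illustrative; the driver described on that wiki page is the scalable way to do this.

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    // Rough sketch: turn each document's stored term vector into a Mahout sparse vector.
    public class LuceneVectorSketch {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        Map<String, Integer> dictionary = new HashMap<String, Integer>(); // term -> dimension
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
          if (reader.isDeleted(doc)) {
            continue;
          }
          // returns null unless the field was indexed with TermVector.YES
          TermFreqVector tfv = reader.getTermFreqVector(doc, "text");
          if (tfv == null) {
            continue;
          }
          String[] terms = tfv.getTerms();
          int[] freqs = tfv.getTermFrequencies();
          Vector v = new RandomAccessSparseVector(Integer.MAX_VALUE);
          for (int i = 0; i < terms.length; i++) {
            Integer dim = dictionary.get(terms[i]);
            if (dim == null) {
              dim = dictionary.size();
              dictionary.put(terms[i], dim);
            }
            v.set(dim, freqs[i]); // raw term frequency; swap in TF-IDF weights as needed
          }
          // ...write (doc id, v) out, e.g. as a sequence file of vectors, for Mahout jobs...
        }
        reader.close();
      }
    }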
2010/6/8 Kris Jack <[email protected]>

> Hi Jake,
>
> Thanks for that. The first solution that you suggest is more like what I was imagining.
>
> Please excuse me, I'm new to Mahout and don't know how to use it to generate the full document-document similarity matrix. I would rather not have to re-implement the moreLikeThis algorithm which, although rather straightforward, may take time for a newbie to MapReduce like me. Could you guide me a little in finding the relevant Mahout code for generating the matrix, or is it not really designed for that?
>
> For the moment, I would be happy to have an off-line batch version working. Also, it is desirable to take advantage of the text processing features that I have already configured using solr, so I would prefer to read in the feature vectors for the documents from a lucene index, as I am doing at present (e.g. http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/).
>
> Thanks,
> Kris
>
>
> 2010/6/8 Jake Mannix <[email protected]>
>
>> Hi Kris,
>>
>> If you generate a full document-document similarity matrix offline, and then make sure to sparsify the rows (trim off all similarities below a threshold, or only take the top N for each row, etc.), then encoding these values directly in the index would indeed allow for *superfast* MoreLikeThis functionality, because you've already computed all of the similar results offline.
>>
>> The only downside is that it won't apply to newly indexed documents. If your indexing setup is such that you don't fold in new documents live, but do so in batch, then this should be fine.
>>
>> An alternative is to use something like a Locality Sensitive Hash (something one of my co-workers is writing up a nice implementation of now, and I'm going to get him to contribute it once it's fully tested) to reduce the search space (as a lucene Filter) and speed up the query.
>>
>> -jake
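A toy, in-memory sketch of the offline step Jake describes, assuming plain dense double[] document vectors (the class and method names are made up for illustration): score every pair by cosine similarity and keep only the top N neighbours per document, which is what you would then write back into the index, e.g. as a stored field of neighbour ids. A real run over millions of documents would use Mahout's distributed jobs rather than this O(n^2) loop.

    import java.util.Arrays;
    import java.util.Comparator;

    // Toy sketch: pairwise cosine similarity with top-N sparsification per document.
    public class TopNNeighboursSketch {

      static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          normA += a[i] * a[i];
          normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
      }

      // For each document, return the indices of its n most similar other documents.
      static int[][] topNeighbours(double[][] docs, int n) {
        int[][] neighbours = new int[docs.length][];
        for (int i = 0; i < docs.length; i++) {
          final double[] sims = new double[docs.length];
          Integer[] order = new Integer[docs.length];
          for (int j = 0; j < docs.length; j++) {
            sims[j] = (i == j) ? -1.0 : cosine(docs[i], docs[j]); // never pick the doc itself
            order[j] = j;
          }
          // sort candidate ids by descending similarity and keep the first n
          Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
              return Double.compare(sims[b], sims[a]);
            }
          });
          int keep = Math.min(n, docs.length - 1);
          neighbours[i] = new int[keep];
          for (int k = 0; k < keep; k++) {
            neighbours[i][k] = order[k];
          }
        }
        return neighbours;
      }

      public static void main(String[] args) {
        double[][] docs = { {1, 0, 1, 0}, {1, 0, 1, 1}, {0, 1, 0, 1} };
        int[][] neighbours = topNeighbours(docs, 1);
        for (int i = 0; i < neighbours.length; i++) {
          System.out.println("doc " + i + " -> doc " + neighbours[i][0]);
        }
      }
    }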
>> On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack <[email protected]> wrote:
>>
>>> Hi Olivier,
>>>
>>> Thanks for your suggestions. I have over 10 million documents and they have quite a lot of meta-data associated with them, including rather large text fields. It is possible to tweak the moreLikeThis function from solr. I have tried changing the parameters (http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get results in under 300ms without sacrificing the quality of the results too much.
>>>
>>> I suspect that there would be gains to be made from reducing the dimensionality of the feature vectors before indexing with lucene, so I may give that a try. I'll keep you posted if I come up with other solutions.
>>>
>>> Thanks,
>>> Kris
>>>
>>>
>>> 2010/6/8 Olivier Grisel <[email protected]>
>>>
>>>> 2010/6/8 Kris Jack <[email protected]>:
>>>>> Hi everyone,
>>>>>
>>>>> I currently use lucene's moreLikeThis function through solr to find documents that are related to one another. A single call, however, takes around 4 seconds to complete and I would like to reduce this. I got to thinking that I might be able to use Mahout to generate a document similarity matrix offline that could then be looked up in real time for serving. Is this a reasonable use of Mahout? If so, what functions will generate a document similarity matrix? Also, I would like to be able to keep the text processing advantages provided through lucene, so it would help if I could still use my lucene index. If not, then could you recommend any alternative solutions please?
>>>>
>>>> How many documents do you have in your index? Have you tried to tweak the MoreLikeThis parameters? (I don't know if it's possible using the solr interface, I use it directly using the lucene java API.)
>>>>
>>>> For instance, you can trade off recall for speed by decreasing the number of terms to use in the query, and trade recall for precision and speed by increasing the percentage of terms that should match.
>>>>
>>>> You could also use Mahout's implementation of SVD to build low-dimensional semantic vectors representing your documents (a.k.a. Latent Semantic Indexing) and then index those transformed frequency vectors in a dedicated lucene index (or document field, provided you name the resulting terms with something that does not match real-life terms present in other fields). However, using standard SVD will probably result in dense (as opposed to sparse) low-dimensional semantic vectors. I don't think lucene's lookup performance is good with dense frequency vectors, even though the number of terms is greatly reduced by SVD. Hence it would probably be better to either threshold the top 100 absolute values of each semantic vector before indexing (probably the simpler solution) or use a sparsifying, penalty-constrained variant of SVD / LSI. You should have a look at the literature on sparse coding or sparse dictionary learning, Sparse-PCA, and more generally L1-penalty regression methods such as the Lasso and LARS. I don't know of any library for sparse semantic coding of documents that works automatically with lucene; probably some non-trivial coding is needed there.
>>>>
>>>> Another alternative is finding low-dimensional (64 or 32 components) dense codes, binary-thresholding them, storing the integer codes in the DB or the lucene index, and then building smart exact-match queries to find all documents lying in the Hamming ball of size 1 or 2 around the reference document's binary code. But I think this approach, while promising for web-scale document collections, is even more experimental and requires very good low-dimensional encoders (I don't think linear models such as SVD are good enough for reducing sparse 10e6-component vectors to dense 64-component vectors; non-linear encoders such as Stacked Restricted Boltzmann Machines are probably a better choice).
>>>>
>>>> In any case let us know about your results, I am really interested in practical yet scalable solutions to this problem.
>>>>
>>>> --
>>>> Olivier
>>>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>>
>>>
>>> --
>>> Dr Kris Jack,
>>> http://www.mendeley.com/profiles/kris-jack/
>
>
> --
> Dr Kris Jack,
> http://www.mendeley.com/profiles/kris-jack/

--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
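For reference, the MoreLikeThis knobs Olivier mentions look roughly like this when driven directly from the lucene java API (Lucene 3.x contrib). The field name and threshold values are only illustrative, and the percent-terms-to-match setting lives on the related MoreLikeThisQuery class if I remember the contrib API right.

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similar.MoreLikeThis;
    import org.apache.lucene.store.FSDirectory;

    // Illustrative sketch: tune MoreLikeThis for speed via the plain lucene API.
    public class MoreLikeThisTuningSketch {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        IndexSearcher searcher = new IndexSearcher(reader);

        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "text" }); // restrict to one analysed field
        mlt.setMaxQueryTerms(10);  // fewer query terms -> faster queries, lower recall
        mlt.setMinTermFreq(2);     // ignore terms that are rare within the source doc
        mlt.setMinDocFreq(5);      // ignore terms that are rare across the index

        int referenceDoc = Integer.parseInt(args[1]); // lucene internal doc number
        Query query = mlt.like(referenceDoc);
        TopDocs related = searcher.search(query, 20);
        System.out.println(related.totalHits + " related documents");

        searcher.close();
        reader.close();
      }
    }

Lowering setMaxQueryTerms is usually the biggest single win for latency, at the cost of recall.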
