On Tue, Jun 8, 2010 at 3:56 PM, Olivier Grisel <[email protected]> wrote:
> 2010/6/8 Jake Mannix <[email protected]>:
> > Hi Kris,
> >
> > If you generate a full document-document similarity matrix offline, and
> > then make sure to sparsify the rows (trim off all similarities below a
> > threshold, or only take the top N for each row, etc.), then encoding
> > these values directly in the index would indeed allow for *superfast*
> > MoreLikeThis functionality, because you've already computed all of the
> > similar results offline.
>
> For 1e6 documents it might not be reasonable to generate the complete
> document-document similarity matrix: 1e12 components => a couple of
> terabytes of similarity values just to find the top N afterwards.

Nope, this isn't what happens in what I described: when you take
sparseDocumentMatrix.transpose().times(itself), the scaling does not go as
N^2 * M, with N^2 outputs - the calculation is sparse, computing only the
entries which are nonzero.

If you pre-sparsify the documents a little (remove all terms which occur in
more than 1% of all documents, or something like that), this sparse
calculation is even faster - it scales as sum_{i=1..N} (k_i)^2, where k_i
is the number of nonzero elements in document i. If all documents have the
same length k, this scales as N*k^2, and the total number of nonzero
entries in the output is far less than N^2 when k << N, M.

Getting rid of the common terms (even *lots* of them) beforehand is still a
very good idea.

  -jake
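P.S. For concreteness, here is a rough single-machine sketch of the idea in
plain Java (all names illustrative - this is the inverted-index formulation
of the sparse self-product, not Mahout's distributed code): build postings
of term -> (doc, weight), drop terms whose posting lists cover more than 1%
of the corpus, and accumulate sim(i, j) only for pairs of documents that
share a surviving term. The work is then the sum of squared posting-list
lengths, the column-wise counterpart of the per-document sum above, which
is exactly why dropping the common terms pays off. For N = 1e6 documents of
k = 100 terms each, that is on the order of 1e10 multiplies rather than
1e12 document pairs.

import java.util.*;

/** Sketch only: sparse doc-doc similarity via an inverted index. */
public class SparseSelfSimilarity {

  static final class Posting {
    final int doc;
    final double weight;
    Posting(int doc, double weight) { this.doc = doc; this.weight = weight; }
  }

  /** docs.get(i) maps term -> weight (e.g. TF-IDF) for document i. */
  public static List<Map<Integer, Double>> similarities(
      List<Map<String, Double>> docs, double maxDfFraction, int topN) {

    int n = docs.size();

    // Inverted index: term -> list of (doc, weight) postings.
    Map<String, List<Posting>> index = new HashMap<>();
    for (int i = 0; i < n; i++) {
      for (Map.Entry<String, Double> e : docs.get(i).entrySet()) {
        index.computeIfAbsent(e.getKey(), t -> new ArrayList<>())
             .add(new Posting(i, e.getValue()));
      }
    }

    // Pre-sparsify: drop terms occurring in more than maxDfFraction of
    // the corpus. The loop below does (posting length)^2 work per term,
    // so the most common terms dominate the cost.
    index.values().removeIf(postings -> postings.size() > maxDfFraction * n);

    // Accumulate only the nonzero entries of the doc-doc product:
    // sim(i, j) += w_ti * w_tj for each pair of docs sharing term t.
    List<Map<Integer, Double>> sims = new ArrayList<>(n);
    for (int i = 0; i < n; i++) sims.add(new HashMap<>());
    for (List<Posting> postings : index.values()) {
      for (Posting a : postings) {
        for (Posting b : postings) {
          if (a.doc != b.doc) {
            sims.get(a.doc).merge(b.doc, a.weight * b.weight, Double::sum);
          }
        }
      }
    }

    // Trim each row to its top-N entries so the result stays sparse
    // enough to encode directly in the index for MoreLikeThis lookups.
    List<Map<Integer, Double>> trimmed = new ArrayList<>(n);
    for (Map<Integer, Double> row : sims) {
      List<Map.Entry<Integer, Double>> entries = new ArrayList<>(row.entrySet());
      entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
      Map<Integer, Double> top = new LinkedHashMap<>();
      for (Map.Entry<Integer, Double> e
          : entries.subList(0, Math.min(topN, entries.size()))) {
        top.put(e.getKey(), e.getValue());
      }
      trimmed.add(top);
    }
    return trimmed;
  }
}

In the real thing you'd shard this by term across Hadoop, of course; the
point is just that nothing ever touches the N^2 zero entries.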
