2010/6/8 Jake Mannix <[email protected]>:
> Hi Kris,
>
> If you generate a full document-document similarity matrix offline, and
> then make sure to sparsify the rows (trim off all similarities below a
> threshold, or only take the top N for each row, etc...). Then encoding
> these values directly in the index would indeed allow for *superfast*
> MoreLikeThis functionality, because you've already computed all
> of the similar results offline.

For 1e6 documents it might not be reasonable to generate the complete
document-document similarity matrix: 1e12 components => a couple of
terabytes of similarity values just to find the top N afterwards.
Sorting a terabyte of data can be fast when you have a datacenter like
Yahoo's or Google's, but it might not be reasonable when you just have a
CMS running on a couple of servers :)

Trimming off low similarities should happen before starting to write
the rows to the hard drive.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
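
To make the "trim before writing" point concrete, here is a minimal sketch in plain Python. It is not Lucene or Mahout code: the function names (`cosine`, `top_n_rows`, `dense_matrix_bytes`), the dict-based sparse vectors and the 4-bytes-per-value figure are all assumptions made for illustration. Each row is reduced to its top N entries in memory, so the full matrix never has to be stored.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    if dot == 0.0:
        return 0.0
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


def top_n_rows(docs, n=10, threshold=0.0):
    """Yield (doc_id, [(similarity, other_id), ...]) one row at a time.

    Only the n best entries above `threshold` are kept for each row,
    so a caller can write each trimmed row to disk as it is produced.
    """
    for i, doc in enumerate(docs):
        candidates = (
            (cosine(doc, other), j)
            for j, other in enumerate(docs)
            if j != i
        )
        # Drop low similarities first, then keep the n largest.
        kept = heapq.nlargest(n, (c for c in candidates if c[0] > threshold))
        yield i, kept


def dense_matrix_bytes(num_docs, bytes_per_value=4):
    """Storage for the untrimmed matrix (4-byte floats are an assumption)."""
    return num_docs * num_docs * bytes_per_value
```

With these assumptions, `dense_matrix_bytes(10**6)` gives 4e12 bytes, i.e. about 4 terabytes, while the trimmed output holds at most N entries per document. Note that this sketch only avoids the storage and the big sort; it still compares every pair of documents, so the computation itself remains quadratic.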
