Re: Generating a Document Similarity Matrix

Olivier Grisel Tue, 08 Jun 2010 15:57:26 -0700

2010/6/8 Jake Mannix <[email protected]>:
> Hi Kris,
>
>  If you generate a full document-document similarity matrix offline, and
> then make sure to sparsify the rows (trim off all similarities below a
> threshold, or only take the top N for each row, etc.), then encoding
> these values directly in the index would indeed allow for *superfast*
> MoreLikeThis functionality, because you've already computed all
> of the similar results offline.


For 1e6 documents it might not be reasonable to generate the complete
document-document similarity matrix: 1e12 components => a couple of
terabytes of similarity values just to find the top N afterwards.
Sorting a terabyte of data can be fast when you have a datacenter like
Yahoo's or Google's, but might not be reasonable when you just have a
CMS running on a couple of servers :)
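
As a rough check of that estimate, here is the arithmetic as a small
sketch. It takes one million documents (which is what 1e12 components
implies) and assumes each similarity is stored as a 4-byte float; the
function name and the 4-byte width are my assumptions, not something
Mahout or Lucene prescribes.

```python
def dense_matrix_bytes(n_docs, bytes_per_value=4):
    """Size in bytes of a dense n_docs x n_docs similarity matrix.

    bytes_per_value=4 assumes single-precision floats (an assumption).
    """
    return n_docs * n_docs * bytes_per_value


n_docs = 10**6                       # one million documents
total = dense_matrix_bytes(n_docs)   # 1e12 components * 4 bytes each
print(total / 1e12, "TB")            # size expressed in terabytes
```

Exploiting the symmetry of the matrix would halve this, which is still
in the terabyte range.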

Trimming off low similarities should happen before starting to write
the rows to the hard drive.
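
Something along these lines is what I have in mind: a minimal sketch
in plain Python, not Mahout code. Each row is trimmed in memory (drop
everything below a threshold, keep at most the top N) and only the
trimmed row is handed to whatever writes it out. The names `cosine`,
`trimmed_rows`, `top_n` and `threshold` are made up for illustration,
and documents are assumed to be sparse {term: weight} dicts.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def trimmed_rows(docs, top_n=10, threshold=0.1):
    """Yield (doc_index, [(other_index, score), ...]) one row at a time.

    Only the trimmed row is ever materialized, so the caller can write
    it out immediately without storing the full matrix.
    """
    for i, doc in enumerate(docs):
        # Lazily score document i against every other document.
        scored = (
            (cosine(doc, other), j)
            for j, other in enumerate(docs)
            if j != i
        )
        # Drop low similarities first, then keep the top_n best.
        kept = heapq.nlargest(
            top_n, (pair for pair in scored if pair[0] >= threshold)
        )
        yield i, [(j, score) for score, j in kept]
```

Note that this still does the 1e12 pairwise comparisons; it only
avoids storing and sorting them all. Memory per row stays bounded by
top_n.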

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
