For document-like objects, search using text retrieval-like techniques (but in batch) works well.
For reduced-dimension document-like things, you need to go to alternative methods to do full-scale nearest neighbor computations. With a strong metric like L_2, you can do all-points nearest neighbors in n log n time. You can also fairly simply get an approximate answer using locality-sensitive hashing (LSH) or a k-means search. Both are supported by the new knn package (see github). LSH does not fix the quadratic issue all at once, but with a bit of sorting it can come pretty close. k-means search also does a good job.

In most cases, what you need is not the single absolutely truly closest item. Instead you usually need a list of k items that has good overlap with the true nearest k items. You may also need the distribution of distances to the returned k items to be similar to the distribution of distances to the actual nearest k items. LSH and k-means search can be tuned to give you this for pretty large data sets pretty efficiently. For instance, I expect that 20 million queries against 25 million data points can be done in several hours on a ten- to hundred-node cluster (depending on your data).

On Fri, Jul 13, 2012 at 1:32 PM, Sean Owen <[email protected]> wrote:
> (Yes it would be pretty simple to hack it up to only compute for a
> certain list of known items, not all pairs. That ought to change the
> scaling factor dramatically as it would be more like linear in the
> number of items. Brute force is probably not so bad here.)
>
> On Fri, Jul 13, 2012 at 10:47 PM, Pat Ferrel <[email protected]> wrote:
> > What I really need is to calculate the k most similar docs to a short
> > list, known ahead of time. I don't know of an algorithm to do this
> > (other than brute force). It would take a relatively small set of docs
> > and find similar docs in a much much larger set. Rowsimilarity finds
> > all pair-wise similarities. Strictly speaking I need only a tiny
> > number of those.
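For what it's worth, here is a toy, single-table sketch of the random-hyperplane flavor of LSH, just to show the shape of the idea. This is not the knn package's API; the class name, method names, and the single-table / Euclidean-rerank choices are all made up for illustration. A real version would use several hash tables (or probe neighboring buckets) to get the recall you want, and would store item ids rather than raw vectors.

import java.util.*;

/** Toy single-table LSH (random hyperplane / signed projection flavor). */
public class LshSketch {

  private final double[][] planes;   // one random hyperplane per hash bit
  private final Map<Integer, List<double[]>> buckets = new HashMap<Integer, List<double[]>>();

  LshSketch(int bits, int dim, long seed) {
    Random rnd = new Random(seed);
    planes = new double[bits][dim];
    for (double[] p : planes) {
      for (int i = 0; i < dim; i++) {
        p[i] = rnd.nextGaussian();
      }
    }
  }

  /** Bucket key: one bit per hyperplane, set by the sign of the projection. */
  int hash(double[] v) {
    int key = 0;
    for (int b = 0; b < planes.length; b++) {
      if (dot(planes[b], v) >= 0) {
        key |= 1 << b;
      }
    }
    return key;
  }

  void add(double[] v) {
    int key = hash(v);
    List<double[]> bucket = buckets.get(key);
    if (bucket == null) {
      bucket = new ArrayList<double[]>();
      buckets.put(key, bucket);
    }
    bucket.add(v);
  }

  /** Approximate k nearest: brute force only inside the query's bucket. */
  List<double[]> nearest(final double[] q, int k) {
    List<double[]> bucket = buckets.get(hash(q));
    if (bucket == null) {
      return new ArrayList<double[]>();
    }
    List<double[]> candidates = new ArrayList<double[]>(bucket);
    Collections.sort(candidates, new Comparator<double[]>() {
      public int compare(double[] a, double[] b) {
        return Double.compare(distance(q, a), distance(q, b));
      }
    });
    return candidates.subList(0, Math.min(k, candidates.size()));
  }

  static double dot(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += a[i] * b[i];
    return s;
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(s);
  }
}

The point is just that each query only gets brute-forced against its own bucket, which is what knocks the cost down from quadratic.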

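And for the short-known-list case Sean mentions, brute force really is about as simple as it sounds: score each query doc against every doc in the big set and keep the top k with a small heap, which is linear in the size of the big set per query. A rough sketch, with names invented for illustration and dense vectors with cosine similarity assumed; with sparse term vectors you would iterate the non-zeros instead.

import java.util.*;

/** Score one query against every doc in a large set, keeping the top k. */
public class TopKBruteForce {

  /** Returns indices of the k corpus vectors most similar (by cosine) to the query. */
  static int[] topK(double[] query, double[][] corpus, int k) {
    // Min-heap on similarity, so the worst of the current top k sits on top.
    PriorityQueue<double[]> heap = new PriorityQueue<double[]>(k, new Comparator<double[]>() {
      public int compare(double[] a, double[] b) {
        return Double.compare(a[1], b[1]);   // entry = {index, similarity}
      }
    });
    for (int i = 0; i < corpus.length; i++) {
      double sim = cosine(query, corpus[i]);
      if (heap.size() < k) {
        heap.add(new double[] {i, sim});
      } else if (sim > heap.peek()[1]) {
        heap.poll();
        heap.add(new double[] {i, sim});
      }
    }
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = (int) heap.poll()[0];      // pop worst-first, so the best match ends up first
    }
    return result;
  }

  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
  }
}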