For document-like objects, search using text retrieval-like techniques (but in batch) works well.
For reduced-dimension document-like things, you need to go to alternative methods to do full-scale nearest neighbor computations. With a strong metric like L_2, you can do all-points nearest neighbors in n log n time. You can also fairly simply get an approximate answer using locality-sensitive hashing (LSH) or a k-means search. Both are supported by the new knn package (see github). LSH does not fix the quadratic issue all at once, but with a bit of sorting it can come pretty close. k-means search also does a good job.

In most cases, what you need is not the single absolutely truly closest item. Instead you usually need a list of k items that has good overlap with the true nearest k items. You may also need the distribution of distances to the returned k items to be similar to the distribution of distances to the actual nearest k items. LSH and k-means search can be tuned to give you this for pretty large data sets pretty efficiently. For instance, I expect that 20 million queries against 25 million data points can be done in several hours on a ten- to hundred-node cluster (depending on your data).

On Fri, Jul 13, 2012 at 1:32 PM, Sean Owen <[email protected]> wrote:
> (Yes it would be pretty simple to hack it up to only compute for a
> certain list of known items, not all pairs. That ought to change the
> scaling factor dramatically as it would be more like linear in the
> number of items. Brute force is probably not so bad here.)
>
> On Fri, Jul 13, 2012 at 10:47 PM, Pat Ferrel <[email protected]> wrote:
> > What I really need is to calculate the k most similar docs to a short
> > list, known ahead of time. I don't know of an algorithm to do this
> > (other than brute force). It would take a relatively small set of docs
> > and find similar docs in a much much larger set. Rowsimilarity finds
> > all pair-wise similarities. Strictly speaking I need only a tiny
> > number of those.
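For what it's worth, here is a toy, single-table sketch of the random-hyperplane flavor of LSH, just to show the shape of the idea. This is not the knn package's API; the class name, method names, and the single-table / Euclidean-rerank choices are all made up for illustration. A real version would use several hash tables (or probe neighboring buckets) to get the recall you want, and would store item ids rather than raw vectors.

import java.util.*;

/** Toy single-table LSH (random hyperplane / signed projection flavor). */
public class LshSketch {

  private final double[][] planes;   // one random hyperplane per hash bit
  private final Map<Integer, List<double[]>> buckets = new HashMap<Integer, List<double[]>>();

  LshSketch(int bits, int dim, long seed) {
    Random rnd = new Random(seed);
    planes = new double[bits][dim];
    for (double[] p : planes) {
      for (int i = 0; i < dim; i++) {
        p[i] = rnd.nextGaussian();
      }
    }
  }

  /** Bucket key: one bit per hyperplane, set by the sign of the projection. */
  int hash(double[] v) {
    int key = 0;
    for (int b = 0; b < planes.length; b++) {
      if (dot(planes[b], v) >= 0) {
        key |= 1 << b;
      }
    }
    return key;
  }

  void add(double[] v) {
    int key = hash(v);
    List<double[]> bucket = buckets.get(key);
    if (bucket == null) {
      bucket = new ArrayList<double[]>();
      buckets.put(key, bucket);
    }
    bucket.add(v);
  }

  /** Approximate k nearest: brute force only inside the query's bucket. */
  List<double[]> nearest(final double[] q, int k) {
    List<double[]> bucket = buckets.get(hash(q));
    if (bucket == null) {
      return new ArrayList<double[]>();
    }
    List<double[]> candidates = new ArrayList<double[]>(bucket);
    Collections.sort(candidates, new Comparator<double[]>() {
      public int compare(double[] a, double[] b) {
        return Double.compare(distance(q, a), distance(q, b));
      }
    });
    return candidates.subList(0, Math.min(k, candidates.size()));
  }

  static double dot(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += a[i] * b[i];
    return s;
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(s);
  }
}

The point is just that each query only gets brute-forced against its own bucket, which is what knocks the cost down from quadratic.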

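And for the short-known-list case Sean mentions, brute force really is about as simple as it sounds: score each query doc against every doc in the big set and keep the top k with a small heap, which is linear in the size of the big set per query. A rough sketch, with names invented for illustration and dense vectors with cosine similarity assumed; with sparse term vectors you would iterate the non-zeros instead.

import java.util.*;

/** Score one query against every doc in a large set, keeping the top k. */
public class TopKBruteForce {

  /** Returns indices of the k corpus vectors most similar (by cosine) to the query. */
  static int[] topK(double[] query, double[][] corpus, int k) {
    // Min-heap on similarity, so the worst of the current top k sits on top.
    PriorityQueue<double[]> heap = new PriorityQueue<double[]>(k, new Comparator<double[]>() {
      public int compare(double[] a, double[] b) {
        return Double.compare(a[1], b[1]);   // entry = {index, similarity}
      }
    });
    for (int i = 0; i < corpus.length; i++) {
      double sim = cosine(query, corpus[i]);
      if (heap.size() < k) {
        heap.add(new double[] {i, sim});
      } else if (sim > heap.peek()[1]) {
        heap.poll();
        heap.add(new double[] {i, sim});
      }
    }
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = (int) heap.poll()[0];      // pop worst-first, so the best match ends up first
    }
    return result;
  }

  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
  }
}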