Charikar is the definitive reference for this method. See [1] Charikar, M., "Similarity estimation techniques from rounding algorithms." In Proceedings of the 34th ACM Symposium on Theory of Computing (STOC), 2002.
I also created a simple LSH nearest-neighbor method based on this idea (refined, I think), which you can find here: McRee, Symbolic Regression using Nearest Neighbor Indexing <http://portal.acm.org/citation.cfm?id=1830841>. Access to the paper requires ACM membership--let me know if you want me to send you a copy.

The basic idea I introduced was to ignore random projections that do not fall into the TopN (funny how these ideas repeat in different contexts!). Random projections that are not in the TopN act like noise and are not useful, IMHO.

Randy McRee

On Sun, Apr 24, 2011 at 9:41 PM, Ted Dunning <[email protected]> wrote:
> Sounds like a variant of LSH to me.
>
> See the Wikipedia article on LSH with random projections.
>
> On Sun, Apr 24, 2011 at 8:56 PM, Lance Norskog <[email protected]> wrote:
> >
> > I just found this vector distance idea in a technical paper:
> >
> > Create a space defined by X random vectors. For your data vectors,
> > take the cosine distance to each random vector and save the sign of
> > the value as a bit. This gives a bit set of X bits.
> > There could be a different distance measure and algorithm for picking the bit value.
> >
> > The effect is to stop using numerical vectors as a "carrier signal"
> > for the concept of "positions and distances". This is a different,
> > more focused representation. And Hamming distance is somewhat faster
> > than Euclidean :) Of course, picking enough bits is a problem.
> >
> > However, I lost the paper. Does this ring a bell with anyone?
> >
> > --
> > Lance Norskog
> > [email protected]
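To make the mechanics concrete, here is a minimal Python/NumPy sketch of the random-hyperplane signature scheme described in the thread above. This is my own illustration, not code from either paper; the function names are mine, and topn_mask is only my reading of the TopN idea, so treat it as a sketch rather than the reference method.

import numpy as np

def make_hyperplanes(num_bits, dim, seed=0):
    # One random direction (hyperplane normal) per signature bit.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_bits, dim))

def signature(vec, planes):
    # The sign of the dot product equals the sign of the cosine,
    # so no normalization is needed to extract each bit.
    return (planes @ vec) >= 0

def hamming(sig_a, sig_b):
    # Hamming distance between two bit signatures.
    return int(np.count_nonzero(sig_a != sig_b))

def topn_mask(vec, planes, n):
    # Hypothetical sketch of the TopN refinement: keep only the n
    # projections with the largest magnitude, treating the
    # low-magnitude (noisy) bits as unusable.
    proj = planes @ vec
    mask = np.zeros(len(proj), dtype=bool)
    mask[np.argsort(-np.abs(proj))[:n]] = True
    return mask

# Nearby vectors agree on most bits; an opposite vector on almost none.
planes = make_hyperplanes(num_bits=64, dim=100)
rng = np.random.default_rng(1)
a = rng.standard_normal(100)
b = a + 0.1 * rng.standard_normal(100)   # small perturbation of a
print(hamming(signature(a, planes), signature(b, planes)))   # small
print(hamming(signature(a, planes), signature(-a, planes)))  # 64

Charikar's analysis [1] is what makes this work: for two vectors at angle theta, each bit differs with probability theta/pi, so the normalized Hamming distance between signatures is an unbiased estimate of the angular distance.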
