Some minor testing, based on previous mail threads (thanks Ted-I-mean-everyone).
A Locality-Sensitive-Hash space is defined by 50 random vectors (values 0->1). A sample vector's position in the LSH space is a bitset: each entry is the sign of the cosine similarity (equivalently, the dot product) with that LSH vector. Thus every LSH hash has 50 bits. This space, of course, has no negative values.

Test data: 1000 random vectors as samples, all values 0->1, linear (uniform) distribution. This test data gives no negative cosine similarities, and so all bits are 0. This is expected (from previous mail threads).

Now, define the random vectors in the space as the dimension-wise delta between two random vectors. This gives vector values ranging from -0.5 to 0.5. The same test of 1000 random vectors gives Hamming distances of 935 to 950 (non-matching) bits. Here is the OnlineSummarizer output:

    [(count=1000.0),(sd=4.15466),(mean=938.856),(min=928.0),(25%=935.935),(median=938.932),(75%=941.725),(max=951.0),]

Changing the sample vectors from linear to Gaussian, the distances range from 915 to 960, and the middle quartiles are farther from the median:

    [(count=1000.0),(sd=5.52686),(mean=935.912),(min=916.0),(25%=931.871),(median=935.810),(75%=939.730),(max=961.0),]

Given the central limit theorem, blah blah blah, this seems appropriate.

Should I spin a patch? Where would this go? I have dense & sparse implementations, a Vector wrapper, and some unit tests.

Next week in Jerusalem,
Lance Norskog
[email protected]
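
P.S. For concreteness, here is a rough sketch of the hashing step described above. It assumes Mahout's Vector/DenseVector; the class and method names (RandomHyperplaneHasher, uniformHyperplanes, deltaHyperplanes) are illustrative, not the actual patch code.

    import java.util.BitSet;
    import java.util.Random;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    // Sketch: sign-of-dot-product LSH over random hyperplane vectors.
    // The sign of the cosine similarity equals the sign of the dot product
    // (norms are positive), so the dot product is enough to set each bit.
    public class RandomHyperplaneHasher {
      private final Vector[] hyperplanes;

      public RandomHyperplaneHasher(Vector[] hyperplanes) {
        this.hyperplanes = hyperplanes;
      }

      // One bit per hyperplane: set if the dot product is negative, else clear.
      public BitSet hash(Vector sample) {
        BitSet bits = new BitSet(hyperplanes.length);
        for (int i = 0; i < hyperplanes.length; i++) {
          if (sample.dot(hyperplanes[i]) < 0) {
            bits.set(i);
          }
        }
        return bits;
      }

      // All-positive hyperplanes (components in 0..1): every dot product with
      // an all-positive sample is >= 0, so every bit stays 0.
      public static Vector[] uniformHyperplanes(int count, int dim, Random rng) {
        Vector[] planes = new Vector[count];
        for (int i = 0; i < count; i++) {
          double[] v = new double[dim];
          for (int d = 0; d < dim; d++) {
            v[d] = rng.nextDouble();
          }
          planes[i] = new DenseVector(v);
        }
        return planes;
      }

      // "Delta" hyperplanes: dimension-wise difference of two uniform random
      // vectors, so components can be negative and the bits carry information.
      public static Vector[] deltaHyperplanes(int count, int dim, Random rng) {
        Vector[] planes = new Vector[count];
        for (int i = 0; i < count; i++) {
          double[] v = new double[dim];
          for (int d = 0; d < dim; d++) {
            v[d] = rng.nextDouble() - rng.nextDouble();
          }
          planes[i] = new DenseVector(v);
        }
        return planes;
      }
    }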
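
And a sketch of the measurement loop, feeding Hamming distances between pairs of sample hashes into OnlineSummarizer (using the hasher sketched above). The dimensionality, pairing scheme, and seed are guesses for illustration and won't reproduce the exact distributions quoted; swap rng.nextDouble() for rng.nextGaussian() in the sample generator for the Gaussian case.

    import java.util.BitSet;
    import java.util.Random;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.stats.OnlineSummarizer;

    public class HammingDistanceTest {
      public static void main(String[] args) {
        int dim = 100;          // sample dimensionality (illustrative)
        int numPlanes = 50;     // bits per hash, as in the description
        int numSamples = 1000;
        Random rng = new Random(42);

        Vector[] planes = RandomHyperplaneHasher.deltaHyperplanes(numPlanes, dim, rng);
        RandomHyperplaneHasher hasher = new RandomHyperplaneHasher(planes);

        OnlineSummarizer summary = new OnlineSummarizer();
        for (int i = 0; i < numSamples; i++) {
          // Hamming distance between the hashes of two independent samples:
          // XOR the bitsets and count the non-matching bits.
          BitSet a = hasher.hash(uniformSample(dim, rng));
          BitSet b = hasher.hash(uniformSample(dim, rng));
          BitSet diff = (BitSet) a.clone();
          diff.xor(b);
          summary.add(diff.cardinality());
        }

        System.out.printf(
            "count=%d mean=%.3f sd=%.3f min=%.1f 25%%=%.3f median=%.3f 75%%=%.3f max=%.1f%n",
            summary.getCount(), summary.getMean(), summary.getSD(), summary.getMin(),
            summary.getQuartile(1), summary.getMedian(), summary.getQuartile(3), summary.getMax());
      }

      // Uniform 0->1 sample vector; the linear-distribution test data.
      static Vector uniformSample(int dim, Random rng) {
        double[] v = new double[dim];
        for (int d = 0; d < dim; d++) {
          v[d] = rng.nextDouble();
        }
        return new DenseVector(v);
      }
    }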
