I think, if your input vector is a set, the ele.get() should be used,
instead, if your input vector is a sparse vector, the ele.index() would be
used.
Pls correct me if I'm wrong.
for (int i = 0; i < numHashFunctions; i++) {
for (Vector.Element ele : featureVector) {
/// Shouldn't the following line say ele.index();
int value = (int) ele.get();
bytesToHash[0] = (byte) (value >> 24);
bytesToHash[1] = (byte) (value >> 16);
bytesToHash[2] = (byte) (value >> 8);
bytesToHash[3] = (byte) value;
int hashIndex = hashFunction[i].hash(bytesToHash);
if (minHashValues[i] > hashIndex) {
minHashValues[i] = hashIndex;
}
}
}
On Fri, Jun 10, 2011 at 6:53 AM, Jeff Hansen <[email protected]> wrote:
> I'm having a little trouble understanding Mahout's minhash implementation.
>
> Please correct me if I'm wrong, but the general intent of minhash is to
> evaluate the similarity of two sparse feature vectors based on the features
> they have in common, not necessarily the value of those features (as the
> values are often 1 or 0 and 0 values simply aren't tracked in the sparse
> vector). So given a space of 10 dimensions, if Jack had features 4 and 6
> and Jill had features 5 and 6, Jacks vector would look something like {4:1,
> 6:1} and Jill's would like like {5:1, 6:1}. Since they have 1/3 total
> features in common, their Jaccard coefficient is 1/3. Also, given K random
> hash functions, we would expect about a third of them to return a minimum
> value for each of the three keys 4, 5 and 6 and thus about a third of them
> would also return the same minimum value for {4, 6} and {5, 6} (i.e. the
> third that return a minimum hash value for the key 6). That's my basic
> English explanation of the purpose of minhash -- again, somebody please
> correct me if I'm wrong.
>
> Given that understanding, can somebody explain why Mahout's minhash
> implentation is hasing the values from the feature vectors rather than the
> keys?
>
> See the following code from MinHashMapper.java
>
> for (int i = 0; i < numHashFunctions; i++) {
> for (Vector.Element ele : featureVector) {
>
> /// Shouldn't the following line say ele.index();
> int value = (int) ele.get();
>
> bytesToHash[0] = (byte) (value >> 24);
> bytesToHash[1] = (byte) (value >> 16);
> bytesToHash[2] = (byte) (value >> 8);
> bytesToHash[3] = (byte) value;
> int hashIndex = hashFunction[i].hash(bytesToHash);
> if (minHashValues[i] > hashIndex) {
> minHashValues[i] = hashIndex;
> }
> }
> }
>
> The code in TestMinHashClustering also seems to written with the
> expectation
> that minhash should be hashing the values rather than the keys. Am I
> reading something wrong here? Is this the intended use? Are we supposed
> to
> be putting the feature ids into the double value fields of the feature
> vectors?
>
> Thanks,
> Jeff
>
--