I just looked at the original JIRA issue that introduced this implementation
and at the example code that uses it:
https://issues.apache.org/jira/browse/MAHOUT-344
The LastfmDataConverter class is indeed creating a vector with the feature
ids stored in the values and a meaningless running counter stored in the
indices:
    Vector featureVector = new SequentialAccessSparseVector(numfeatures);
    int i = 0;
    for (Integer feature : itemFeature.getValue()) {
      featureVector.setQuick(i++, feature);
    }
That explains why the implementation was written the way it was. I imagine
the implementer just didn't understand the point of sparse vectors, since
with this layout they might as well have used a dense vector.
I think this would be best reimplemented to hash the actual indices of the
sparse vectors, for consistency -- most of the utilities that generate
vectors (seq2sparse, for example) follow that approach.
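To make the index-based approach concrete, here is a minimal sketch in plain
Java with no Mahout dependency. It is not the Mahout code: the class name
MinHashOnIndices, its method names, the 3000 permutations and the seed 42 are
all my own illustrative choices, and I use explicit random permutations of
the 10 dimensions as a stand-in for Mahout's hash functions. It only shows
that taking the minimum over the feature ids (the indices) gives an estimate
close to the Jaccard coefficient for the Jack and Jill example quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class MinHashOnIndices {

    /** Exact Jaccard coefficient: size of intersection / size of union. */
    public static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    /**
     * Assumption: each "hash function" is a random permutation of the ids
     * 0..dimensions-1, so perms[i][id] is the rank function i gives to id.
     */
    public static int[][] randomPermutations(int numHashes, int dimensions, long seed) {
        Random rnd = new Random(seed);
        List<Integer> ranks = new ArrayList<>();
        for (int d = 0; d < dimensions; d++) {
            ranks.add(d);
        }
        int[][] perms = new int[numHashes][dimensions];
        for (int i = 0; i < numHashes; i++) {
            Collections.shuffle(ranks, rnd);
            for (int d = 0; d < dimensions; d++) {
                perms[i][d] = ranks.get(d);
            }
        }
        return perms;
    }

    /** For each permutation, keep the smallest rank among the feature ids (the indices). */
    public static int[] signature(Set<Integer> featureIds, int[][] perms) {
        int[] minHashValues = new int[perms.length];
        Arrays.fill(minHashValues, Integer.MAX_VALUE);
        for (int i = 0; i < perms.length; i++) {
            for (int id : featureIds) {
                int hashed = perms[i][id];
                if (minHashValues[i] > hashed) {
                    minHashValues[i] = hashed;
                }
            }
        }
        return minHashValues;
    }

    /** Fraction of positions where the two signatures agree. */
    public static double estimate(int[] s1, int[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }

    public static void main(String[] args) {
        Set<Integer> jack = new HashSet<>(Arrays.asList(4, 6));
        Set<Integer> jill = new HashSet<>(Arrays.asList(5, 6));
        int[][] perms = randomPermutations(3000, 10, 42L);
        System.out.println("exact Jaccard    = " + jaccard(jack, jill));
        System.out.println("minhash estimate = "
            + estimate(signature(jack, perms), signature(jill, perms)));
    }
}
```

Two signatures agree at position i exactly when the lowest-ranked id in the
union of the two sets is one they share, which is why the agreement rate
tracks the Jaccard coefficient. Hashing the stored values instead only gives
the same result when the values happen to be the feature ids, which is what
LastfmDataConverter arranges.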
By the way, given that I never got any responses two months ago when I asked
this question, I'm assuming this is probably the wrong mailing list for
something like this -- should I have sent this to the developers mailing
list? Or would it have been better just to go ahead and submit a JIRA?
Thanks!
On Tue, Aug 16, 2011 at 3:08 AM, Sean Owen <[email protected]> wrote:
> I'm not the authoritative voice here, but I would also agree with your
> interpretation -- it's indices rather than values that I'd use.
> I can imagine using min-hash on values, but that would not seem to be
> the most natural thing to do.
>
> (I don't understand the comment about set and get(). Vectors aren't
> sets, and whether it's sparse or not shouldn't decide whether you want
> values or indices.)
>
> On Tue, Aug 16, 2011 at 7:23 AM, 刘鎏 <[email protected]> wrote:
> > I think, if your input vector is a set, the ele.get() should be used,
> > instead, if your input vector is a sparse vector, the ele.index() would
> > be used.
> >
> > Pls correct me if I'm wrong.
> >
> > for (int i = 0; i < numHashFunctions; i++) {
> >   for (Vector.Element ele : featureVector) {
> >
> >     /// Shouldn't the following line say ele.index();
> >     int value = (int) ele.get();
> >
> >     bytesToHash[0] = (byte) (value >> 24);
> >     bytesToHash[1] = (byte) (value >> 16);
> >     bytesToHash[2] = (byte) (value >> 8);
> >     bytesToHash[3] = (byte) value;
> >     int hashIndex = hashFunction[i].hash(bytesToHash);
> >     if (minHashValues[i] > hashIndex) {
> >       minHashValues[i] = hashIndex;
> >     }
> >   }
> > }
> >
> > On Fri, Jun 10, 2011 at 6:53 AM, Jeff Hansen <[email protected]> wrote:
> >
> >> I'm having a little trouble understanding Mahout's minhash
> >> implementation.
> >>
> >> Please correct me if I'm wrong, but the general intent of minhash is to
> >> evaluate the similarity of two sparse feature vectors based on the
> >> features they have in common, not necessarily the value of those features
> >> (as the values are often 1 or 0 and 0 values simply aren't tracked in the
> >> sparse vector). So given a space of 10 dimensions, if Jack had features 4
> >> and 6 and Jill had features 5 and 6, Jack's vector would look something
> >> like {4:1, 6:1} and Jill's would look like {5:1, 6:1}. Since they have 1/3
> >> of their total features in common, their Jaccard coefficient is 1/3. Also,
> >> given K random hash functions, we would expect about a third of them to
> >> return a minimum value for each of the three keys 4, 5 and 6 and thus
> >> about a third of them would also return the same minimum value for {4, 6}
> >> and {5, 6} (i.e. the third that return a minimum hash value for the key
> >> 6). That's my basic English explanation of the purpose of minhash --
> >> again, somebody please correct me if I'm wrong.
> >>
> >> Given that understanding, can somebody explain why Mahout's minhash
> >> implementation is hashing the values from the feature vectors rather than
> >> the keys?
> >>
> >> See the following code from MinHashMapper.java
> >>
> >> for (int i = 0; i < numHashFunctions; i++) {
> >>   for (Vector.Element ele : featureVector) {
> >>
> >>     /// Shouldn't the following line say ele.index();
> >>     int value = (int) ele.get();
> >>
> >>     bytesToHash[0] = (byte) (value >> 24);
> >>     bytesToHash[1] = (byte) (value >> 16);
> >>     bytesToHash[2] = (byte) (value >> 8);
> >>     bytesToHash[3] = (byte) value;
> >>     int hashIndex = hashFunction[i].hash(bytesToHash);
> >>     if (minHashValues[i] > hashIndex) {
> >>       minHashValues[i] = hashIndex;
> >>     }
> >>   }
> >> }
> >>
> >> The code in TestMinHashClustering also seems to be written with the
> >> expectation that minhash should be hashing the values rather than the
> >> keys. Am I reading something wrong here? Is this the intended use? Are we
> >> supposed to be putting the feature ids into the double value fields of the
> >> feature vectors?
> >>
> >> Thanks,
> >> Jeff
> >>