Re: MinHash implementation

Jeff Hansen Tue, 16 Aug 2011 09:13:40 -0700

I just looked at the initial JIRA to create this implementation and saw the
example code that uses it --
https://issues.apache.org/jira/browse/MAHOUT-344


The LastfmDataConverter class is indeed creating a vector with the indices
stored in the values and spurious information stored in the indices:

        Vector featureVector = new
SequentialAccessSparseVector(numfeatures);
        int i = 0;
        for (Integer feature : itemFeature.getValue()) {
          featureVector.setQuick(i++, feature);
        }

As such, that explains why the implementation was written the way it was.  I
imagine the implementer just didn't understand the point of sparse vectors
as they might as well have used a dense vector given their implementation.
 I think this would be best reimplemented using the actual indices of sparse
vectors for consistency -- most of the utilities to generate vectors follow
that approach (seq2sparse).

By the way, given that I never got any responses two months ago when I asked
this question, I'm assuming this is probably the wrong mailing list for
something like this -- should I have sent this to the developers mailing
list?  Or would it have been better just to go ahead and submit a JIRA?

Thanks!

On Tue, Aug 16, 2011 at 3:08 AM, Sean Owen <[email protected]> wrote:

> I'm not the authoritative voice here, but I would also agree with your
> interpretation -- it's indices rather than values that I'd use.
> I can imagine using min-hash on values, but that would not seem to be
> the most natural thing to do.
>
> (I don't understand the comment about set and get(). Vectors aren't
> sets, and whether it's sparse or not shouldn't decide whether you want
> values or indices.)
>
> On Tue, Aug 16, 2011 at 7:23 AM, 刘鎏 <[email protected]> wrote:
> > I think, if your input vector is a set, the ele.get() should be used,
> > instead, if your input vector is a sparse vector, the ele.index() would
> be
> > used.
> >
> > Pls correct me if I'm wrong.
> >
> >  for (int i = 0; i < numHashFunctions; i++) {
> >     for (Vector.Element ele : featureVector) {
> >
> > /// Shouldn't the following line say ele.index();
> >       int value = (int) ele.get();
> >
> >       bytesToHash[0] = (byte) (value >> 24);
> >       bytesToHash[1] = (byte) (value >> 16);
> >       bytesToHash[2] = (byte) (value >> 8);
> >       bytesToHash[3] = (byte) value;
> >       int hashIndex = hashFunction[i].hash(bytesToHash);
> >       if (minHashValues[i] > hashIndex) {
> >         minHashValues[i] = hashIndex;
> >       }
> >     }
> >   }
> >
> > On Fri, Jun 10, 2011 at 6:53 AM, Jeff Hansen <[email protected]> wrote:
> >
> >> I'm having a little trouble understanding Mahout's minhash
> implementation.
> >>
> >> Please correct me if I'm wrong, but the general intent of minhash is to
> >> evaluate the similarity of two sparse feature vectors based on the
> features
> >> they have in common, not necessarily the value of those features (as the
> >> values are often 1 or 0 and 0 values simply aren't tracked in the sparse
> >> vector).  So given a space of 10 dimensions, if Jack had features 4 and
> 6
> >> and Jill had features 5 and 6, Jacks vector would look something like
> {4:1,
> >> 6:1} and Jill's would like like {5:1, 6:1}.  Since they have 1/3 total
> >> features in common, their Jaccard coefficient is 1/3.  Also, given K
> random
> >> hash functions, we would expect about a third of them to return a
> minimum
> >> value for each of the three keys 4, 5 and 6 and thus about a third of
> them
> >> would also return the same minimum value for {4, 6} and {5, 6} (i.e. the
> >> third that return a minimum hash value for the key 6).  That's my basic
> >> English explanation of the purpose of minhash -- again, somebody please
> >> correct me if I'm wrong.
> >>
> >> Given that understanding, can somebody explain why Mahout's minhash
> >> implentation is hasing the values from the feature vectors rather than
> the
> >> keys?
> >>
> >> See the following code from MinHashMapper.java
> >>
> >>    for (int i = 0; i < numHashFunctions; i++) {
> >>      for (Vector.Element ele : featureVector) {
> >>
> >> /// Shouldn't the following line say ele.index();
> >>        int value = (int) ele.get();
> >>
> >>        bytesToHash[0] = (byte) (value >> 24);
> >>        bytesToHash[1] = (byte) (value >> 16);
> >>        bytesToHash[2] = (byte) (value >> 8);
> >>        bytesToHash[3] = (byte) value;
> >>        int hashIndex = hashFunction[i].hash(bytesToHash);
> >>        if (minHashValues[i] > hashIndex) {
> >>          minHashValues[i] = hashIndex;
> >>        }
> >>      }
> >>    }
> >>
> >> The code in TestMinHashClustering also seems to written with the
> >> expectation
> >> that minhash should be hashing the values rather than the keys.  Am I
> >> reading something wrong here?  Is this the intended use?  Are we
> supposed
> >> to
> >> be putting the feature ids into the double value fields of the feature
> >> vectors?
> >>
> >> Thanks,
> >> Jeff
> >>
> >
> >
> >
> > --
> >
>

Re: MinHash implementation

Reply via email to