It's the right place -- best-effort question-answering is not always that good.

A JIRA is a good thing if you have a specific idea of the issue /
enhancement and ideally a proposed patch. That is tracked with some
loose regularity and so might get more attention.

On Tue, Aug 16, 2011 at 5:13 PM, Jeff Hansen <[email protected]> wrote:
> I just looked at the initial JIRA to create this implementation and saw the
> example code that uses it --
> https://issues.apache.org/jira/browse/MAHOUT-344
>
> The LastfmDataConverter class is indeed creating a vector with the indices
> stored in the values and spurious information stored in the indices:
>
>        Vector featureVector = new SequentialAccessSparseVector(numfeatures);
>        int i = 0;
>        for (Integer feature : itemFeature.getValue()) {
>          featureVector.setQuick(i++, feature);
>        }
>
> As such, that explains why the implementation was written the way it was.  I
> imagine the implementer just didn't understand the point of sparse vectors
> as they might as well have used a dense vector given their implementation.
>  I think this would be best reimplemented using the actual indices of sparse
> vectors for consistency -- most of the utilities to generate vectors follow
> that approach (seq2sparse).
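For what it's worth, the index-as-feature-id pattern described above can be sketched without Mahout at all -- here a plain `Map` stands in for the sparse vector, and `asSparse` is a made-up helper name, not anything from the Mahout API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class SparseVectorSketch {

    // Stand-in for a sparse vector: index -> value. In the proposed fix the
    // feature id itself becomes the index, and the value just marks presence.
    static Map<Integer, Double> asSparse(Set<Integer> featureIds) {
        Map<Integer, Double> v = new TreeMap<>();
        for (int id : featureIds) {
            v.put(id, 1.0);
        }
        return v;
    }

    public static void main(String[] args) {
        // Jack has features 4 and 6 out of a 10-dimensional space.
        Map<Integer, Double> jack = asSparse(new HashSet<>(Arrays.asList(4, 6)));
        System.out.println(jack); // {4=1.0, 6=1.0}
    }
}
```

The point being that which slots are occupied carries the information, which is exactly what seq2sparse-style vectors rely on.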
>
> By the way, given that I never got any responses two months ago when I asked
> this question, I'm assuming this is probably the wrong mailing list for
> something like this -- should I have sent this to the developers mailing
> list?  Or would it have been better just to go ahead and submit a JIRA?
>
> Thanks!
>
> On Tue, Aug 16, 2011 at 3:08 AM, Sean Owen <[email protected]> wrote:
>
>> I'm not the authoritative voice here, but I would also agree with your
>> interpretation -- it's indices rather than values that I'd use.
>> I can imagine using min-hash on values, but that would not seem to be
>> the most natural thing to do.
>>
>> (I don't understand the comment about set and get(). Vectors aren't
>> sets, and whether it's sparse or not shouldn't decide whether you want
>> values or indices.)
>>
>> On Tue, Aug 16, 2011 at 7:23 AM, 刘鎏 <[email protected]> wrote:
>> > I think that if your input vector is a set, ele.get() should be used;
>> > if your input vector is a sparse vector, ele.index() should be used
>> > instead.
>> >
>> > Please correct me if I'm wrong.
>> >
>> >  for (int i = 0; i < numHashFunctions; i++) {
>> >     for (Vector.Element ele : featureVector) {
>> >
>> > /// Shouldn't the following line say ele.index();
>> >       int value = (int) ele.get();
>> >
>> >       bytesToHash[0] = (byte) (value >> 24);
>> >       bytesToHash[1] = (byte) (value >> 16);
>> >       bytesToHash[2] = (byte) (value >> 8);
>> >       bytesToHash[3] = (byte) value;
>> >       int hashIndex = hashFunction[i].hash(bytesToHash);
>> >       if (minHashValues[i] > hashIndex) {
>> >         minHashValues[i] = hashIndex;
>> >       }
>> >     }
>> >   }
>> >
>> > On Fri, Jun 10, 2011 at 6:53 AM, Jeff Hansen <[email protected]> wrote:
>> >
>> >> I'm having a little trouble understanding Mahout's minhash
>> implementation.
>> >>
>> >> Please correct me if I'm wrong, but the general intent of minhash is to
>> >> evaluate the similarity of two sparse feature vectors based on the
>> features
>> >> they have in common, not necessarily the value of those features (as the
>> >> values are often 1 or 0 and 0 values simply aren't tracked in the sparse
>> >> vector).  So given a space of 10 dimensions, if Jack had features 4 and
>> 6
>> >> and Jill had features 5 and 6, Jack's vector would look something like
>> >> {4:1, 6:1} and Jill's would look like {5:1, 6:1}.  Since they have 1 of
>> >> 3 total features in common, their Jaccard coefficient is 1/3.  Also, given K
>> random
>> >> hash functions, we would expect about a third of them to return a
>> minimum
>> >> value for each of the three keys 4, 5 and 6 and thus about a third of
>> them
>> >> would also return the same minimum value for {4, 6} and {5, 6} (i.e. the
>> >> third that return a minimum hash value for the key 6).  That's my basic
>> >> English explanation of the purpose of minhash -- again, somebody please
>> >> correct me if I'm wrong.
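The 1/3 figure above is easy to check with a tiny Jaccard calculation over plain Java sets (nothing Mahout-specific here):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {

    // Jaccard coefficient: |intersection| / |union| of the two feature sets.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<Integer> jack = new HashSet<>(Arrays.asList(4, 6));
        Set<Integer> jill = new HashSet<>(Arrays.asList(5, 6));
        System.out.println(jaccard(jack, jill)); // shared {6} out of {4, 5, 6}, so ~0.333
    }
}
```

Note the calculation only looks at which features are present -- the values never enter into it, which is the whole argument for hashing indices.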
>> >>
>> >> Given that understanding, can somebody explain why Mahout's minhash
>> >> implementation is hashing the values from the feature vectors rather
>> >> than the keys?
>> >>
>> >> See the following code from MinHashMapper.java
>> >>
>> >>    for (int i = 0; i < numHashFunctions; i++) {
>> >>      for (Vector.Element ele : featureVector) {
>> >>
>> >> /// Shouldn't the following line say ele.index();
>> >>        int value = (int) ele.get();
>> >>
>> >>        bytesToHash[0] = (byte) (value >> 24);
>> >>        bytesToHash[1] = (byte) (value >> 16);
>> >>        bytesToHash[2] = (byte) (value >> 8);
>> >>        bytesToHash[3] = (byte) value;
>> >>        int hashIndex = hashFunction[i].hash(bytesToHash);
>> >>        if (minHashValues[i] > hashIndex) {
>> >>          minHashValues[i] = hashIndex;
>> >>        }
>> >>      }
>> >>    }
>> >>
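If hashing the indices is indeed the intent, the inner loop above would read `ele.index()` instead of `(int) ele.get()`. A self-contained sketch of that variant -- with a made-up seeded integer hash standing in for Mahout's `hashFunction[i]`, and plain sets standing in for the sparse vectors -- might look like:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MinHashIndexSketch {

    // Hypothetical stand-in for a family of hash functions, one per seed.
    // (Murmur-style avalanche mixing; not the actual Mahout hash.)
    static int hash(int seed, int key) {
        int h = key ^ (seed * 0x85EBCA6B);
        h *= 0x9E3779B1;
        h ^= h >>> 15;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return h & 0x7FFFFFFF; // keep non-negative for the min comparison
    }

    // Minhash signature computed over the *indices* of the non-zero entries.
    static int[] minHash(Set<Integer> featureIndices, int numHashFunctions) {
        int[] minHashValues = new int[numHashFunctions];
        Arrays.fill(minHashValues, Integer.MAX_VALUE);
        for (int i = 0; i < numHashFunctions; i++) {
            for (int index : featureIndices) {
                int h = hash(i, index); // hash the index, not the stored value
                if (minHashValues[i] > h) {
                    minHashValues[i] = h;
                }
            }
        }
        return minHashValues;
    }

    public static void main(String[] args) {
        int[] sigJack = minHash(new HashSet<>(Arrays.asList(4, 6)), 100);
        int[] sigJill = minHash(new HashSet<>(Arrays.asList(5, 6)), 100);
        int same = 0;
        for (int i = 0; i < 100; i++) {
            if (sigJack[i] == sigJill[i]) same++;
        }
        // The fraction of matching slots estimates the Jaccard coefficient (~1/3).
        System.out.println("matching signature slots: " + same + "/100");
    }
}
```

With index-based hashing, two vectors that share feature ids share minimum hashes in roughly a Jaccard-coefficient fraction of the slots, which matches the explanation in the earlier message.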
>> >> The code in TestMinHashClustering also seems to be written with the
>> >> expectation
>> >> that minhash should be hashing the values rather than the keys.  Am I
>> >> reading something wrong here?  Is this the intended use?  Are we
>> supposed
>> >> to
>> >> be putting the feature ids into the double value fields of the feature
>> >> vectors?
>> >>
>> >> Thanks,
>> >> Jeff
>> >>
>> >
>> >
>> >
>>
>
