It's the right place -- best-effort question-answering is not always that good.
A JIRA is a good thing if you have a specific idea of the issue /
enhancement and, ideally, a proposed patch. That is tracked with some
loose regularity and so might get more attention.

On Tue, Aug 16, 2011 at 5:13 PM, Jeff Hansen <[email protected]> wrote:
> I just looked at the initial JIRA to create this implementation and saw
> the example code that uses it --
> https://issues.apache.org/jira/browse/MAHOUT-344
>
> The LastfmDataConverter class is indeed creating a vector with the
> indices stored in the values and spurious information stored in the
> indices:
>
>     Vector featureVector = new SequentialAccessSparseVector(numfeatures);
>     int i = 0;
>     for (Integer feature : itemFeature.getValue()) {
>       featureVector.setQuick(i++, feature);
>     }
>
> As such, that explains why the implementation was written the way it
> was. I imagine the implementer just didn't understand the point of
> sparse vectors, as they might as well have used a dense vector given
> their implementation. I think this would be best reimplemented using
> the actual indices of sparse vectors for consistency -- most of the
> utilities that generate vectors follow that approach (seq2sparse).
>
> By the way, given that I never got any responses two months ago when I
> asked this question, I'm assuming this is probably the wrong mailing
> list for something like this -- should I have sent this to the
> developers mailing list? Or would it have been better just to go ahead
> and submit a JIRA?
>
> Thanks!
>
> On Tue, Aug 16, 2011 at 3:08 AM, Sean Owen <[email protected]> wrote:
>
>> I'm not the authoritative voice here, but I would also agree with your
>> interpretation -- it's indices rather than values that I'd use.
>> I can imagine using min-hash on values, but that would not seem to be
>> the most natural thing to do.
>>
>> (I don't understand the comment about set and get(). Vectors aren't
>> sets, and whether it's sparse or not shouldn't decide whether you want
>> values or indices.)
>>
>> On Tue, Aug 16, 2011 at 7:23 AM, 刘鎏 <[email protected]> wrote:
>> > I think if your input vector is a set, ele.get() should be used;
>> > if instead your input vector is a sparse vector, ele.index() would
>> > be used.
>> >
>> > Pls correct me if I'm wrong.
>> >
>> >     for (int i = 0; i < numHashFunctions; i++) {
>> >       for (Vector.Element ele : featureVector) {
>> >
>> >         // Shouldn't the following line say ele.index()?
>> >         int value = (int) ele.get();
>> >
>> >         bytesToHash[0] = (byte) (value >> 24);
>> >         bytesToHash[1] = (byte) (value >> 16);
>> >         bytesToHash[2] = (byte) (value >> 8);
>> >         bytesToHash[3] = (byte) value;
>> >         int hashIndex = hashFunction[i].hash(bytesToHash);
>> >         if (minHashValues[i] > hashIndex) {
>> >           minHashValues[i] = hashIndex;
>> >         }
>> >       }
>> >     }
>> >
>> > On Fri, Jun 10, 2011 at 6:53 AM, Jeff Hansen <[email protected]> wrote:
>> >
>> >> I'm having a little trouble understanding Mahout's minhash
>> >> implementation.
>> >>
>> >> Please correct me if I'm wrong, but the general intent of minhash is
>> >> to evaluate the similarity of two sparse feature vectors based on
>> >> the features they have in common, not necessarily the value of those
>> >> features (as the values are often 1 or 0, and 0 values simply aren't
>> >> tracked in the sparse vector). So given a space of 10 dimensions, if
>> >> Jack had features 4 and 6 and Jill had features 5 and 6, Jack's
>> >> vector would look something like {4:1, 6:1} and Jill's would look
>> >> like {5:1, 6:1}. Since they have 1/3 of their total features in
>> >> common, their Jaccard coefficient is 1/3. Also, given K random hash
>> >> functions, we would expect about a third of them to return a minimum
>> >> value for each of the three keys 4, 5, and 6, and thus about a third
>> >> of them would also return the same minimum value for {4, 6} and
>> >> {5, 6} (i.e. the third that return a minimum hash value for the
>> >> key 6).
>> >> That's my basic English explanation of the purpose of minhash --
>> >> again, somebody please correct me if I'm wrong.
>> >>
>> >> Given that understanding, can somebody explain why Mahout's minhash
>> >> implementation is hashing the values from the feature vectors rather
>> >> than the keys?
>> >>
>> >> See the following code from MinHashMapper.java:
>> >>
>> >>     for (int i = 0; i < numHashFunctions; i++) {
>> >>       for (Vector.Element ele : featureVector) {
>> >>
>> >>         // Shouldn't the following line say ele.index()?
>> >>         int value = (int) ele.get();
>> >>
>> >>         bytesToHash[0] = (byte) (value >> 24);
>> >>         bytesToHash[1] = (byte) (value >> 16);
>> >>         bytesToHash[2] = (byte) (value >> 8);
>> >>         bytesToHash[3] = (byte) value;
>> >>         int hashIndex = hashFunction[i].hash(bytesToHash);
>> >>         if (minHashValues[i] > hashIndex) {
>> >>           minHashValues[i] = hashIndex;
>> >>         }
>> >>       }
>> >>     }
>> >>
>> >> The code in TestMinHashClustering also seems to be written with the
>> >> expectation that minhash should be hashing the values rather than
>> >> the keys. Am I reading something wrong here? Is this the intended
>> >> use? Are we supposed to be putting the feature ids into the double
>> >> value fields of the feature vectors?
>> >>
>> >> Thanks,
>> >> Jeff
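
[Editor's note: the Jack/Jill argument in the thread can be checked with a
small self-contained sketch. This is not Mahout's code -- the class name
MinHashSketch, the multiplicative hash, and the number of hash functions are
all invented for illustration -- but it min-hashes the feature *indices*, as
the thread argues MinHashMapper should, and the signature agreement rate
recovers the Jaccard coefficient of 1/3.]

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MinHashSketch {
    static final int NUM_HASHES = 10000; // K in the thread's notation

    // Illustrative hash family: one cheap mixed hash per function index i.
    // Mahout uses its own HashFunction implementations; this just needs to
    // behave like a random permutation of the keys.
    static int hash(int i, int key) {
        int h = key * (2 * i + 1) + i * 0x9E3779B9;
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return h;
    }

    // Min-hash signature over the set of nonzero indices of a sparse
    // vector (the role ele.index() would play in MinHashMapper).
    static int[] signature(Set<Integer> indices) {
        int[] sig = new int[NUM_HASHES];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int i = 0; i < NUM_HASHES; i++) {
            for (int idx : indices) {
                int h = hash(i, idx);
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of hash functions whose minimum agrees -- an unbiased
    // estimate of the Jaccard coefficient of the two index sets.
    static double estimateJaccard(int[] a, int[] b) {
        int agree = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                agree++;
            }
        }
        return (double) agree / a.length;
    }

    public static void main(String[] args) {
        // Jack has features 4 and 6; Jill has features 5 and 6.
        Set<Integer> jack = new HashSet<>(Arrays.asList(4, 6));
        Set<Integer> jill = new HashSet<>(Arrays.asList(5, 6));
        double est = estimateJaccard(signature(jack), signature(jill));
        // True Jaccard is 1/3; with 10000 hashes the estimate is close.
        System.out.println(est);
    }
}
```

Note that the estimate is only correct when the hashed keys identify the
features themselves; hashing the stored values, as the mapper does, only
coincides with this when the indices have been smuggled into the values, which
is exactly the LastfmDataConverter workaround described above.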
