Hello,
It seems to me that there is an issue in the MinHashMapper class. In the map
method, the loop goes over every element of the vector. In many cases the
instance of the abstract Vector class is a SparseVector, and iteration is
meant to be over the non-zero values only (e.g., documents as a sparse vector of
I think that's right, though I think the effect on correctness is quite
small, but the effect on performance is large. This will always hash zero
even if zero were not really present in the vector. That is not likely to
produce the smallest hash value though.
Hashing all those zeroes is wasteful
I agree about the performance effect of iterating over zeros. But the
correctness problem comes from hashing the value of each element rather than
its index (at least in the documents-and-words example).
Do you agree?
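
To make the point about values versus indices concrete, here is a minimal sketch in plain Java. It does not use Mahout: the Map stands in for a sparse tf-vector, and the hash function, class name and helper names are all hypothetical, chosen only for illustration. When every term frequency is 1.0, hashing the values gives the same minimum for every document, while hashing the indices does not.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ValueVsIndexHashing {

    // Hypothetical hash function, chosen only for illustration (not Mahout's).
    static long hash(long x) {
        return Math.floorMod(31L * x + 7L, 1_000_003L);
    }

    // Builds a toy tf-vector: each given term index maps to a frequency of 1.0.
    static Map<Integer, Double> binaryDoc(int... termIndices) {
        Map<Integer, Double> doc = new LinkedHashMap<>();
        for (int index : termIndices) {
            doc.put(index, 1.0);
        }
        return doc;
    }

    // Min-hash computed over the stored values (the term frequencies).
    static long minHashOfValues(Map<Integer, Double> doc) {
        long min = Long.MAX_VALUE;
        for (double value : doc.values()) {
            min = Math.min(min, hash((long) value));
        }
        return min;
    }

    // Min-hash computed over the indices (the term ids, i.e. the set members).
    static long minHashOfIndices(Map<Integer, Double> doc) {
        long min = Long.MAX_VALUE;
        for (int index : doc.keySet()) {
            min = Math.min(min, hash(index));
        }
        return min;
    }

    public static void main(String[] args) {
        Map<Integer, Double> a = binaryDoc(1, 5);
        Map<Integer, Double> b = binaryDoc(2, 9);
        // Same result for both documents, because every value is 1.0.
        System.out.println(minHashOfValues(a) == minHashOfValues(b));
        // Different results, because the documents contain different terms.
        System.out.println(minHashOfIndices(a) == minHashOfIndices(b));
    }
}
```

With these toy documents, the value-based min-hash cannot tell the two apart, which is the correctness concern for binary or near-binary tf-vectors.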
On Mon, Jul 30, 2012 at 11:58 AM, Sean Owen sro...@gmail.com wrote:
I think that's right,
Yes I know what you mean. In my understanding you typically apply minhash
to a large sparse vector that acts like a bit set, where the index is
really the set member. There you want to hash the index, and doing so by
considering all indices would be completely wrong.
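
As a rough illustration of why considering all indices is wrong in the bit-set case, the sketch below uses a plain double[] in place of a Mahout vector; the hash function and all names are assumptions made for this example. Two different vectors of the same length get the same min-hash when every index is hashed, because both present the same index set 0..n-1.

```java
public class DenseVsNonZeroIteration {

    // Hypothetical hash function, chosen only for illustration (not Mahout's).
    static long hash(long x) {
        return Math.floorMod(31L * x + 7L, 1_000_003L);
    }

    // Hashes every index, whether or not the element is present (non-zero).
    static long minHashAllIndices(double[] vector) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < vector.length; i++) {
            min = Math.min(min, hash(i));
        }
        return min;
    }

    // Hashes only the indices whose element is non-zero, i.e. the set members.
    static long minHashNonZeroIndices(double[] vector) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < vector.length; i++) {
            if (vector[i] != 0.0) {
                min = Math.min(min, hash(i));
            }
        }
        return min;
    }

    public static void main(String[] args) {
        double[] a = {0, 1, 0, 0, 1, 0}; // set {1, 4}
        double[] b = {0, 0, 1, 0, 0, 1}; // set {2, 5}
        // Identical for any two vectors of equal length.
        System.out.println(minHashAllIndices(a) == minHashAllIndices(b));
        // Depends on which indices are actually set.
        System.out.println(minHashNonZeroIndices(a) == minHashNonZeroIndices(b));
    }
}
```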
Here I think the set elements
If vectors are treated as dense, then we have to modify the example given
for this class, which clearly talks about documents and words:
https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen sro...@gmail.com wrote:
Yes I know what you
Hmm yeah what's the output format of that job? At first glance it looks
like it makes a dictionary and fills in sequentially with term frequencies.
If so, applying min-hash this way isn't wildly wrong but I still don't
think it works.
Can anyone who knows this code at all weigh in? it's not my
The input to MinHashDriver is a set of tf-vectors, e.g.:
hdfs -text //tf-vectors/part-r-0
10020926
{96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
10020927
{53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
10020928
From my reading of the code and the unit test, that is totally different
from the expected input of MinHashDriver. So I think the example should
just be deleted.
You would have to convert this to something like:
{0:96037,1:114292,2:119745,...}
to make it work correctly.
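
A minimal sketch of that conversion, assuming the tf-vector is available as an ordered map from term id to frequency. This is plain Java rather than Mahout code, and the class and method names are hypothetical. Each non-zero term id is moved into the value slot at a sequential position, so that code which hashes values ends up hashing term ids.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PositionalVectorSketch {

    // Turns {termId: tf, ...} into {0: termId, 1: termId, ...}.
    // Entries with a zero frequency are skipped; input order is preserved.
    static Map<Integer, Double> toPositional(Map<Integer, Double> tfVector) {
        Map<Integer, Double> positional = new LinkedHashMap<>();
        int position = 0;
        for (Map.Entry<Integer, Double> entry : tfVector.entrySet()) {
            if (entry.getValue() != 0.0) {
                positional.put(position++, entry.getKey().doubleValue());
            }
        }
        return positional;
    }

    public static void main(String[] args) {
        Map<Integer, Double> tf = new LinkedHashMap<>();
        tf.put(96037, 1.0);
        tf.put(114292, 1.0);
        tf.put(119745, 1.0);
        System.out.println(toPositional(tf));
    }
}
```

The term frequencies themselves are discarded by this conversion, which matches treating each document as a set of terms.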
On Mon, Jul 30, 2012 at
OK. But I preferred to modify a couple of lines of MinHashMapper rather
than writing my own sparse vector generator:
...
for (int i = 0; i < numHashFunctions; i++) {
  Iterator<Vector.Element> it = featureVector.iterateNonZero();
  while (it.hasNext()) {
Definitely, that works for the example, though it then fails for the unit
test.
On Mon, Jul 30, 2012 at 1:41 PM, Elena Smirnova esmirno...@gmail.com wrote:
OK. But I preferred to modify a couple of lines of MinHashMapper rather
than writing my own sparse vector generator:
...
[ https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425091#comment-13425091 ]
Pat Ferrel commented on MAHOUT-1045:
I've had a chance to run this on several data
[ https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425131#comment-13425131 ]
Jeff Eastman commented on MAHOUT-1045:
--
Good news. I will try to get this patch
Sean,
I'll take care of this. I added this sometime last year but was never convinced
that it ever worked right.
From: Sean Owen sro...@gmail.com
To: dev@mahout.apache.org
Sent: Monday, July 30, 2012 5:18 PM
Subject: Re: Possible issue in MinHashMapper
Done.
From: Suneel Marthi suneel_mar...@yahoo.com
To: dev@mahout.apache.org
Sent: Monday, July 30, 2012 6:44 PM
Subject: Re: Possible issue in MinHashMapper
Sean,
I'll take care of this, I added this sometime last year but was never
See https://builds.apache.org/job/Mahout-Quality/1604/