Possible issue in MinHashMapper

2012-07-30 Thread Elena Smirnova
Hello, It seems to me that there is an issue in the MinHashMapper class. In the map method, the loop goes over the elements in the vector. In many cases the instance of the Vector abstract class is a SparseVector, and iteration is meant to be over non-zero values (e.g., documents as a sparse vector of

Re: Possible issue in MinHashMapper

2012-07-30 Thread Sean Owen
I think that's right, though I think the effect on correctness is quite small, but the effect on performance is large. This will always hash zero even if zero were not really present in the vector. That is not likely to produce the smallest hash value though. Hashing all those zeroes is wasteful

Re: Possible issue in MinHashMapper

2012-07-30 Thread Elena Smirnova
I agree about the performance effect of iterating over zeros. But the correctness effect comes from hashing the value of each element and not its index (at least in the documents-and-words example). Do you agree? On Mon, Jul 30, 2012 at 11:58 AM, Sean Owen sro...@gmail.com wrote: I think that's right,
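To make the correctness point concrete, here is a minimal sketch, not taken from the Mahout source, that contrasts the two strategies on binary term vectors. The class name, the method names, and the linear hash are all assumptions made for illustration; Mahout's own hash functions differ.

```java
final class ValueVersusIndexHashing {
    // Illustrative linear hash (assumption); not one of Mahout's hash functions.
    static long hash(long x, long a, long b) {
        return Math.floorMod(a * x + b, 2147483647L);
    }

    /** Min hash over every position of the array, hashing the stored value. */
    static long minOverValues(double[] vector, long a, long b) {
        long min = Long.MAX_VALUE;
        for (double v : vector) {
            min = Math.min(min, hash((long) v, a, b));
        }
        return min;
    }

    /** Min hash over non-zero positions only, hashing the position (the term id). */
    static long minOverIndices(double[] vector, long a, long b) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < vector.length; i++) {
            if (vector[i] != 0.0) {
                min = Math.min(min, hash(i, a, b));
            }
        }
        return min;
    }
}
```

With binary term vectors the only values are 0.0 and 1.0, so minOverValues returns the same result for any two documents that each contain at least one zero and one non-zero entry, whatever terms they hold. minOverIndices depends on which terms are present.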

Re: Possible issue in MinHashMapper

2012-07-30 Thread Sean Owen
Yes I know what you mean. In my understanding you typically apply minhash to a large sparse vector that acts like a bit set, where the index is really the set member. There you want to hash the index, and doing so by considering all indices would be completely wrong. Here I think the set elements

Re: Possible issue in MinHashMapper

2012-07-30 Thread Elena Smirnova
If vectors are treated as dense, then we have to modify the example given for this class, which clearly talks about documents and words: https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen sro...@gmail.com wrote: Yes I know what you

Re: Possible issue in MinHashMapper

2012-07-30 Thread Sean Owen
Hmm yeah what's the output format of that job? At first glance it looks like it makes a dictionary and fills in sequentially with term frequencies. If so, applying min-hash this way isn't wildly wrong but I still don't think it works. Can anyone who knows this code at all weigh in? it's not my

Re: Possible issue in MinHashMapper

2012-07-30 Thread Elena Smirnova
The input to the MinHashDriver are tf-vectors, e.g.: hdfs -text //tf-vectors/part-r-0 10020926 {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0} 10020927 {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0} 10020928

Re: Possible issue in MinHashMapper

2012-07-30 Thread Sean Owen
From my reading of the code and the unit test, that is totally different from the expected input of MinHashDriver. So I think the example should just be deleted. You would have to convert this to something like: {0:96037,1:114292,2:119745,...} to make it work correctly. On Mon, Jul 30, 2012 at
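The conversion described here could be sketched roughly as follows. This is an assumption-laden illustration, not Mahout code: the sparse vector is modelled as a plain Map from term index to term frequency, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PositionalVectorSketch {
    /**
     * Rewrites {termIndex: tf, ...} as {0: termIndex, 1: termIndex, ...}
     * so that the term ids become the values of the vector.
     * Entries whose frequency is zero are skipped.
     */
    static Map<Integer, Double> toPositional(Map<Integer, Double> tfVector) {
        Map<Integer, Double> positional = new LinkedHashMap<>();
        int position = 0;
        for (Map.Entry<Integer, Double> entry : tfVector.entrySet()) {
            if (entry.getValue() != 0.0) {
                positional.put(position++, entry.getKey().doubleValue());
            }
        }
        return positional;
    }
}
```

Positions follow the iteration order of the input map, so a LinkedHashMap input keeps the order in which terms were inserted. The term frequencies themselves are discarded by this rewrite.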

Re: Possible issue in MinHashMapper

2012-07-30 Thread Elena Smirnova
OK. But I preferred to modify a couple of lines of MinHashMapper rather than write my own sparse-vector generator: ... for (int i = 0; i < numHashFunctions; i++) { Iterator it = featureVector.iterateNonZero(); while (it.hasNext()) {
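The message above is cut off, so the rest of the patch is not reproduced here. As a stand-alone sketch of the same loop shape, without any Mahout dependency, the following models the sparse vector as a Map from index to value and the hash functions as IntUnaryOperator instances. Both are assumptions for illustration, as are the class and method names.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.IntUnaryOperator;

final class NonZeroMinHashSketch {
    /**
     * For each hash function, keeps the smallest hash seen over the
     * indices of the non-zero entries. Zero entries are never hashed.
     */
    static int[] minHashValues(Map<Integer, Double> sparseVector,
                               IntUnaryOperator[] hashFunctions) {
        int[] minHashValues = new int[hashFunctions.length];
        Arrays.fill(minHashValues, Integer.MAX_VALUE);
        for (int i = 0; i < hashFunctions.length; i++) {
            for (Map.Entry<Integer, Double> entry : sparseVector.entrySet()) {
                if (entry.getValue() == 0.0) {
                    continue; // skip explicit zeros; only set members count
                }
                int hashed = hashFunctions[i].applyAsInt(entry.getKey());
                if (hashed < minHashValues[i]) {
                    minHashValues[i] = hashed;
                }
            }
        }
        return minHashValues;
    }
}
```

The work per document is proportional to the number of non-zero entries times the number of hash functions, rather than to the full vector cardinality.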

Re: Possible issue in MinHashMapper

2012-07-30 Thread Sean Owen
Definitely, that works for the example, though it then fails for the unit test. On Mon, Jul 30, 2012 at 1:41 PM, Elena Smirnova esmirno...@gmail.com wrote: OK. But I preferred to modify a couple of lines of MinHashMapper rather than writing my own spaseVectors generator: ...

[jira] [Commented] (MAHOUT-1045) Cluster evaluators returning bad results

2012-07-30 Thread Pat Ferrel (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425091#comment-13425091 ] Pat Ferrel commented on MAHOUT-1045: I've had a chance to run this on several data

[jira] [Commented] (MAHOUT-1045) Cluster evaluators returning bad results

2012-07-30 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425131#comment-13425131 ] Jeff Eastman commented on MAHOUT-1045: Good news. I will try to get this patch

Re: Possible issue in MinHashMapper

2012-07-30 Thread Suneel Marthi
Sean, I'll take care of this, I added this sometime last year but was never convinced that it ever worked right. From: Sean Owen sro...@gmail.com To: dev@mahout.apache.org Sent: Monday, July 30, 2012 5:18 PM Subject: Re: Possible issue in MinHashMapper

Re: Possible issue in MinHashMapper

2012-07-30 Thread Suneel Marthi
Done. From: Suneel Marthi suneel_mar...@yahoo.com To: dev@mahout.apache.org Sent: Monday, July 30, 2012 6:44 PM Subject: Re: Possible issue in MinHashMapper Sean, I'll take care of this, I added this sometime last year but was never

Jenkins build is still unstable: Mahout-Quality #1604

2012-07-30 Thread Apache Jenkins Server
See https://builds.apache.org/job/Mahout-Quality/1604/