Re: [SPARK ML] Minhash integer overflow

2018-07-07 Thread Kazuaki Ishizaki
value generated by Spark with the one generated by other implementations? Regards, Kazuaki Ishizaki From: Sean Owen To: jiayuanm Cc: dev@spark.apache.org Date: 2018/07/07 15:46 Subject: Re: [SPARK ML] Minhash integer overflow I think it probably still does its job; the hash

Re: [SPARK ML] Minhash integer overflow

2018-07-07 Thread Sean Owen
I think it probably still does its job; the hash value can just be negative. It is likely to be very slightly biased, though. Because the intent doesn't seem to be to allow the overflow, it's worth changing to use longs for the calculation. On Fri, Jul 6, 2018, 8:36 PM jiayuanm wrote: > Hi
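To illustrate the point about negative hash values and the suggested switch to longs, here is a minimal sketch. It assumes the hash has the form ((1 + elem) * a + b) mod p; the object name, the method names, the prime, and the sample coefficients are all hypothetical and chosen only to trigger the overflow. It is not the code from MinHashLSH.scala.

```scala
object MinHashOverflowSketch {
  // Illustrative prime modulus (an assumption, not quoted from the thread).
  val HashPrime: Int = 2038074743

  // Int arithmetic: (1 + elem) * a + b can wrap around past Int.MaxValue,
  // and % on the JVM keeps the sign of the dividend, so the result may be negative.
  def hashInt(elem: Int, a: Int, b: Int): Int =
    ((1 + elem) * a + b) % HashPrime

  // Long arithmetic: starting from 1L promotes the whole expression to Long,
  // so the intermediate product no longer overflows for Int inputs.
  def hashLong(elem: Int, a: Int, b: Int): Long =
    ((1L + elem) * a + b) % HashPrime

  def main(args: Array[String]): Unit = {
    // Hypothetical coefficients, large enough that 2 * a exceeds Int.MaxValue.
    val (elem, a, b) = (1, 2000000000, 5)
    println(s"Int arithmetic:  ${hashInt(elem, a, b)}")
    println(s"Long arithmetic: ${hashLong(elem, a, b)}")
  }
}
```

With these sample inputs the Int version returns a negative value, while the Long version stays in the range [0, p). Whether the real fix should keep Int results or return Long is a design choice left to the PR.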

Re: [SPARK ML] Minhash integer overflow

2018-07-06 Thread jiayuanm
Sure. JIRA ticket is here: https://issues.apache.org/jira/browse/SPARK-24754. I'll create the PR.

Re: [SPARK ML] Minhash integer overflow

2018-07-06 Thread Kazuaki Ishizaki
@spark.apache.org Date: 2018/07/07 10:36 Subject: [SPARK ML] Minhash integer overflow Hi everyone, I was playing around with LSH/Minhash module from spark ml module. I noticed that hash computation is done with Int (see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache

[SPARK ML] Minhash integer overflow

2018-07-06 Thread jiayuanm
Hi everyone, I was playing around with the LSH/MinHash module in Spark ML. I noticed that the hash computation is done with Int (see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69). Since "a" and "b" are from a uniform