[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88569315 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSHSuite.scala --- @@ -115,64 +117,83 @@ class RandomProjectionSuite

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88569321 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala --- @@ -86,9 +94,24 @@ class MinHashSuite extends SparkFunSuite

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88569056 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala --- @@ -147,15 +151,17 @@ class RandomProjection(override val

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88569084 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -31,36 +31,34 @@ import org.apache.spark.sql.types.StructType

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88569066 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -31,36 +31,34 @@ import org.apache.spark.sql.types.StructType

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88169546 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -179,16 +211,13 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88150618 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -179,16 +211,13 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88129780 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala --- @@ -24,7 +24,7 @@ import org.apache.spark.ml.util.DefaultReadWriteTest

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88129663 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -179,16 +211,13 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88129409 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -106,22 +123,24 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88128756 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -144,12 +152,12 @@ class MinHash(override val uid: String) extends LSH

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88128823 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -66,10 +66,10 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]] s

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88128732 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -125,11 +125,11 @@ class MinHash(override val uid: String) extends LSH

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88128687 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -74,9 +72,12 @@ class MinHashModel private[ml

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88128341 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -74,9 +72,12 @@ class MinHashModel private[ml

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88128287 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType @Since

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88128199 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -31,13 +31,9 @@ import org.apache.spark.sql.types.StructType

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88128252 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType @Since

[GitHub] spark issue #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15874 Thanks, @sethah. I have reverted "AND-amplification" related changes. PTAL. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as wel

[GitHub] spark pull request #15800: [SPARK-18334] MinHash should use binary hash dist...

2016-11-13 Thread Yunni
Github user Yunni closed the pull request at: https://github.com/apache/spark/pull/15800

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-13 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15800 OK. Abandoning this PR since we are making MultiProbe NN Search and `hashDistance` private. Related changes are included in #15874

[GitHub] spark pull request #15874: Spark 18408 yunn api improvements

2016-11-13 Thread Yunni
GitHub user Yunni opened a pull request: https://github.com/apache/spark/pull/15874 Spark 18408 yunn api improvements ## What changes were proposed in this pull request? (1) Change output schema to `Array of Vector` instead of `Vectors` (2) Use `numHashTables

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-11 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15800 @MLnick Thanks! That's very good to know! @sethah I agree with your comments. @jkbradley If you don't have any objection, shall I remove MultiProbe NN Search and `hashDistance`, so we

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15800 @jkbradley I agree with your idea to get rid of full sorting and use `approxQuantile` to find the threshold. Doing a full sort on whole dataset hurts a lot in performance. Please file a ticket
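To illustrate the idea of replacing a full sort with an approximate threshold, here is a minimal plain-Python sketch. It is not Spark's `approxQuantile`; it substitutes simple random sampling for the quantile estimate, and the names `approx_threshold`, `candidates_within`, `sample_size`, and `seed` are hypothetical.

```python
import random
from typing import List


def approx_threshold(distances: List[float], k: int,
                     sample_size: int = 1000, seed: int = 0) -> float:
    """Estimate the distance below which roughly k of the points fall.

    Only a random sample is sorted, never the whole dataset.
    (Assumption: sampling stands in for a real approximate-quantile sketch.)
    """
    n = len(distances)
    if n == 0 or k <= 0:
        raise ValueError("need a non-empty dataset and a positive k")
    rng = random.Random(seed)
    sample = distances if n <= sample_size else rng.sample(distances, sample_size)
    ordered = sorted(sample)  # sorts the sample only
    # Rank in the sample that corresponds to the k/n quantile (ceiling, 1-based).
    rank = min(len(ordered), (k * len(ordered) + n - 1) // n)
    return ordered[max(0, rank - 1)]


def candidates_within(distances: List[float], threshold: float) -> List[int]:
    """Single linear pass: indices whose distance does not exceed the threshold."""
    return [i for i, d in enumerate(distances) if d <= threshold]


if __name__ == "__main__":
    dists = [float(i) for i in range(100)]
    t = approx_threshold(dists, k=10)
    print(t, len(candidates_within(dists, t)))
```

When the dataset is larger than `sample_size`, the threshold is only an estimate, so the number of surviving candidates can be somewhat above or below `k`.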

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15800 > One way to look at it is that (a) will contain many duplicates in the L sets of points, so (b) is more likely to have higher precision and recall. I think this might be the place

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15800 @sethah That sounds good to me, except that there is no `posexplode()` in Spark AFAIK. Do you think `hashDistance(x: Array[Vector], y: Array[Vector])` is a better workaround, or should we still use

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15800 > If a query point vector q hashes to some MinHash Vector [5.0, 22.0, 13.0] the best candidates will be ones that hash to that same vector. My second half is suggesting: If a query po

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-09 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15800 Hi @jkbradley, I agree with your claim on estimating Jaccard similarity, but it looks like your `L` and `k` have the same effect on the performance. Consider a case when we want to trade

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-09 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15148 Thanks for the discussion, everyone! I will take a look at the JIRA.

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-09 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15800 @jkbradley There are 2 reasons I don't think averaging indicators is a good hashDistance for the current implementation. (1) SingleProbe NN performance relies on OR-amplification, changing

[GitHub] spark pull request #15800: [SPARK-18334] MinHash should use binary hash dist...

2016-11-09 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15800#discussion_r87298552 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -32,13 +32,7 @@ import org.apache.spark.sql.types.StructType

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-09 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15800 @jkbradley Averaging indicators makes more sense for an AND-amplified MinHash function. The hash distance is 0 when all hash values are equal, and grows as more hash values differ. As we
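The averaged-indicator distance described in this comment can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the code from the pull request: the function name `averaged_indicator_distance` is hypothetical, and hash signatures are modelled as equal-length lists of numbers.

```python
from typing import Sequence


def averaged_indicator_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Fraction of hash positions at which the two signatures disagree.

    Returns 0.0 when every hash value matches and 1.0 when none do.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("hash signatures must be non-empty and of equal length")
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


if __name__ == "__main__":
    query = [5.0, 22.0, 13.0]
    print(averaged_indicator_distance(query, [5.0, 22.0, 13.0]))  # all positions equal
    print(averaged_indicator_distance(query, [5.0, 22.0, 99.0]))  # one of three differs
```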

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-08 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15800 @sethah Not exactly. Based on the logic in `approxNearestNeighbor`, if there aren't enough candidates where the distance is zero, we'll scan the whole dataset. I don't think multi

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-07 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15148 @jkbradley I agree with most of your comments above. And I would like to suggest the following: - I would recommend a more intuitive name like `HyperplaneProjection` instead of `PStableHashing

[GitHub] spark issue #15795: [SPARK-18081] Add user guide for Locality Sensitive Hash...

2016-11-07 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15795 @bravo-zhang @srowen I am OK with using the example in #15787. But I still think `approxNearestNeighbor` and `approxSimilarityJoin` are different algorithms and it would be easier for user

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r86889880 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/RandomProjectionExample.scala --- @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r86889863 --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java --- @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r86889844 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,134 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r86889702 --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java --- @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r86889774 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,134 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r86877678 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,134 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15800: [SPARK-18334] MinHash should use binary hash dist...

2016-11-07 Thread Yunni
GitHub user Yunni opened a pull request: https://github.com/apache/spark/pull/15800 [SPARK-18334] MinHash should use binary hash distance ## What changes were proposed in this pull request? MinHash currently is using the same `hashDistance` function as RandomProjection

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-07 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15148 @sethah Yes, that's why `outputDim` is introduced for users to trade off between false negative rate and running time. During my tests, LSH without amplification can be (0.5, 0.5)-sensitive

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-06 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r86724596 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,194 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-06 Thread Yunni
GitHub user Yunni opened a pull request: https://github.com/apache/spark/pull/15795 [SPARK-18081] Add user guide for Locality Sensitive Hashing(LSH) ## What changes were proposed in this pull request? The user guide for LSH is added to ml-features.md, with several Scala/Java

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-06 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15148 @sethah I think you are right. OR-amplification is only applied inside NN search and similarity join through `hashDistance` and `explode`. `transform` itself does not apply amplifications
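As a rough illustration of what OR-amplification means in this discussion, the plain-Python sketch below treats two points as candidates when at least one of their hash values agrees, and contrasts that with AND-amplification. It is not the Spark implementation; the function names are hypothetical, and the probability formula assumes the hash functions collide independently with the same probability `p`.

```python
from typing import Sequence


def or_amplified_match(x: Sequence[float], y: Sequence[float]) -> bool:
    """Candidate pair if ANY hash position agrees (OR-amplification)."""
    return any(a == b for a, b in zip(x, y))


def and_amplified_match(x: Sequence[float], y: Sequence[float]) -> bool:
    """Candidate pair only if ALL hash positions agree (AND-amplification)."""
    return all(a == b for a, b in zip(x, y))


def or_collision_probability(p: float, num_hashes: int) -> float:
    """Chance that at least one of num_hashes independent hashes collides."""
    return 1.0 - (1.0 - p) ** num_hashes


if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0], [9.0, 2.0, 8.0]
    print(or_amplified_match(a, b), and_amplified_match(a, b))
    print(or_collision_probability(0.5, 3))
```

Under the independence assumption, adding more hash functions under OR-amplification lowers the false negative rate while producing more candidates to check.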

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-28 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15148 Awesome! Thanks Joseph and thanks everyone else for reviewing this! 👍

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-28 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r85591762 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,194 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-27 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r85459596 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,336 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-27 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r85459447 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,339 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-27 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r85444756 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,215 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-27 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r85444781 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,192 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-27 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r85424257 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,215 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-27 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r85418671 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,186 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-27 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r85417885 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,339 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-26 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15148 Thanks @jkbradley. I have made several changes to unit tests. Please let me know if I missed any.

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-26 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r85248006 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RandomProjectionSuite.scala --- @@ -0,0 +1,148 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-26 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r85248016 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,146 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-26 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r85247717 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-22 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84586831 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-22 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84586829 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-22 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15148 Thanks @jkbradley. I have removed BitSampling and SignRandomProjection for a follow-up PR.

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-13 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15148 I have no idea how to solve this MiMa test failure. Could anyone give me a clue? ``` java.lang.ArrayIndexOutOfBoundsException: 1660 at com.typesafe.tools.mima.core.BufferReader.nextByte

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-13 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r83149648 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-11 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82871238 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,146 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82726587 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722577 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722244 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722195 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,159 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722184 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722187 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722189 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722181 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722185 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82722177 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82721024 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,339 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82676608 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635922 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635900 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635989 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,339 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635943 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635973 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,339 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635955 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635937 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635871 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635849 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635887 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635879 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635859 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635828 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635792 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635810 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635817 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635840 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82635804 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-10 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82619926 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,339 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-09 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82539368 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,339 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-09 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82534311 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,339 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-05 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15148 @jkbradley Take your time for the code review. :) I will be working on the open dataset testing at the same time.

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-05 Thread Yunni
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82027195 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,334 @@ +/* + * Licensed to the Apache Software Foundation (ASF
