Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16966
@MLnick @jkbradley @sethah Could you take a look? Thanks!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/17092
@MLnick @jkbradley @sethah Could you take a look? Thanks!
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/17092
Ping.
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16966
Ping.
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/17092
@jkbradley @sethah Please review when you have time. Thanks!
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16966
@MLnick @jkbradley Please review when you have time. Thanks!
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/17104
@srowen The full name works. Just want to make the comments shorter so that
it's easier to read.
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/17092
@jkbradley @MLnick Here is a clean PR. Sorry for messing up the previous
one!
@merlintang I am happy to continue our discussion here:
https://issues.apache.org/jira/browse/SPARK-19771
GitHub user Yunni opened a pull request:
https://github.com/apache/spark/pull/17104
[MINOR][ML] Fix comments in LSH Examples and Python API
## What changes were proposed in this pull request?
Remove `org.apache.spark.examples.` in
Add a slash in one of the Python docs
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r103361528
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,196 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
GitHub user Yunni opened a pull request:
https://github.com/apache/spark/pull/17092
[SPARK-18450][ML] Scala API Change for LSH AND-amplification
## What changes were proposed in this pull request?
Implemented a new Param numHashFunctions as the dimension of
AND-amplification
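As a rough illustration of what AND-amplification changes (this is not taken from the PR itself): if a single hash function collides for two points with probability p, then requiring all k functions in a table to agree (AND) and accepting a match in any of L tables (OR) gives the standard collision probability 1 - (1 - p^k)^L. The sketch below assumes that k corresponds to `numHashFunctions` and L to `numHashTables`, and that the hash functions are independent; the function name is hypothetical.

```python
def and_or_collision_probability(p: float, k: int, L: int) -> float:
    """Probability that two points share at least one of L hash tables,
    when each table requires k independent hash functions to agree.

    p: collision probability of one hash function (independence is assumed).
    k: hash functions per table (AND); assumed to map to numHashFunctions.
    L: number of hash tables (OR); assumed to map to numHashTables.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1 or L < 1:
        raise ValueError("k and L must be positive integers")
    # AND within a table: all k functions collide with probability p ** k.
    # OR across tables: at least one of the L tables collides.
    return 1.0 - (1.0 - p ** k) ** L
```

Raising k lowers the collision probability for every pair (fewer false positives), while raising L raises it (fewer false negatives), which is why the two parameters are usually tuned together.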
Github user Yunni closed the pull request at:
https://github.com/apache/spark/pull/16965
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
Looks like the rebase is making it even worse. I will reopen a PR.
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16966
@MLnick I did some experiments with WEX datasets. I have put the results in
the description.
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
The number of rows would be O(LN). The memory usage will be different as
the size of each row has changed before and after the explode. Also the
Catalyst Optimizer may do projections during join
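The O(LN) row count can be seen in a toy, pure-Python sketch of the explode step. This is only an illustration with made-up data and a hypothetical helper name, not Spark's implementation: each of the N input rows carries L hash values, and exploding produces one output row per (row, hash table) pair.

```python
from typing import Iterable, List, Tuple

def explode_hashes(
    rows: Iterable[Tuple[str, List[int]]]
) -> List[Tuple[str, int, int]]:
    """Turn each (id, [h_0, ..., h_{L-1}]) row into L rows of
    (id, table_index, hash_value)."""
    exploded = []
    for row_id, hashes in rows:
        for table_index, hash_value in enumerate(hashes):
            exploded.append((row_id, table_index, hash_value))
    return exploded

# Toy input: N = 3 rows, L = 3 hash tables each (values are arbitrary).
rows = [("a", [3, 7, 1]), ("b", [3, 2, 1]), ("c", [5, 2, 9])]
exploded = explode_hashes(rows)
```

Here `exploded` has N * L entries, but each entry is narrower than an input row, which is the sense in which the row count and the memory footprint can move differently.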
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
@merlintang Not exactly. Each row will explode to L rows, where L is the
number of hash tables. Like the following
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
@merlintang
(1) `hashDistance` is only used for multi-probe NN Search. The terms
`numHashTables` and `numHashFunctions` are very hard to interpret in OR-AND cases.
(2) For similarity join, we
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
@merlintang Sorry I still don't quite get why we need to support OR-AND
when the effective threshold is low. My understanding is that we can always
tune numHashTables and numHashFunctions
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
@merlintang We use AND-OR in both approxNearestNeighbor and
approxSimilarityJoin, and it's more difficult for approxSimilarityJoin to adopt
OR-AND than AND-OR.
My understanding: for a (d1
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16715
Hi @e-m-m, I think the Python API will be included in Spark 2.2.
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16966#discussion_r102065786
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -147,6 +148,15 @@ private[ml] abstract class LSHModel[T <: LSHMode
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16966#discussion_r101832855
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -147,6 +148,15 @@ private[ml] abstract class LSHModel[T <: LSHMode
GitHub user Yunni opened a pull request:
https://github.com/apache/spark/pull/16966
[SPARK-18409][ML]LSH approxNearestNeighbors should use approxQuantile
instead of sort
## What changes were proposed in this pull request?
In the previous implementation of LSH approxNearestNeighbors
GitHub user Yunni opened a pull request:
https://github.com/apache/spark/pull/16965
[Spark-18450][ML] Scala API Change for LSH AND-amplification
## What changes were proposed in this pull request?
Implemented a new Param numHashFunctions as the dimension of
AND-amplification
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16715
Sure. Will do.
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16715
@sethah Really appreciate your detailed code review and comments. :)
@MLnick @yanboliang Thank you for the help as well. Please let me know if
you guys have any other comments.
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r101089800
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -222,17 +222,18 @@ private[ml] abstract class LSHModel[T <: LSHMode
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r101089807
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala ---
@@ -37,38 +43,45 @@ object MinHashLSHExample {
(0
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r101089762
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +945,103 @@ def maxAbs(self):
@inherit_doc
+class MinHashLSH(JavaEstimator
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100966548
--- Diff: docs/ml-features.md ---
@@ -1558,6 +1558,15 @@ for more details on the API.
{% include_example
java/org/apache/spark/examples/ml
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100966555
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100966541
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala ---
@@ -37,38 +38,44 @@ object MinHashLSHExample {
(0
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100966552
--- Diff:
examples/src/main/python/ml/bucketed_random_projection_lsh_example.py ---
@@ -0,0 +1,81 @@
+#
+# Licensed to the Apache Software Foundation
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100966561
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala
---
@@ -38,40 +39,45 @@ object
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100966545
--- Diff:
examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java
---
@@ -35,6 +35,8 @@
import
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100966554
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100966534
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100966539
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala
---
@@ -111,8 +111,8 @@ class BucketedRandomProjectionLSHModel
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100966530
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100966537
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16715
Jenkins retest this please
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100199037
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100198559
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192059
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100193058
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
@inherit_doc
+class MinHashLSH(JavaEstimator
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100193020
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
@inherit_doc
+class MinHashLSH(JavaEstimator
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100193043
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
@inherit_doc
+class MinHashLSH(JavaEstimator
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192985
--- Diff: examples/src/main/python/ml/bucketed_random_projection_lsh.py ---
@@ -0,0 +1,76 @@
+#
+# Licensed to the Apache Software Foundation (ASF
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192933
--- Diff: examples/src/main/python/ml/bucketed_random_projection_lsh.py ---
@@ -0,0 +1,76 @@
+#
+# Licensed to the Apache Software Foundation (ASF
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192881
--- Diff: examples/src/main/python/ml/min_hash_lsh.py ---
@@ -0,0 +1,75 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192685
--- Diff: examples/src/main/python/ml/min_hash_lsh.py ---
@@ -0,0 +1,75 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192402
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192347
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192333
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192298
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192314
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192074
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100192026
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16715
@yanboliang, just a friendly reminder: please don't forget to review the PR
when you have time. Thanks!
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16715
Thanks very much, @yanboliang ~~
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16715
@yanboliang @jkbradley Please take a look. Thanks!
GitHub user Yunni opened a pull request:
https://github.com/apache/spark/pull/16715
[Spark-18080][ML] Python API & Examples for Locality Sensitive Hashing
## What changes were proposed in this pull request?
This pull request includes Python API and examples for LSH. The
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15795
@MLnick @jkbradley I have changed the examples to be 1 example per
algorithm which does transform, approxNearestNeighbor, and
approxSimilarityJoin. PTAL.
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90736878
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90736862
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90736883
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala
---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90736839
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90736831
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90736852
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90736546
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90736531
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90736515
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90736506
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15795
@sethah I think so. I have made changes for the docs but I haven't made
changes to the examples. Please take a look when you get a chance.
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15874
@jkbradley If you don't have more comments, can we merge this because I
need to change the examples in #15795 ?
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r89711243
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r89711247
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r89711255
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r89711233
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r89711231
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r89711207
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r89711204
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r89711166
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r89711162
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r89711165
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15874
Thanks @sethah ! Your comment was very helpful and detailed :-)
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15874
@sethah PTAL
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15874#discussion_r89215405
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala
---
@@ -31,36 +31,38 @@ import org.apache.spark.sql.types.StructType
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15874#discussion_r89215190
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala
---
@@ -112,25 +116,26 @@ class MinHash(override val uid: String) extends
LSH
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15874#discussion_r89215142
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala ---
@@ -97,12 +118,31 @@ class MinHashSuite extends SparkFunSuite
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15874#discussion_r89175604
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSHSuite.scala
---
@@ -43,70 +43,73 @@ class RandomProjectionSuite
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15874#discussion_r89175438
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -155,8 +148,30 @@ private[ml] abstract class LSHModel[T <: LSHMode
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15874#discussion_r89175473
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -155,8 +148,30 @@ private[ml] abstract class LSHModel[T <: LSHMode
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15874#discussion_r89175497
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala
---
@@ -31,36 +31,38 @@ import org.apache.spark.sql.types.StructType
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15874#discussion_r89175448
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -155,8 +148,30 @@ private[ml] abstract class LSHModel[T <: LSHMode
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15874#discussion_r89175401
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala
---
@@ -31,36 +31,38 @@ import org.apache.spark.sql.types.StructType
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15874
Hi @sethah, grouping to a number of buckets does not really affect the
independence since p is a much larger prime. For example, in
http://people.csail.mit.edu/mip/papers/kwise-lb/kwise-lb.pdf
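To make the point about bucketing concrete, here is a minimal sketch of a hash family of the form h(x) = ((a*x + b) mod p) mod m, the classic Carter-Wegman style construction, where p is a prime much larger than the number of buckets m. The specific prime, the bucket count, and the function names are assumptions chosen for illustration; they are not values from the PR.

```python
import random

# Assumed large prime for illustration: the Mersenne prime 2**31 - 1.
P = 2_147_483_647

def make_hash(num_buckets: int, rng: random.Random):
    """Draw one function h(x) = ((a*x + b) mod P) mod num_buckets."""
    a = rng.randrange(1, P)  # a must be non-zero
    b = rng.randrange(0, P)

    def h(x: int) -> int:
        # Hash into [0, P) first, then fold into the bucket range.
        return ((a * x + b) % P) % num_buckets

    return h

rng = random.Random(0)  # fixed seed so the drawn functions are reproducible
hashes = [make_hash(16, rng) for _ in range(4)]
signature = [h(42) for h in hashes]
```

Because P is far larger than the 16 buckets used here, the final `% num_buckets` step distorts the distribution over buckets only slightly, which is the intuition behind the comment above.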
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15874
@jkbradley Awesome, thanks so much! :) Now that the API is finalized, I
will work on the User Doc
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15874
Hi @jkbradley,
**MinHash**
Yes, I agree that I shouldn't have said it's perfect hashing.
Theoretically, it should be a Min-wise Independent Permutation Family. What we
used here is 2
Github user Yunni commented on a diff in the pull request:
https://github.com/apache/spark/pull/15874#discussion_r88569303
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSHSuite.scala
---
@@ -43,70 +43,72 @@ class RandomProjectionSuite