[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/16965 Looks like the rebase is making it even worse. I will reopen a PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16965 **[Test build #73538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73538/testReport)** for PR 16965 at commit

[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/16965 Github isn't handling the merge well, so you might try rebasing

[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16965 Merged build finished. Test PASSed.

[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16965 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73512/ Test PASSed.

[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16965 **[Test build #73512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73512/testReport)** for PR 16965 at commit

[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16965 **[Test build #73512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73512/testReport)** for PR 16965 at commit

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-26 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/16965 The number of rows would be O(LN). The memory usage will be different as the size of each row has changed before and after the explode. Also the Catalyst Optimizer may do projections during join
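
The row-count claim above can be illustrated with a minimal sketch in plain Python (no Spark). It assumes each input row carries one hash value per hash table; the function name `explode_hashes` and the sample data are hypothetical and are not taken from the PR.

```python
from typing import List, Tuple


def explode_hashes(rows: List[Tuple[int, List[int]]]) -> List[Tuple[int, int, int]]:
    """Turn each (row_id, [h_0, ..., h_{L-1}]) into L rows of
    (row_id, table_index, hash_value), mimicking an explode step."""
    return [
        (row_id, table_index, hash_value)
        for row_id, hashes in rows
        for table_index, hash_value in enumerate(hashes)
    ]


# Hypothetical data: N = 3 rows, L = 2 hash tables.
dataset = [(0, [5, 9]), (1, [5, 2]), (2, [7, 9])]
exploded = explode_hashes(dataset)

# Row count grows from N to L * N; each exploded row is narrower than the original.
print(len(dataset), len(exploded))
```

Each exploded row holds a single hash value instead of all L, which is one reason the per-row size differs before and after the explode.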

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread merlintang
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16965 @Yunni Thanks. The L I mentioned is the number of hash tables. This way, the memory usage would be O(L*N). The approximate NN search cost in one partition is O(L*N'), where N

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/16965 @merlintang Not exactly. Each row will explode to L rows, where L is the number of hash tables. Like the following: ``` ++-++ |

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread merlintang
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16965 @Yunni OK, if we want to move this along more quickly, we can keep the current AND-OR implementation. (2)(3) You mention that you explode the inner table (dataset). Does it mean that for each tuple of

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/16965 @merlintang (1) `hashDistance` is only used for multi-probe NN Search. The terms `numHashTables` and `numHashFunctions` are very hard to interpret in OR-AND cases. (2) For similarity join, we

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread merlintang
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16965 @Yunni Yes, we can use AND-OR to increase the probability by increasing numHashTables and numHashFunctions. For future user extensions, if users have a hash function with lower

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-23 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/16965 @merlintang Sorry I still don't quite get why we need to support OR-AND when the effective threshold is low. My understanding is that we can always tune numHashTables and numHashFunctions for AND-OR

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-23 Thread merlintang
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16965 @Yunni I agree with you that the current NN search and join use AND-OR. We can discuss how to use OR-AND for those two searches as well. For the OR-AND option, it is

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-23 Thread Yunni
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/16965 @merlintang We use AND-OR in both approxNearestNeighbor and approxSimilarityJoin, and it's more difficult for approxSimilarityJoin to adopt OR-AND than AND-OR. My understanding: for a (d1,
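
For readers following the AND-OR versus OR-AND discussion, here is a minimal sketch of the standard textbook amplification formulas. It is not code from the PR. It assumes independent hash functions with a per-function collision probability `p`, and the mapping of `num_hash_functions` and `num_hash_tables` onto the two constructions is an illustrative assumption.

```python
def and_or(p: float, num_hash_functions: int, num_hash_tables: int) -> float:
    """AND within a table (all functions must collide), then OR across tables
    (at least one table must collide): 1 - (1 - p^k)^L."""
    return 1.0 - (1.0 - p ** num_hash_functions) ** num_hash_tables


def or_and(p: float, num_hash_functions: int, num_hash_tables: int) -> float:
    """OR within a table (any function collides), then AND across tables
    (every table must collide): (1 - (1 - p)^k)^L."""
    return (1.0 - (1.0 - p) ** num_hash_functions) ** num_hash_tables


# Hypothetical values: p = 0.5, 2 hash functions per table, 2 tables.
print(and_or(0.5, 2, 2))
print(or_and(0.5, 2, 2))
```

Under these assumptions, raising the number of functions per table makes AND-OR stricter while raising the number of tables makes it more permissive, which is the tuning trade-off the thread refers to.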

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-22 Thread merlintang
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16965 It seems this patch provides the AND-OR amplification. Can we provide the option for users to choose the OR-AND amplification as well?

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-20 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16965 cc @sethah @jkbradley

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16965 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73016/ Test PASSed.

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16965 Merged build finished. Test PASSed.

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16965 **[Test build #73016 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73016/testReport)** for PR 16965 at commit

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16965 **[Test build #73016 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73016/testReport)** for PR 16965 at commit