Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
Looks like the rebase is making it even worse. I will reopen a PR.
---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16965
**[Test build #73538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73538/testReport)** for PR 16965 at commit
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/16965
Github isn't handling the merge well, so you might try rebasing
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16965
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16965
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73512/
Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16965
**[Test build #73512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73512/testReport)** for PR 16965 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16965
**[Test build #73512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73512/testReport)** for PR 16965 at commit
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
The number of rows would be O(L*N). The memory usage will be different, as the size of each row changes before and after the explode. Also, the Catalyst optimizer may apply projections during the join
Github user merlintang commented on the issue:
https://github.com/apache/spark/pull/16965
@Yunni thanks. The L I mention is the number of hash tables.
That way, the memory usage would be O(L*N), and the approximate NN search
cost in one partition is O(L*N'), where N
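The costs being discussed can be sketched in plain Python (a hypothetical toy index, not the Spark implementation): inserting each of the N rows into L bucket tables gives O(L*N) memory, and a query is then compared only against the rows that share one of its L buckets, which is the O(L*N') per-partition search cost.

```python
from collections import defaultdict

def build_index(rows, hash_fns):
    # One bucket map per hash table; each row is inserted L times,
    # so the index holds L * N entries -> O(L * N) memory.
    tables = [defaultdict(list) for _ in hash_fns]
    for row in rows:
        for table, h in zip(tables, hash_fns):
            table[h(row)].append(row)
    return tables

def candidates(query, tables, hash_fns):
    # Union of the L buckets the query falls into (OR across tables);
    # only these candidates are compared, not the full dataset.
    out = set()
    for table, h in zip(tables, hash_fns):
        out.update(table[h(query)])
    return out

# Toy example: 1-D points, "hash" = coarse rounding at two resolutions.
hash_fns = [lambda x: round(x), lambda x: round(x / 2)]
rows = [0.1, 0.4, 0.9, 3.7, 4.2]
tables = build_index(rows, hash_fns)
print(sorted(candidates(0.2, tables, hash_fns)))  # nearby points only
```

The query 0.2 is only compared against the three points sharing its buckets, never against 3.7 or 4.2.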
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
@merlintang Not exactly. Each row will explode to L rows, where L is the
number of hash tables. Like the following:
(example table truncated in the mail archive)
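The explode step described here can be sketched in plain Python (a hypothetical mirror of what Spark's posexplode produces in the LSH join, not the actual implementation): each input row carries an array of L hashes and expands into L (id, table_index, hash) rows, so N input rows become L * N rows.

```python
def explode_hashes(rows):
    # rows: list of (row_id, [hash_0, ..., hash_{L-1}])
    # Each row becomes L rows, one per hash table.
    return [(row_id, i, h)
            for row_id, hashes in rows
            for i, h in enumerate(hashes)]

rows = [("a", [3, 7]), ("b", [3, 9])]  # N = 2 rows, L = 2 hash tables
exploded = explode_hashes(rows)
print(exploded)  # 4 = L * N rows
# [('a', 0, 3), ('a', 1, 7), ('b', 0, 3), ('b', 1, 9)]
```

Joining the two exploded sides on (table_index, hash) then yields exactly the pairs that collide in at least one table.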
Github user merlintang commented on the issue:
https://github.com/apache/spark/pull/16965
@Yunni OK, if we want to move this quicker, we can keep the current AND-OR
implementation.
Regarding (2) and (3): you mention that you explode the inner table (dataset). Does that mean
for each tuple of
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
@merlintang
(1) `hashDistance` is only used for multi-probe NN search. The terms
`numHashTables` and `numHashFunctions` are very hard to interpret in OR-AND cases.
(2) For similarity join, we
Github user merlintang commented on the issue:
https://github.com/apache/spark/pull/16965
@Yunni Yes, we can use AND-OR to increase the probability by using
more numHashTables and numHashFunctions. As a further user extension, if
users have a hash function with lower
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
@merlintang Sorry I still don't quite get why we need to support OR-AND
when the effective threshold is low. My understanding is that we can always
tune numHashTables and numHashFunctions for AND-OR
Github user merlintang commented on the issue:
https://github.com/apache/spark/pull/16965
@Yunni I agree with you that the current NN search and join use
AND-OR. We can discuss how to use OR-AND for those two searches as well.
For the OR-AND option, it is
Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
@merlintang We use AND-OR in both approxNearestNeighbor and
approxSimilarityJoin, and it's more difficult for approxSimilarityJoin to adopt
OR-AND than AND-OR.
My understanding: for a (d1,
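For reference, the two amplification schemes under discussion can be written out in a short sketch (my own formulas for a generic LSH family, not code from this patch). Assuming a base family where two items collide under one hash function with probability p, with b = numHashTables and r = numHashFunctions:

```python
def and_or(p: float, r: int, b: int) -> float:
    # AND within a table: all r functions must match (p ** r);
    # OR across the b tables: collide if any table matches.
    return 1.0 - (1.0 - p ** r) ** b

def or_and(p: float, r: int, b: int) -> float:
    # OR across b functions first, then AND over r such groups.
    return (1.0 - (1.0 - p) ** b) ** r

# AND-OR sharpens the step function: near pairs (high p) keep
# colliding while far pairs (low p) are filtered out.
for p in (0.2, 0.8):
    print(p, and_or(p, r=3, b=5), or_and(p, r=3, b=5))
```

With r = 3, b = 5, a far pair with p = 0.2 collides under AND-OR only about 4% of the time, while a near pair with p = 0.8 still collides about 97% of the time.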
Github user merlintang commented on the issue:
https://github.com/apache/spark/pull/16965
It seems this patch provides AND-OR amplification. Can we provide an
option for users to choose OR-AND amplification as well?
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16965
cc @sethah @jkbradley
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16965
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73016/
Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16965
Merged build finished. Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16965
**[Test build #73016 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73016/testReport)** for PR 16965 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16965
**[Test build #73016 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73016/testReport)** for PR 16965 at commit