Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
Seems we can't reach an agreement on this topic, so I'd close this for now.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
> The order is different from the original one that is evaluated in the
join conditions.
I'm not sure which original order you meant. By pulling them out to `Project`,
they are evaluated by their
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18652
The order is different from the original one that is evaluated in the join
conditions.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18652
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81054/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18652
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18652
**[Test build #81054 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81054/testReport)**
for PR 18652 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
Join [t1.a = rand(t2.b), t1.c = rand(t2.d)]
  Sort
    Project [t1.a, t1.c]
      TableScan t1
  Sort
    Project [rand(t2.b) as rand(t2.b),
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18652
**[Test build #81054 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81054/testReport)**
for PR 18652 at commit
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18652
We could add a `Sort` above the `Project` and the orders become different,
right?
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
@cloud-fan @gatorsmile More thoughts or comments for this change? Thanks.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
When we join two tables whose equi-join keys are non-deterministic, for
example `t1.a = rand(t2.b)` and `t1.c = rand(t2.d)`, we pull them out into a
downstream `Project`:
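The described rewrite can be sketched in plain Python (illustrative only; the column names `rand_b`/`rand_d` and the data are made up here, not Spark internals):

```python
import random

rng = random.Random(42)  # fixed seed only so the sketch is reproducible

t1 = [{"a": 0.1, "c": 0.2}, {"a": 0.3, "c": 0.4}]
t2 = [{"b": 1, "d": 2}, {"b": 3, "d": 4}]

# Pull-out step: evaluate rand(t2.b) and rand(t2.d) exactly once per row,
# materializing them as ordinary columns (rand_b, rand_d) below the join.
t2_proj = [{"rand_b": rng.random(), "rand_d": rng.random()} for _ in t2]

# The join condition now only compares materialized, fixed values.
joined = [(l, r) for l in t1 for r in t2_proj
          if l["a"] == r["rand_b"] and l["c"] == r["rand_d"]]
```

After this step, the join itself no longer contains any non-deterministic expression.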
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18652
Did not get your point. Could you just give an example of why the
non-deterministic expressions are always evaluated in the same order no matter
which join types are chosen during the physical planning?
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
Once we pull them out into the downstream project, should we still worry about
call orders? They are evaluated before any sort or shuffle is added later.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18652
You are talking about the number of calls. I am worried about the call
order. We could add a `SORT`.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
> Why equi-join is free from the issues?
Assume the equi-join predicates are of the form `t1.a = rand(t2.b) &&
t1.c = rand(t2.d)`. When we compare the equi-join keys `(t1.a, t1.c)` and
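A toy illustration (plain Python, not Spark code) of why this is safe: once the keys are materialized by the `Project`, the number of `rand()` calls equals the number of input rows, no matter how many key comparisons the chosen join strategy performs afterwards:

```python
import random

calls = {"n": 0}
rng = random.Random(0)

def rand(x):
    # stand-in for Spark's rand(): counts how often it is invoked
    calls["n"] += 1
    return rng.random()

t2 = [1, 2, 3]
# Materialize the non-deterministic keys once per row, before the join.
keys = [rand(b) for b in t2]

# However many times a join strategy re-reads or compares these keys...
for _ in range(10):
    _ = [k == 0.5 for k in keys]

print(calls["n"])  # 3: one call per t2 row, independent of comparison count
```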
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18652
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18652
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80376/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18652
**[Test build #80376 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80376/testReport)**
for PR 18652 at commit
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18652
> As said in the previous discussion, we can't avoid a few issues regarding
non-deterministic non-equi-join conditions. We can simply allow it, but it faces
inconsistency due to different join
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18652
**[Test build #80376 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80376/testReport)**
for PR 18652 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
@gatorsmile @cloud-fan Do you have more comments or thoughts on this?
Thanks.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
@baibaichen when we do so, I think the result is not the same as Hive's join
result. Would it still be useful?
---
Github user baibaichen commented on the issue:
https://github.com/apache/spark/pull/18652
Can we add a flag, i.e. `ignore-non-deterministic`, so that we can treat
non-deterministic expressions as deterministic? I believe this is what Hive does.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
@gatorsmile Ok. No problem. Thanks.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18652
Let me talk with more people to get feedback. Will respond to you later.
Thanks!
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
@gatorsmile Actually it is not rare that we add a feature step by step in
Spark SQL. This is not a reason to prevent us from adding this support. I think
this change already helps much with this kind of
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/18652
I think the goal is just to resolve the migration issues for Hive users. If
we just provide very limited support, I do not think it can help the workload
migration.
If we really want
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
Yea, for the case with non-deterministic non-equi-join conditions, you'd
face the issue of changing the number of calls. So I currently plan not to
support it here.
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18652
Yea, I know that. I'm thinking about whether we need to change it by
considering the position.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
No, I don't think that's true. I think we don't consider the position of an
equi-join condition.
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18652
I mean, `t1.a = t2.b` before a non-deterministic condition is an equi-join
condition, but `t1.a = t2.b` after a non-deterministic condition is not. Is this
true?
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
`t1.a = t2.b` is an equi join condition. `t1.c > rand()` is not. They will
be split and considered individually.
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18652
Can we say that `t1.a = t2.b && t1.c > rand()` is an equi-join condition,
but `t1.c > rand() && t1.a = t2.b` is not?
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
Btw, I guess that is why we also pull out non-deterministic grouping
expressions for `Aggregate`?
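A rough analogy for the `Aggregate` case (illustrative Python; the bucketing expression and data are made up): computing the non-deterministic grouping key once per row in a prior projection guarantees that the key used for grouping and the key seen afterwards always agree:

```python
import random
from collections import defaultdict

rng = random.Random(7)
rows = [10, 20, 30, 40]

# Project step: evaluate the non-deterministic key exactly once per input row.
projected = [(x, int(rng.random() * 2)) for x in rows]  # key is 0 or 1

groups = defaultdict(list)
for x, key in projected:
    groups[key].append(x)  # the materialized key is reused, never re-evaluated

# Every input row lands in exactly one group.
print(sum(len(v) for v in groups.values()))  # 4
```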
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
If we simply allow it, the evaluation order of non-deterministic join
conditions will differ across join implementations, e.g. sort-based
and hash-based. Then we will get inconsistent
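The inconsistency can be sketched in plain Python (an assumed model, not actual Spark execution): two physically different plans draw from the same seeded random source but visit rows in different orders, so the same logical row can receive a different value:

```python
import random

def make_rand():
    rng = random.Random(42)          # same seed on both hypothetical executors
    return lambda x: rng.random()    # value depends on how many prior calls happened

rows = [1, 2, 3]
sort_based = make_rand()
hash_based = make_rand()

# A sort-based join may evaluate the condition over rows in sorted order,
# a hash-based join in probe order; modeled here as reversed order.
vals_sorted = {x: sort_based(x) for x in rows}
vals_hashed = {x: hash_based(x) for x in reversed(rows)}

print(vals_sorted[1] != vals_hashed[1])  # True: same row, different value
```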
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18652
What if we simply allow non-deterministic join conditions? Since we allow
non-deterministic filter conditions, should we do this for join conditions too?
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
ping @cloud-fan Do you have time to review this? Thanks.
---