Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
Sorry, I didn't express clearly. I think the evaluation order doesn't matter in the context of join implementation(sort or hash based). it should only refer to join key. Thanks Chang On Tue, Jul 18, 2017 at 7:57 AM, Liang-Chi Hsieh wrote: > > Evaluation order does matter. A

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
Evaluation order does matter. A non-deterministic expression can change its output due to internal state which may depend on input order. MonotonicallyIncreasingID is an example for the stateful expression. Once you change the row order, the evaluation results are different. Chang Chen wrote

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Xiao Li
When users call rand(seed) with a specific seed number, users expect the results should be deterministic no matter whether this is pushed down or not. rand(seed) is stateful. Thus, the order of predicates in the same join condition even matters. For example, in the same join condition, if the

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
I see. Actually, it isn't about evaluation order which user can't specify. It's about how many times we evaluate the non-deterministic expression for the same row. For example, given the SQL: SELECT a.col1 FROM tbl1 a LEFT OUTER JOIN tbl2 b ON CASE WHEN a.col2 IS NULL TNEN cast(rand(9)*1000 -

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
IIUC, the evaluation order of rows in Join can be different in different physical operators, e.g., Sort-based and Hash-based. But for non-deterministic expressions, different evaluation orders change results. Chang Chen wrote > I see the issue. I will try

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
I see the issue. I will try https://github.com/apache/spark/pull/18652, I think 1 For Join Operator, the left and right plan can't be non-deterministic. 2 If Filter can support non-deterministic, why not join condition? 3 We can't push down or project non-deterministic expression, since it may

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
I created a draft pull request for explaining the cases: https://github.com/apache/spark/pull/18652 Chang Chen wrote > Hi All > > I don't understand the difference between the semantics, I found Spark > does > the same thing for GroupBy non-deterministic. From Map-Reduce point of > view, Join

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread 蒋星博
FYI there have been a related discussion here: https://github.com/apache/spark/pull/15417#discussion_r85295977 2017-07-17 15:44 GMT+08:00 Chang Chen : > Hi All > > I don't understand the difference between the semantics, I found Spark > does the same thing for GroupBy

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
Hi All I don't understand the difference between the semantics, I found Spark does the same thing for GroupBy non-deterministic. From Map-Reduce point of view, Join is also GroupBy in essence . @Liang Chi Hsieh in which situation,

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
Thinking about it more, I think it changes the semantics only under certain scenarios. For the example SQL query shown in previous discussion, it looks the same semantics. Xiao Li wrote > If the join condition is non-deterministic, pushing it down to the > underlying project will change the

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-16 Thread Xiao Li
If the join condition is non-deterministic, pushing it down to the underlying project will change the semantics. Thus, we are unable to do it in PullOutNondeterministic. Users can do it manually if they do not care the semantics difference. Thanks, Xiao 2017-07-16 20:07 GMT-07:00 Chang Chen

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-16 Thread Chang Chen
It is tedious since we have lots of Hive SQL being migrated to Spark. And this workaround is equivalent to insert a Project between Join operator and its child. Why not do it in PullOutNondeterministic? Thanks Chang On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh wrote: >

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-14 Thread Liang-Chi Hsieh
A possible workaround is to add the rand column into tbl1 with a projection before the join. SELECT a.col1 FROM ( SELECT col1, CASE WHEN col2 IS NULL THEN cast(rand(9)*1000 - 99 as string) ELSE col2 END AS col2 FROM tbl1) a LEFT OUTER

Re:[SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread StanZhai
A workaround is diffcult. You should consider merging this PR into your Spark. "wangshuang [via Apache Spark Developers List]" wroted at 2017-07-13 18:43: I'm trying to execute hive sql on spark sql (Also

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread Herman van Hövell tot Westerflier
Just move the case expression into an underlying select clause. On Thu, Jul 13, 2017 at 3:10 PM, Chang Chen wrote: > Hi Wenchen > > Yes. We also find this error is caused by Rand. However, this is classic > way to solve data skew in Hive. Is there any equivalent way in

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread Chang Chen
Hi Wenchen Yes. We also find this error is caused by Rand. However, this is classic way to solve data skew in Hive. Is there any equivalent way in Spark? Thanks Chang On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan wrote: > It’s not about case when, but about rand().

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread Wenchen Fan
It’s not about case when, but about rand(). Non-deterministic expressions are not allowed in join condition. > On 13 Jul 2017, at 6:43 PM, wangshuang wrote: > > I'm trying to execute hive sql on spark sql (Also on spark thriftserver), For > optimizing data skew, we use "case