[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-07-16 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 @cloud-fan do you have any further comments about this? Thanks. --- - To unsubscribe, e-mail:

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-21 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 ok so I created https://github.com/apache/spark/pull/21605 for the fix proposed by @daniel-shields. I'd like to leave this open in order to go on with the discussion for a long-term better fix.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-21 Thread WenboZhao
Github user WenboZhao commented on the issue: https://github.com/apache/spark/pull/21449 I like the proposal by @daniel-shields. If we could get it fixed soon, we will be able to make it into the Spark 2.3.2 release.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-06 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 @daniel-shields do you want to open a PR for that? I'll leave this PR open as it is a more general fix so we can go on with the long-term discussion here in this PR. Do you agree with this

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-05 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21449 > In the short term we should make the behavior of EqualTo and EqualNullSafe identical. This seems pretty safe and reasonable to me

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-05 Thread daniel-shields
Github user daniel-shields commented on the issue: https://github.com/apache/spark/pull/21449 In the short term we should make the behavior of EqualTo and EqualNullSafe identical. We could do that by adding a case for EqualNullSafe that mirrors that of EqualTo.
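
The short-term proposal above can be illustrated with a small toy model. This is a sketch in plain Python, not Spark's code: the classes `Attr`, `EqualTo`, `EqualNullSafe` and the helpers `resolve` and `rewrite_self_join` are hypothetical stand-ins, and the model assumes that in a self-join the right side carries fresh column ids while the user's condition still points at the original id on both operands. The only point it makes is that both equality operators go through the same rewrite branch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    """Toy column reference: a name plus the id of the column it points at."""
    name: str
    expr_id: int


@dataclass(frozen=True)
class EqualTo:
    """Toy stand-in for a plain equality condition."""
    left: Attr
    right: Attr


@dataclass(frozen=True)
class EqualNullSafe:
    """Toy stand-in for a null-safe equality condition."""
    left: Attr
    right: Attr


def resolve(side_output, name):
    """Look a column name up in the output of one side of the join."""
    return side_output[name]


def rewrite_self_join(cond, left_output, right_output):
    """Re-resolve both operands by name when they are the same reference.

    EqualTo and EqualNullSafe share one branch, so they behave identically.
    Any other condition is returned unchanged.
    """
    same_ref = cond.left.expr_id == cond.right.expr_id
    if isinstance(cond, (EqualTo, EqualNullSafe)) and same_ref:
        return type(cond)(
            resolve(left_output, cond.left.name),
            resolve(right_output, cond.right.name),
        )
    return cond


# Assumed self-join setup: the condition uses id 1 on both operands,
# while the right side of the join exposes the column under id 2.
original = Attr("id", 1)
left_output = {"id": Attr("id", 1)}
right_output = {"id": Attr("id", 2)}

fixed_eq = rewrite_self_join(EqualTo(original, original), left_output, right_output)
fixed_ns = rewrite_self_join(EqualNullSafe(original, original), left_output, right_output)
```

In this model both `fixed_eq` and `fixed_ns` end up comparing the left column (id 1) with the right column (id 2), instead of comparing a column with itself.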

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-04 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 Sure, thanks for your time. PS `df.join(df, df("id") >= df("id"))` may be ambiguous, but in the example above `df1.join(df2, df2['id'].eqNullSafe(df1['id'])).collect()` where `df1` and

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-03 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21449 This will definitely not go into 2.3.1, so we have plenty of time. I'll think deeper into it after the spark summit. IMO `df.join(df, df("id") >= df("id"))` is ambiguous, especially when

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-03 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 I see what you mean. Honestly I have not thought of a full design for this problem (so I can't state what we should support and what not), but focusing on this specific case I think that:

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-02 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21449 My point is that we may have a different design if we want to solve this problem holistically, which may conflict with this patch. We should prove that this is in the right direction and future

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-06-01 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 Thanks for your comment @cloud-fan. I understand your point. That is quite a tricky problem, since we would probably also need to know the "DAG" of the dataframes in order to make the right decision.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91343/ Test PASSed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-31 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91343/testReport)** for PR 21449 at commit

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-31 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21449 This is a long-standing issue. I've seen many attempts to fix it (including my own), but none has succeeded. The major problem is that there is no clear definition of the expected behavior, i.e.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-31 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 yes @daniel-shields, you are right with your analysis. The problem was that we were sometimes using `==`, sometimes `semanticEquals`. And `equals` has the problem you mentioned. I think
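
The difference between structural equality and semantic equality mentioned above can be sketched with a toy model. This is plain Python, not Spark's implementation: the `Attr` class and its `semantic_equals` method are hypothetical, and the sketch assumes that semantic comparison looks only at the identity of the referenced column while structural comparison also looks at cosmetic details such as a qualifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attr:
    """Toy column reference with a cosmetic qualifier (for example a table alias)."""
    name: str
    expr_id: int
    qualifier: Optional[str] = None

    def semantic_equals(self, other: "Attr") -> bool:
        # Assumption of this sketch: two references are semantically equal
        # when they point at the same column id, whatever their qualifier.
        return self.expr_id == other.expr_id


plain = Attr("id", 1)
aliased = Attr("id", 1, qualifier="t")
other = Attr("id", 2)

# Dataclass `==` compares every field, so the qualifier makes these differ.
structural = plain == aliased

# The semantic check ignores the qualifier and sees the same column.
semantic = plain.semantic_equals(aliased)

# A reference to a different column id is different under both checks.
different = plain.semantic_equals(other)
```

Mixing the two checks in one code path means the same pair of references can be treated as equal in one place and as different in another, which is the kind of inconsistency the comment describes.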

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3732/

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-31 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91343/testReport)** for PR 21449 at commit

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread daniel-shields
Github user daniel-shields commented on the issue: https://github.com/apache/spark/pull/21449 @mgaido91 I looked at the test failures and I think the changes to the Dataset.resolve method are causing havoc. Consider the Dataset.drop method with the following signature: ` def

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test FAILed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91303/ Test FAILed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91303 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91303/testReport)** for PR 21449 at commit

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3704/

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91303/testReport)** for PR 21449 at commit

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91298/testReport)** for PR 21449 at commit

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91298/ Test FAILed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test FAILed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3702/

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91298/testReport)** for PR 21449 at commit

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-30 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 thanks @daniel-shields , you're right. I am working to check if and how this can be fixed. Thanks for your catch!

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-29 Thread daniel-shields
Github user daniel-shields commented on the issue: https://github.com/apache/spark/pull/21449 This case can also occur when the datasets are different but share a common lineage. Consider the following: `df = spark.range(10) df1 = df.groupby('id').count() df2 =

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-29 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 @daniel-shields in that case you have 2 different datasets `df1` and `df2`. So they are 2 distinct attributes and the check `a.sameRef(b)` would return false. This is applied only in case you have

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91253/ Test PASSed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91253 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91253/testReport)** for PR 21449 at commit

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-29 Thread daniel-shields
Github user daniel-shields commented on the issue: https://github.com/apache/spark/pull/21449 I'm not sure that this behavior should be applied to all binary comparisons. It could result in unexpected behavior in some rare cases. For example: `df1.join(df2, df2['x'] < df1['x'])`

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3666/

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed.

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91253 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91253/testReport)** for PR 21449 at commit