[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 @cloud-fan do you have any further comments about this? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 ok so I created https://github.com/apache/spark/pull/21605 for the fix proposed by @daniel-shields. I'd like to leave this open in order to go on with the discussion for a long-term better fix. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user WenboZhao commented on the issue: https://github.com/apache/spark/pull/21449 I like the proposal by @daniel-shields. If we could get it fixed soon, we will be able to catch up the Spark 2.3.2 release. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 @daniel-shields do you want to open a PR for that? I'll leave this PR open as it is a more general fix so we can go on with the long-term discussion here in this PR. Do you agree with this approach @cloud-fan ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21449 > In the short term we should make the behavior of EqualTo and EqualNullSafe identical. This seems pretty safe and reasonable to me --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user daniel-shields commented on the issue: https://github.com/apache/spark/pull/21449 In the short term we should make the behavior of EqualTo and EqualNullSafe identical. We could do that by adding a case for EqualNullSafe that mirrors that of EqualTo. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 Sure, thanks for your time. PS `df.join(df, df("id") >= df("id"))` may be ambiguous, but in the example above `df1.join(df2, df2['id'].eqNullSafe(df1['id'])).collect()` where `df1` and `df2` are created from the same dataframe it is not ambiguous IMHO. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21449 This will definitely not go into 2.3.1, so we have plenty of time. I'll think deeper into it after the spark summit. IMO `df.join(df, df("id") >= df("id"))` is ambiguous, especially when it's not an inner join. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 I see what you mean. Honestly I have not thought of a full design for this problem (so I can't state what we should support and what not), but focusing on this specific case I think that: - at the moment we do support self-joins (at least in the case `df.join(df, df("id") >= df("id"))`) so considering this invalid would cause a big behavior change (potentially causing user workflows to break). - even though we might consider acceptable such a change in a major release, I think that we should support with the Dataframe API what we support in the SQL API, and SQL standard supports self joins (using aliases for the tables). So I do believe we should support this use case. - the case presented by @daniel-shields in https://github.com/apache/spark/pull/21449#issuecomment-392947474, I think is a valid one without any doubt. As of now we are not supporting it, though. So I think that in the holistic approach we shouldn't change the current behavior/approach which is present now and will be (IMHO) improved by this patch. What I do think we have to discuss in order not to have to change it - once we want to solve the more generic issue - is the way to track the dataset an attribute is coming from. Here I decided to use the metadata, since I thought this is the cleanest approach. Another approach might be to introduce a new `Option` in the `AttributeReference` a reference to the dataset it is coming from. For the generic solution, this might have the advantage that having a reference to the provenance dataset, where we might want to store some kind of DAG of the datasets this one is coming from in order to take more complex decision about the validity of the syntax and/or the resolution of the attribute. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21449 My point is that, we may have a different design if we wanna solve this problem holistically, which may conflict with this patch. We should prove that this is in the right direction and future fix will not conflict with it, or come out with the final fix directly. An example is, we may want to treat `df.join(df, df("id") >= df("id"))` as invalid in the final design. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 Thanks for your comment @cloud-fan. I understand your point. That is quite a tricky problem, since we should know probably also the "DAG" of the dataframes in order to take the right decision. But despite this change is related to that problem, I think it is different and with a much smaller scope. Indeed, while we can use the metadata information in many places, actually in this patch is is used only in the self-join case when there is ambiguity in which column to take. The behavior in any other case in unchanged. So after this patch, the situation in resolving column using `col` is unchanged. The only places where the dataset of provenance is checked is in self joins. The goal here is only to support cases which were throwing exceptions in resolving the right column. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91343/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91343/testReport)** for PR 21449 at commit [`8e6e5c0`](https://github.com/apache/spark/commit/8e6e5c0059574c1e171e589fcf533c6b5669499f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21449 This is a long-standing issue, I've seen many attempts to fix it (including myself) but no one success. The major problem is, there is no clear definition of the expected behavior, i.e. what's the semantic of `Dataset.col`? some examples ``` df.select(df.col("i")) // valid val df1 = df.filter(...) df1.select(df.col("i")) // still valid df.join(otherDF, df.col("i") === otherDF.col("i")) // valid df.join(otherDF).select(df.col("i"), otherDF("i")) // valid val df2 = df.select(df.col("i") + 1) df2.select(df.col("i")) // invalid ``` Sometime we can use an ancestor's column in a new Dataset but sometimes we can't. We should make the condition clear first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 yes @daniel-shields, you are right with your analysis. The problem was that we were sometimes using `==`, sometimes `semanticEquals`. And `equals` has the problem you mentioned. I think this is the only way for addressing the problem described here is to reference which dataset the column is coming from. I think adding a metadata for it is the cleanest way. We may also add a new attribute to the `Attribute` class instead of using metadata, but honestly this way seemed cleaner to me. What do you think? Do you have other suggestions? cc @cloud-fan @hvanhovell @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3732/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91343/testReport)** for PR 21449 at commit [`8e6e5c0`](https://github.com/apache/spark/commit/8e6e5c0059574c1e171e589fcf533c6b5669499f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user daniel-shields commented on the issue: https://github.com/apache/spark/pull/21449 @mgaido91 I looked at the test failures and I think the changes to the Dataset,resolve method are causing havoc. Consider the Dataset.drop method with the following signature: ` def drop(col: Column): DataFrame` There is a statement that may be comparing an AttributeReference with the new metadata to one without it: ``` val colsAfterDrop = attrs.filter { attr => attr != expression }.map(attr => Column(attr))``` This may be resulting in columns not getting dropped. I haven't verified but this is the first thing I would check. The change to resolve may be too drastic. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91303/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91303 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91303/testReport)** for PR 21449 at commit [`b8d5057`](https://github.com/apache/spark/commit/b8d50570b7b172ef310fdfb12b01be1598ff5481). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3704/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91303/testReport)** for PR 21449 at commit [`b8d5057`](https://github.com/apache/spark/commit/b8d50570b7b172ef310fdfb12b01be1598ff5481). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91298/testReport)** for PR 21449 at commit [`e8a5fa3`](https://github.com/apache/spark/commit/e8a5fa33d56187a6e30e81ba9439cd097fff5b2c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91298/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3702/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91298/testReport)** for PR 21449 at commit [`e8a5fa3`](https://github.com/apache/spark/commit/e8a5fa33d56187a6e30e81ba9439cd097fff5b2c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 thanks @daniel-shields , you're right. I am working to check if and how this can be fixed. Thanks for your catch! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user daniel-shields commented on the issue: https://github.com/apache/spark/pull/21449 This case can also occur when the datasets are different but share a common lineage. Consider the following: `df = spark.range(10) df1 = df.groupby('id').count() df2 = df.groupby('id').sum('id') df1.join(df2, df2['id'].eqNullSafe(df1['id'])).collect()` This currently fails with eqNullSafe, but works with ==. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 @daniel-shields in that case you have 2 different datasets `df1` and `df2`. So they are 2 distinct attributes and the check `a.sameRef(b)` would return false. This is applied only in case you have self-joins, ie. you have the same dataset on both sides. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91253/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91253 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91253/testReport)** for PR 21449 at commit [`92cb513`](https://github.com/apache/spark/commit/92cb513416c5dd0e9fa690c25cfae0565471a5e1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user daniel-shields commented on the issue: https://github.com/apache/spark/pull/21449 I'm not sure that this behavior should be applied to all binary comparisons. It could result in unexpected behavior in some rare cases. For example: `df1.join(df2, df2['x'] < df1['x'])` If 'x' is ambiguous, this would result in the conditional being flipped. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3666/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21449 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21449 **[Test build #91253 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91253/testReport)** for PR 21449 at commit [`92cb513`](https://github.com/apache/spark/commit/92cb513416c5dd0e9fa690c25cfae0565471a5e1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org