[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2019-08-19 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910691#comment-16910691
 ] 

Nicholas Chammas edited comment on SPARK-25150 at 8/19/19 7:39 PM:
---

I haven't been able to boil down the reproduction further, but I'm updating 
this issue to confirm that it is still present as of Spark 2.4.3 and that, 
particularly when cross joins are enabled, it appears to be a correctness 
issue.

My original attachments still capture the problem. These are the inputs:
 * [^persons.csv]
 * [^states.csv]
 * [^zombie-analysis.py]

And here are the outputs:
 * [^expected-output.txt]
 * [^output-without-implicit-cross-join.txt]
 * [^output-with-implicit-cross-join.txt]
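
For context, the shape of the reproduction in [^zombie-analysis.py] is roughly the following (a simplified sketch, not the attached script itself; the column names and aggregations here are illustrative assumptions):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

states = spark.read.csv("states.csv", header=True)    # DataFrame A
persons = spark.read.csv("persons.csv", header=True)  # DataFrame B

# B1 and B2 are both derived from the same source DataFrame, B.
b1 = persons.groupBy("state").agg(F.count("*").alias("num_rows"))
b2 = persons.groupBy("state").agg(F.countDistinct("name").alias("num_names"))

# Both joins spell out their conditions explicitly, yet Spark 2.3.x/2.4.x
# raises "Join condition is missing or trivial" here (or, with
# spark.sql.crossJoin.enabled=true, appears to return incorrect rows).
result = (states
          .join(b1, states["state"] == b1["state"])
          .join(b2, states["state"] == b2["state"]))
result.show()
{code}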


was (Author: nchammas):
I haven't been able to boil down the reproduction further, but I'm updating 
this issue to confirm that it is still present as of Spark 2.4.3 and that, 
particularly when cross joins are enabled, it appears to be a correctness 
issue.

My original attachments still capture the problem. These are the inputs:
 * !persons.csv|width=7,height=7,align=absmiddle!
 * !states.csv|width=7,height=7,align=absmiddle!
 * [^zombie-analysis.py] !zombie-analysis.py|width=7,height=7,align=absmiddle!

And here are the outputs:
 * [^expected-output.txt]
 * [^output-without-implicit-cross-join.txt]
 * [^output-with-implicit-cross-join.txt]

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-09-28 Thread Peter Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632274#comment-16632274
 ] 

Peter Toth edited comment on SPARK-25150 at 9/28/18 7:28 PM:
-

[~nchammas], sorry for the late reply.

There is only one issue here. Please see zombie-analysis.py: it contains 2 
joins, and both define the join condition explicitly, so setting 
spark.sql.crossJoin.enabled=true should not have any effect.

The root cause of the error you see when spark.sql.crossJoin.enabled=false 
(the default) and of the incorrect results when 
spark.sql.crossJoin.enabled=true is the same: the join condition is handled 
incorrectly.

Please see my PR's description for further details: 
[https://github.com/apache/spark/pull/22318]

 


was (Author: petertoth):
[~nchammas], sorry for the late reply.

There is only one issue here. Please see zombie-analysis.py: it contains 2 
joins, and both define the join condition explicitly, so setting 
spark.sql.crossJoin.enabled=true should not have any effect.

Simply put, the SQL statement should not fail; please see my PR's 
description for further details: 
[https://github.com/apache/spark/pull/22318]

 




[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-31 Thread Evelyn Bayes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16598275#comment-16598275
 ] 

Evelyn Bayes edited comment on SPARK-25150 at 8/31/18 6:06 AM:
---

I'd love the chance to patch this bug.

I've included a simplified version of the Python script that reproduces it; 
if you switch out the second join for the commented join, it works as it 
should.

What's happening is that during the creation of the logical plan, Spark 
re-aliases the right side of the join because the left and right sides refer 
to the same base column. When it does this, it renames all the columns on the 
right side of the join to the new alias, except the column that is actually 
part of the join condition.

Then, because the join condition refers to the column which hasn't been 
updated, the condition now refers to the left side of the join. So Spark does 
a cartesian join of the left side with itself and straps the right side of 
the join on the end.

The part of the code doing the renaming is:
[https://github.com/apache/spark/blob/v2.3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala]
It's using ResolveReferences.dedupRight, which, as the name suggests, just 
de-duplicates the right-side references from the left side (this might be a 
naive understanding of it).

If you just alias one of these columns, it's fine. But that really shouldn't 
be required for the logical plan to be accurate.
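
The aliasing workaround mentioned above would look something like this (a hedged sketch with illustrative names, not the exact attached script; .alias() gives the derived DataFrame its own qualifier, so the analyzer has nothing ambiguous left to rename):
{code:python}
from pyspark.sql import functions as F

# Derived DataFrame sharing lineage (and the "state" column) with its source.
agg = persons.groupBy("state").agg(F.count("*").alias("cnt"))

# Problematic in 2.3.x: the analyzer re-aliases the right side but misses
# the join-key column, so the condition degenerates into a trivial
# self-comparison on the left side.
bad = states.join(agg, states["state"] == agg["state"])

# Workaround: explicitly alias one side and refer to the key via the alias.
a = agg.alias("a")
good = states.join(a, states["state"] == F.col("a.state"))
{code}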

 

 


was (Author: eeveeb):
I'd love the chance to patch this bug.

I've included a simplified version of the Python script that reproduces it; 
if you switch out the second join for the commented join, it works as it 
should. !zombie-analysis.py|width=7,height=7,align=absmiddle!

What's happening is that Spark re-aliases the right side of the join because 
the left and right sides refer to the same base column. When it does this, it 
renames all the columns on the right side of the join to the new alias, 
except the column that is actually part of the join condition.

Then, because the join condition refers to the column which hasn't been 
updated, the condition now refers to the left side of the join. So Spark does 
a cartesian join of the left side with itself and straps the right side of 
the join on the end.

The part of the code doing the renaming is:
[https://github.com/apache/spark/blob/v2.3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala]
It's using ResolveReferences.dedupRight, which, as the name suggests, just 
de-duplicates the right-side references from the left side (this might be a 
naive understanding of it).

If you just alias one of these columns, it's fine. But that really shouldn't 
be required for the logical plan to be accurate.

 

 




[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-31 Thread Evelyn Bayes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16598284#comment-16598284
 ] 

Evelyn Bayes edited comment on SPARK-25150 at 8/31/18 6:00 AM:
---

Sorry, my attachment doesn't want to stick. Feel free to ask me to email it, 
or to explain to me how attaching works here. Sorry!

 


was (Author: eeveeb):
Sorry, my attachment doesn't want to stick. I'll give it another try.

 

[^zombie-analysis.py]




[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-30 Thread Evelyn Bayes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16598275#comment-16598275
 ] 

Evelyn Bayes edited comment on SPARK-25150 at 8/31/18 5:55 AM:
---

I'd love the chance to patch this bug.

I've included a simplified version of the Python script that reproduces it; 
if you switch out the second join for the commented join, it works as it 
should. !zombie-analysis.py|width=7,height=7,align=absmiddle!

What's happening is that Spark re-aliases the right side of the join because 
the left and right sides refer to the same base column. When it does this, it 
renames all the columns on the right side of the join to the new alias, 
except the column that is actually part of the join condition.

Then, because the join condition refers to the column which hasn't been 
updated, the condition now refers to the left side of the join. So Spark does 
a cartesian join of the left side with itself and straps the right side of 
the join on the end.

The part of the code doing the renaming is:
[https://github.com/apache/spark/blob/v2.3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala]
It's using ResolveReferences.dedupRight, which, as the name suggests, just 
de-duplicates the right-side references from the left side (this might be a 
naive understanding of it).

If you just alias one of these columns, it's fine. But that really shouldn't 
be required for the logical plan to be accurate.

 

 


was (Author: eeveeb):
I'd love the chance to patch this bug.

I've included a simplified version of the Python script that reproduces it; 
if you switch out the second join for the commented join, it works as it 
should.

 

 




[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584239#comment-16584239
 ] 

Nicholas Chammas edited comment on SPARK-25150 at 8/17/18 6:15 PM:
---

I know there are a bunch of pending bug fixes in 2.3.2. I'm not sure if this 
is covered by any of them, and I didn't have time to set up 2.3.2 to see if 
this problem is still present there. I will be away for some time and thought 
it best to report this now, in case someone can pick it up and investigate 
further until I get back.

cc [~marmbrus].


was (Author: nchammas):
I know there are a bunch of pending bug fixes in 2.3.2. I'm not sure if this 
is covered by any of them, and I didn't have time to set up 2.3.2 to see if 
this problem is still present there.

cc [~marmbrus].
