[ https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735944#comment-17735944 ]
Bruce Robbins edited comment on SPARK-44132 at 6/22/23 1:51 AM:
----------------------------------------------------------------
You may have this figured out already, but in case not, here's a clue. You can replicate the NPE in {{spark-shell}} as follows:
{noformat}
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), "full_outer")
joined.collectAsList
{noformat}
I think it's because the join column sequence {{idSeq}} (in your unit test) is provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a {{Stream}}:
{noformat}
scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter(
     |   Collections.singletonList("id")
     | ).asScala.toSeq
res2: Seq[String] = Stream(id, ?)
{noformat}
This seems to be a bug in the handling of the join columns, but only in the case where they are provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, SPARK-38221, SPARK-26680).
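The Stream behavior described above can be observed without Spark at all. The following sketch (a minimal illustration, assuming Scala 2.12, where {{toSeq}} on a converted Java collection delegates to {{toStream}}) shows the lazy sequence that ends up being passed as the join columns, and how {{.toList}} forces it into a strict {{Seq}}:

```scala
import java.util.Collections
import scala.collection.JavaConverters._

object StreamSeqDemo {
  def main(args: Array[String]): Unit = {
    val javaCols = Collections.singletonList("id")

    // In Scala 2.12, .asScala.toSeq on a Java collection yields a lazy Stream,
    // exactly as shown in the spark-shell transcript above.
    val lazySeq: Seq[String] = javaCols.asScala.toSeq
    println(lazySeq.isInstanceOf[Stream[_]]) // true on Scala 2.12

    // Forcing the sequence with .toList produces a strict List instead,
    // which avoids handing a Stream to downstream consumers.
    val strictSeq: Seq[String] = javaCols.asScala.toList
    println(strictSeq) // List(id)
  }
}
```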
> nesting full outer joins confuses code generator
> ------------------------------------------------
>
>                 Key: SPARK-44132
>                 URL: https://issues.apache.org/jira/browse/SPARK-44132
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0, 3.4.0, 3.5.0
>         Environment: We verified the existence of this bug from Spark 3.3 through Spark 3.5.
>            Reporter: Steven Aerts
>            Priority: Major
>
> We are seeing issues with the code generator when querying Java-bean-encoded data with two nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer");
> {code}
> This generates invalid code in the code generator and, depending on the data used, can produce stack traces like:
> {code:java}
> Caused by: java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
> Caused by: java.lang.AssertionError: index (2) should < 2
>     at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>     at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code, we see that the code generator seems to be mixing up parameters. For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {                            // <==== null check on the wrong (left) parameter
>     boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); // <==== causes the NPE on the right parameter here
> {code}
> It is as if the nesting of two full outer joins is confusing the code generator, causing it to emit invalid code.
> There is one other strange thing. We found this issue when using datasets encoded with the Java bean encoder. We tried to reproduce it in the spark-shell and with Scala case classes, but were unable to do so.
> We made a reproduction scenario as unit tests (one for each of the stack traces above) on the Spark code base and made it available as a [pull request|https://github.com/apache/spark/pull/41688] on this case.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
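If the Stream diagnosis in the comment above holds, a possible mitigation (a sketch only, not a confirmed fix; the column names {{b}} and {{c}} and the {{local[*]}} session are illustrative choices, not taken from the report) is to force the join columns into a strict {{List}} before passing them to {{join}}, so the lazy {{Stream}} produced by {{JavaConverters}}' {{toSeq}} never reaches the planner:

```scala
import java.util.Collections

import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object NestedFullOuterJoinWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-44132-workaround-sketch")
      .getOrCreate()
    import spark.implicits._

    val dsA = Seq((1, 1)).toDF("id", "a")
    val dsB = Seq((2, 2)).toDF("id", "b")
    val dsC = Seq((3, 3)).toDF("id", "c")

    // .toList forces the lazy Stream that .asScala.toSeq would produce
    // into a strict List, sidestepping the Stream-specific code path.
    val idSeq: Seq[String] = Collections.singletonList("id").asScala.toList

    val joined = dsA.join(dsB, idSeq, "full_outer").join(dsC, idSeq, "full_outer")
    joined.show()
    spark.stop()
  }
}
```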