[ https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735944#comment-17735944 ]
Bruce Robbins edited comment on SPARK-44132 at 6/22/23 1:51 AM:
----------------------------------------------------------------
You may have this figured out already, but in case not, here's a clue. You can replicate the NPE in {{spark-shell}} as follows:
{noformat}
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), "full_outer")
joined.collectAsList
{noformat}
I think it's because the join column sequence {{idSeq}} (in your unit test) is provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a {{Stream}}:
{noformat}
scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter(
     |   Collections.singletonList("id")
     | ).asScala.toSeq
res2: Seq[String] = Stream(id, ?)
{noformat}
This seems to be a bug in the handling of the join columns, but only in the case where they are provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, SPARK-38221, SPARK-26680).
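The Stream behavior described above can be observed without Spark at all. The following sketch (a minimal illustration, assuming Scala 2.12, where {{toSeq}} on a converted Java collection delegates to {{toStream}}) shows the lazy sequence that ends up being passed as the join columns, and how {{.toList}} forces it into a strict {{Seq}}:

```scala
import java.util.Collections
import scala.collection.JavaConverters._

object StreamSeqDemo {
  def main(args: Array[String]): Unit = {
    val javaCols = Collections.singletonList("id")

    // In Scala 2.12, .asScala.toSeq on a Java collection yields a lazy Stream,
    // exactly as shown in the spark-shell transcript above.
    val lazySeq: Seq[String] = javaCols.asScala.toSeq
    println(lazySeq.isInstanceOf[Stream[_]]) // true on Scala 2.12

    // Forcing the sequence with .toList produces a strict List instead,
    // which avoids handing a Stream to downstream consumers.
    val strictSeq: Seq[String] = javaCols.asScala.toList
    println(strictSeq) // List(id)
  }
}
```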
> nesting full outer joins confuses code generator
> ------------------------------------------------
>
>                 Key: SPARK-44132
>                 URL: https://issues.apache.org/jira/browse/SPARK-44132
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0, 3.4.0, 3.5.0
>         Environment: We verified the existence of this bug from Spark 3.3 through Spark 3.5.
>            Reporter: Steven Aerts
>            Priority: Major
>
> We are seeing issues with the code generator when querying Java-bean-encoded data with two nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer");
> {code}
> This generates invalid code in the code generator and, depending on the data used, can produce stack traces like:
> {code:java}
> Caused by: java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
> Caused by: java.lang.AssertionError: index (2) should < 2
>     at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>     at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code, we see that the code generator seems to be mixing up parameters. For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {                            // <==== null check on the wrong (left) parameter
>     boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); // <==== causes the NPE on the right parameter here
> {code}
> It is as if the nesting of two full outer joins is confusing the code generator, causing it to emit invalid code.
> There is one other strange thing. We found this issue when using datasets encoded with the Java bean encoder. We tried to reproduce it in the spark-shell and with Scala case classes, but were unable to do so.
> We made a reproduction scenario as unit tests (one for each of the stack traces above) on the Spark code base and made it available as a [pull request|https://github.com/apache/spark/pull/41688] on this case.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
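If the Stream diagnosis in the comment above holds, a possible mitigation (a sketch only, not a confirmed fix; the column names {{b}} and {{c}} and the {{local[*]}} session are illustrative choices, not taken from the report) is to force the join columns into a strict {{List}} before passing them to {{join}}, so the lazy {{Stream}} produced by {{JavaConverters}}' {{toSeq}} never reaches the planner:

```scala
import java.util.Collections

import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object NestedFullOuterJoinWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-44132-workaround-sketch")
      .getOrCreate()
    import spark.implicits._

    val dsA = Seq((1, 1)).toDF("id", "a")
    val dsB = Seq((2, 2)).toDF("id", "b")
    val dsC = Seq((3, 3)).toDF("id", "c")

    // .toList forces the lazy Stream that .asScala.toSeq would produce
    // into a strict List, sidestepping the Stream-specific code path.
    val idSeq: Seq[String] = Collections.singletonList("id").asScala.toList

    val joined = dsA.join(dsB, idSeq, "full_outer").join(dsC, idSeq, "full_outer")
    joined.show()
    spark.stop()
  }
}
```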