[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce Robbins updated SPARK-45580:
----------------------------------
    Description: 
A query can have an incorrect output schema because of a subquery.

Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1 false
2 false
3 true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When executed via the {{Dataset}} API, you don't see the incorrect result, because the Dataset API truncates the right-side of the rows based on the analyzed plan's schema (it's the optimized plan's schema that goes wrong).

However, even with the {{Dataset}} API, this query goes wrong:
{noformat}
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
  at scala.Predef$.assert(Predef.scala:279)
  at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
  at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
  ...
{noformat}
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}

  was:
A query can have an incorrect output schema because of a subquery.

Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1 false
2 false
3 true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When executed via the {{Dataset}} API, you don't see this result, because the Dataset API truncates the right-side of the rows based on the analyzed plan's schema (it's the optimized plan's schema that goes wrong).

However, even with the {{Dataset}} API, this query goes wrong:
{noformat}
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
  at scala.Predef$.assert(Predef.scala:279)
  at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
  at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
  ...
{noformat}
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}


> RewritePredicateSubquery unexpectedly changes the output schema of certain queries
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-45580
>                 URL: https://issues.apache.org/jira/browse/SPARK-45580
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.3, 3.4.1, 3.5.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query has a superfluous boolean column:
> {noformat}
> select *
> from t1
> where exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
>
> 1 false
> 2 false
> 3 true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When executed via the {{Dataset}} API, you don't see the incorrect result, because the Dataset API truncates the right-side of the rows based on the analyzed plan's schema (it's the optimized plan's schema that goes wrong).
> However, even with the {{Dataset}} API, this query goes wrong:
> {noformat}
> select (
>   select *
>   from t1
>   where exists (
>     select c1
>     from t2
>     where a = c1
>     or a in (select col1 from t3)
>   )
>   limit 1
> )
> from range(1);
>
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
>   at scala.Predef$.assert(Predef.scala:279)
>   at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
>   ...
> {noformat}
> Other queries that have the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>   select c1
>   from t2
>   where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> {noformat}
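
The description says the analyzed plan's schema is correct and the optimized plan's schema is the one that goes wrong. One way to look at both for the first query is Spark SQL's EXPLAIN EXTENDED, which prints the parsed, analyzed, and optimized logical plans along with the physical plan. This is a minimal sketch, assuming the temp views t1, t2, and t3 from the report already exist and are cached in the session. The plan text is not reproduced here because it was not part of the report.

```sql
-- Sketch only: assumes t1, t2 and t3 were created and cached as in the report.
-- EXPLAIN EXTENDED prints the analyzed and the optimized logical plans,
-- so the output columns of the two can be compared side by side.
explain extended
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
```

Going by the description, the analyzed plan's output should list only column a, so on an affected version any additional boolean output attribute would be expected to show up in the optimized plan section.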