[jira] [Updated] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries

Bruce Robbins (Jira) Tue, 17 Oct 2023 14:28:04 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bruce Robbins updated SPARK-45580:
----------------------------------
    Description: 
A query can have an incorrect output schema because of a subquery.

Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1       false
2       false
3       true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When executed via the {{Dataset}} API, you don't see the incorrect result, 
because the Dataset API truncates the right-side of the rows based on the 
analyzed plan's schema (it's the optimized plan's schema that goes wrong).

However, even with the {{Dataset}} API, this query goes wrong:
{noformat}
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
        at scala.Predef$.assert(Predef.scala:279)
        at 
org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
        at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
        at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
        at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
        at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
        ...
{noformat}
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}


  was:
A query can have an incorrect output schema because of a subquery.

Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1       false
2       false
3       true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When executed via the {{Dataset}} API, you don't see this result, because the 
Dataset API truncates the right-side of the rows based on the analyzed plan's 
schema (it's the optimized plan's schema that goes wrong).

However, even with the {{Dataset}} API, this query goes wrong:
{noformat}
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
        at scala.Predef$.assert(Predef.scala:279)
        at 
org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
        at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
        at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
        at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
        at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
        ...
{noformat}
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}



> RewritePredicateSubquery unexpectedly changes the output schema of certain 
> queries
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-45580
>                 URL: https://issues.apache.org/jira/browse/SPARK-45580
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.3, 3.4.1, 3.5.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query has a superfluous boolean 
> column:
> {noformat}
> select *
> from t1
> where exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> 1     false
> 2     false
> 3     true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When executed via the {{Dataset}} API, you don't see the incorrect result, 
> because the Dataset API truncates the right-side of the rows based on the 
> analyzed plan's schema (it's the optimized plan's schema that goes wrong).
> However, even with the {{Dataset}} API, this query goes wrong:
> {noformat}
> select (
>   select *
>   from t1
>   where exists (
>     select c1
>     from t2
>     where a = c1
>     or a in (select col1 from t3)
>   )
>   limit 1
> )
> from range(1);
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
> something went wrong in analysis
>       at scala.Predef$.assert(Predef.scala:279)
>       at 
> org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
>       at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
>       at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
>       at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
>       at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
>       at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
>         ...
> {noformat}
> Other queries that have the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>   select c1
>   from t2
>   where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries

Reply via email to