[jira] [Updated] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition

2024-03-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-47633:
--
Affects Version/s: 3.4.2

> Cache miss for queries using JOIN LATERAL with join condition
> -
>
> Key: SPARK-47633
> URL: https://issues.apache.org/jira/browse/SPARK-47633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 4.0.0, 3.5.1
>Reporter: Bruce Robbins
>Priority: Major
>
> For example:
> {noformat}
> CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
> CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);
> create or replace temp view v1 as
> select *
> from t1
> join lateral (
>   select c1 as a, c2 as b
>   from t2)
> on c1 = a;
> cache table v1;
> explain select * from v1;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false
>:- LocalTableScan [c1#180, c2#181]
>+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> false] as bigint)),false), [plan_id=113]
>   +- LocalTableScan [a#173, b#174]
> {noformat}
> Note that there is no {{InMemoryRelation}}.
> However, if you move the join condition into the subquery, the cached plan is 
> used:
> {noformat}
> CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
> CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);
> create or replace temp view v2 as
> select *
> from t1
> join lateral (
>   select c1 as a, c2 as b
>   from t2
>   where t1.c1 = t2.c1);
> cache table v2;
> explain select * from v2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Scan In-memory table v2 [c1#176, c2#177, a#178, b#179]
>   +- InMemoryRelation [c1#176, c2#177, a#178, b#179], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- AdaptiveSparkPlan isFinalPlan=true
>+- == Final Plan ==
>   *(1) Project [c1#26, c2#27, a#19, b#20]
>   +- *(1) BroadcastHashJoin [c1#26], [c1#30], Inner, 
> BuildLeft, false
>  :- BroadcastQueryStage 0
>  :  +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as 
> bigint)),false), [plan_id=37]
>  : +- LocalTableScan [c1#26, c2#27]
>  +- *(1) LocalTableScan [a#19, b#20, c1#30]
>+- == Initial Plan ==
>   Project [c1#26, c2#27, a#19, b#20]
>   +- BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, 
> false
>  :- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as 
> bigint)),false), [plan_id=37]
>  :  +- LocalTableScan [c1#26, c2#27]
>  +- LocalTableScan [a#19, b#20, c1#30]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition

2024-03-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-47633:
--
Affects Version/s: 3.5.1

> Cache miss for queries using JOIN LATERAL with join condition
> -
>
> Key: SPARK-47633
> URL: https://issues.apache.org/jira/browse/SPARK-47633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Bruce Robbins
>Priority: Major
>
> For example:
> {noformat}
> CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
> CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);
> create or replace temp view v1 as
> select *
> from t1
> join lateral (
>   select c1 as a, c2 as b
>   from t2)
> on c1 = a;
> cache table v1;
> explain select * from v1;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false
>:- LocalTableScan [c1#180, c2#181]
>+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> false] as bigint)),false), [plan_id=113]
>   +- LocalTableScan [a#173, b#174]
> {noformat}
> Note that there is no {{InMemoryRelation}}.
> However, if you move the join condition into the subquery, the cached plan is 
> used:
> {noformat}
> CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
> CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);
> create or replace temp view v2 as
> select *
> from t1
> join lateral (
>   select c1 as a, c2 as b
>   from t2
>   where t1.c1 = t2.c1);
> cache table v2;
> explain select * from v2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Scan In-memory table v2 [c1#176, c2#177, a#178, b#179]
>   +- InMemoryRelation [c1#176, c2#177, a#178, b#179], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- AdaptiveSparkPlan isFinalPlan=true
>+- == Final Plan ==
>   *(1) Project [c1#26, c2#27, a#19, b#20]
>   +- *(1) BroadcastHashJoin [c1#26], [c1#30], Inner, 
> BuildLeft, false
>  :- BroadcastQueryStage 0
>  :  +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as 
> bigint)),false), [plan_id=37]
>  : +- LocalTableScan [c1#26, c2#27]
>  +- *(1) LocalTableScan [a#19, b#20, c1#30]
>+- == Initial Plan ==
>   Project [c1#26, c2#27, a#19, b#20]
>   +- BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, 
> false
>  :- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as 
> bigint)),false), [plan_id=37]
>  :  +- LocalTableScan [c1#26, c2#27]
>  +- LocalTableScan [a#19, b#20, c1#30]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition

2024-03-28 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-47633:
-

 Summary: Cache miss for queries using JOIN LATERAL with join 
condition
 Key: SPARK-47633
 URL: https://issues.apache.org/jira/browse/SPARK-47633
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Bruce Robbins


For example:
{noformat}
CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);

create or replace temp view v1 as
select *
from t1
join lateral (
  select c1 as a, c2 as b
  from t2)
on c1 = a;

cache table v1;

explain select * from v1;
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false
   :- LocalTableScan [c1#180, c2#181]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
false] as bigint)),false), [plan_id=113]
  +- LocalTableScan [a#173, b#174]
{noformat}
Note that there is no {{InMemoryRelation}}.

However, if you move the join condition into the subquery, the cached plan is 
used:
{noformat}
CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);

create or replace temp view v2 as
select *
from t1
join lateral (
  select c1 as a, c2 as b
  from t2
  where t1.c1 = t2.c1);

cache table v2;

explain select * from v2;
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Scan In-memory table v2 [c1#176, c2#177, a#178, b#179]
  +- InMemoryRelation [c1#176, c2#177, a#178, b#179], StorageLevel(disk, 
memory, deserialized, 1 replicas)
+- AdaptiveSparkPlan isFinalPlan=true
   +- == Final Plan ==
  *(1) Project [c1#26, c2#27, a#19, b#20]
  +- *(1) BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, 
false
 :- BroadcastQueryStage 0
 :  +- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), 
[plan_id=37]
 : +- LocalTableScan [c1#26, c2#27]
 +- *(1) LocalTableScan [a#19, b#20, c1#30]
   +- == Initial Plan ==
  Project [c1#26, c2#27, a#19, b#20]
  +- BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, false
 :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), 
[plan_id=37]
 :  +- LocalTableScan [c1#26, c2#27]
 +- LocalTableScan [a#19, b#20, c1#30]
{noformat}
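
A minimal sketch of how one might check for this programmatically (illustrative only; it assumes a running SparkSession named {{spark}} and the views defined above):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.columnar.InMemoryRelation

// Returns true when the query picks up a cached relation, i.e. its logical
// plan still contains an InMemoryRelation after cache resolution.
def usesCache(spark: SparkSession, sqlText: String): Boolean =
  spark.sql(sqlText).queryExecution.withCachedData.collect {
    case r: InMemoryRelation => r
  }.nonEmpty

// usesCache(spark, "select * from v1")  // false: the lateral-join view misses the cache
// usesCache(spark, "select * from v2")  // true: condition inside the subquery hits it
{code}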




--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Resolved] (SPARK-47527) Cache miss for queries using With expressions

2024-03-24 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-47527.
---
Resolution: Duplicate

> Cache miss for queries using With expressions
> -
>
> Key: SPARK-47527
> URL: https://issues.apache.org/jira/browse/SPARK-47527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
>
> For example:
> {noformat}
> create or replace temp view v1 as
> select id from range(10);
> create or replace temp view q1 as
> select * from v1
> where id between 2 and 4;
> cache table q1;
> explain select * from q1;
> == Physical Plan ==
> *(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
> +- *(1) Range (0, 10, step=1, splits=8)
> {noformat}
> Similarly:
> {noformat}
> create or replace temp view q2 as
> select count_if(id > 3) as cnt
> from v1;
> cache table q2;
> explain select * from q2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null 
> else _common_expr_0#88)])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
>   +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
> _common_expr_0#88) null else _common_expr_0#88)])
>  +- Project [(id#86L > 3) AS _common_expr_0#88]
> +- Range (0, 10, step=1, splits=8)
> {noformat}
> In the output of the above explain commands, neither include an 
> {{InMemoryRelation}} node.
> The culprit seems to be the common expression ids in the {{With}} expressions 
> used in runtime replacements for {{between}} and {{{}count_if{}}}, e.g. [this 
> code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (SPARK-47527) Cache misses for queries using With expressions

2024-03-23 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-47527:
-

 Summary: Cache misses for queries using With expressions
 Key: SPARK-47527
 URL: https://issues.apache.org/jira/browse/SPARK-47527
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Bruce Robbins


For example:
{noformat}
create or replace temp view v1 as
select id from range(10);

create or replace temp view q1 as
select * from v1
where id between 2 and 4;

cache table q1;

explain select * from q1;

== Physical Plan ==
*(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
+- *(1) Range (0, 10, step=1, splits=8)
{noformat}
Similarly:
{noformat}
create or replace temp view q2 as
select count_if(id > 3) as cnt
from v1;

cache table q2;

explain select * from q2;

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null else 
_common_expr_0#88)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
  +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
_common_expr_0#88) null else _common_expr_0#88)])
 +- Project [(id#86L > 3) AS _common_expr_0#88]
+- Range (0, 10, step=1, splits=8)

{noformat}
In the output of the above explain commands, neither list an 
{{InMemoryRelation}} node.

The culprit seems to be the common expression ids in the {{With}} expressions 
used in runtime replacements for {{between}} and {{count_if}}, e.g. [this 
code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].
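
A possible workaround sketch until this is fixed (an assumption based on the diagnosis above, not a confirmed fix): spell out the {{between}} predicate in the cached view definition so no {{With}} common-expression ids are introduced:
{noformat}
create or replace temp view q1 as
select * from v1
where id >= 2 and id <= 4;   -- same predicate as: id between 2 and 4

cache table q1;

explain select * from q1;
-- the plan should now contain an InMemoryRelation / in-memory table scan
{noformat}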



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (SPARK-47527) Cache miss for queries using With expressions

2024-03-23 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-47527:
--
Description: 
For example:
{noformat}
create or replace temp view v1 as
select id from range(10);

create or replace temp view q1 as
select * from v1
where id between 2 and 4;

cache table q1;

explain select * from q1;

== Physical Plan ==
*(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
+- *(1) Range (0, 10, step=1, splits=8)
{noformat}
Similarly:
{noformat}
create or replace temp view q2 as
select count_if(id > 3) as cnt
from v1;

cache table q2;

explain select * from q2;

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null else 
_common_expr_0#88)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
  +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
_common_expr_0#88) null else _common_expr_0#88)])
 +- Project [(id#86L > 3) AS _common_expr_0#88]
+- Range (0, 10, step=1, splits=8)

{noformat}
In the output of the above explain commands, neither include an 
{{InMemoryRelation}} node.

The culprit seems to be the common expression ids in the {{With}} expressions 
used in runtime replacements for {{between}} and {{{}count_if{}}}, e.g. [this 
code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].

  was:
For example:
{noformat}
create or replace temp view v1 as
select id from range(10);

create or replace temp view q1 as
select * from v1
where id between 2 and 4;

cache table q1;

explain select * from q1;

== Physical Plan ==
*(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
+- *(1) Range (0, 10, step=1, splits=8)
{noformat}
Similarly:
{noformat}
create or replace temp view q2 as
select count_if(id > 3) as cnt
from v1;

cache table q2;

explain select * from q2;

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null else 
_common_expr_0#88)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
  +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
_common_expr_0#88) null else _common_expr_0#88)])
 +- Project [(id#86L > 3) AS _common_expr_0#88]
+- Range (0, 10, step=1, splits=8)

{noformat}
In the output of the above explain commands, neither list an 
{{InMemoryRelation}} node.

The culprit seems to be the common expression ids in the {{With}} expressions 
used in runtime replacements for {{between}} and {{count_if}}, e.g. [this 
code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].


> Cache miss for queries using With expressions
> -
>
> Key: SPARK-47527
> URL: https://issues.apache.org/jira/browse/SPARK-47527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> For example:
> {noformat}
> create or replace temp view v1 as
> select id from range(10);
> create or replace temp view q1 as
> select * from v1
> where id between 2 and 4;
> cache table q1;
> explain select * from q1;
> == Physical Plan ==
> *(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
> +- *(1) Range (0, 10, step=1, splits=8)
> {noformat}
> Similarly:
> {noformat}
> create or replace temp view q2 as
> select count_if(id > 3) as cnt
> from v1;
> cache table q2;
> explain select * from q2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null 
> else _common_expr_0#88)])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
>   +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
> _common_expr_0#88) null else _common_expr_0#88)])
>  +- Project [(id#86L > 3) AS _common_expr_0#88]
> +- Range (0, 10, step=1, splits=8)
> {noformat}
> In the output of the above explain commands, neither include an 
> {{InMemoryRelation}} node.
> The culprit seems to be the common expression ids in the {{With}} expressions 
> used in runtime replacements for {{between}} and {{{}count_if{}}}, e.g. [this 
> code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (SPARK-47527) Cache miss for queries using With expressions

2024-03-23 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-47527:
--
Summary: Cache miss for queries using With expressions  (was: Cache misses 
for queries using With expressions)

> Cache miss for queries using With expressions
> -
>
> Key: SPARK-47527
> URL: https://issues.apache.org/jira/browse/SPARK-47527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> For example:
> {noformat}
> create or replace temp view v1 as
> select id from range(10);
> create or replace temp view q1 as
> select * from v1
> where id between 2 and 4;
> cache table q1;
> explain select * from q1;
> == Physical Plan ==
> *(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
> +- *(1) Range (0, 10, step=1, splits=8)
> {noformat}
> Similarly:
> {noformat}
> create or replace temp view q2 as
> select count_if(id > 3) as cnt
> from v1;
> cache table q2;
> explain select * from q2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null 
> else _common_expr_0#88)])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
>   +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
> _common_expr_0#88) null else _common_expr_0#88)])
>  +- Project [(id#86L > 3) AS _common_expr_0#88]
> +- Range (0, 10, step=1, splits=8)
> {noformat}
> In the output of the above explain commands, neither list an 
> {{InMemoryRelation}} node.
> The culprit seems to be the common expression ids in the {{With}} expressions 
> used in runtime replacements for {{between}} and {{count_if}}, e.g. [this 
> code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Comment Edited] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-02-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821393#comment-17821393
 ] 

Bruce Robbins edited comment on SPARK-47193 at 2/27/24 8:48 PM:


Running this in Spark 3.5.0 in local mode on my laptop, I get
{noformat}
df count = 8
...
rdd count = 8
{noformat}
What is your environment and Spark configuration?

By the way, the "{{...}}" above are messages like
{noformat}
24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the 
schema.
 Header: UserId, LocationId, LocationName, CreatedDate, Status
 Schema: UserId, LocationId, LocationName, Status, CreatedDate
Expected: Status but found: CreatedDate
CSV file: file:userLocation.csv
{noformat}



was (Author: bersprockets):
Running this in Spark 3.5.0 in local mode on my laptop, I get
{noformat}
df count = 8
...
rdd count = 8
{noformat}
What is your environment and Spark configuration?

By the way, the {{...}} above are messages like
{noformat}
24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the 
schema.
 Header: UserId, LocationId, LocationName, CreatedDate, Status
 Schema: UserId, LocationId, LocationName, Status, CreatedDate
Expected: Status but found: CreatedDate
CSV file: file:userLocation.csv
{noformat}


> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 csv files and need to create a mapping from them. After all of the 
> joins, the dataframe contains all expected rows, but the rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
> val language = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
> val device = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
> val deviceClass = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
> val deviceType = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
> val location1 = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]
> val 

[jira] [Commented] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-02-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821393#comment-17821393
 ] 

Bruce Robbins commented on SPARK-47193:
---

Running this in Spark 3.5.0 in local mode on my laptop, I get
{noformat}
df count = 8
...
rdd count = 8
{noformat}
What is your environment and Spark configuration?

By the way, the {{...}} above are messages like
{noformat}
24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the 
schema.
 Header: UserId, LocationId, LocationName, CreatedDate, Status
 Schema: UserId, LocationId, LocationName, Status, CreatedDate
Expected: Status but found: CreatedDate
CSV file: file:userLocation.csv
{noformat}
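
The warning above suggests the declared schema's column order differs from the CSV header. A minimal sketch of one way to surface that mismatch as an error instead of a warning (assuming the same case classes and files as in the report; {{enforceSchema}} is a standard CSV reader option):
{code:scala}
// Sketch only: reuses MyUserLocationMessage and userLocation.csv from the
// report above, plus a local SparkSession named `spark`.
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// With enforceSchema=false, Spark validates the CSV header names against the
// declared schema instead of applying the schema purely by position, so a
// column-order mismatch surfaces instead of silently reshuffling values.
val userLocation = spark.read
  .option("header", "true")
  .option("comment", "#")
  .option("nullValue", "null")
  .option("enforceSchema", "false")
  .schema(Encoders.product[MyUserLocationMessage].schema)
  .csv("userLocation.csv")
  .as[MyUserLocationMessage]
{code}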


> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 csv files and need to create a mapping from them. After all of the 
> joins, the dataframe contains all expected rows, but the rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
> val language = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
> val device = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
> val deviceClass = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
> val deviceType = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
> val location1 = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]
> val timeZoneLookup = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyTimeZoneLookupMessage].schema).csv("timeZoneLookup.csv").as[MyTimeZoneLookupMessage]
> val userLocation = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserLocationMessage].schema).csv("userLocation.csv").as[MyUserLocationMessage]
> val user = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> 

[jira] [Commented] (SPARK-47134) Unexpected nulls when casting decimal values in specific cases

2024-02-22 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819789#comment-17819789
 ] 

Bruce Robbins commented on SPARK-47134:
---

Oddly, I cannot reproduce on either 3.4.1 or 3.5.0.

Also, my 3.4.1 plan doesn't look like your 3.4.1 plan: My plan uses {{sum}}, 
your plan uses {{decimalsum}}. I can't find where {{decimalsum}} comes from in 
the code base, but maybe I am not looking hard enough.
{noformat}
scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", x)).toDS
ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> ds.createOrReplaceTempView("t")

scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct 
FROM t GROUP BY `_1` ORDER BY ct ASC").show()
++
|  ct|
++
| 9508.00|
|13879.00|
++

scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct 
FROM t GROUP BY `_1` ORDER BY ct ASC").explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [ct#19 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(ct#19 ASC NULLS FIRST, 200), 
ENSURE_REQUIREMENTS, [plan_id=68]
  +- HashAggregate(keys=[_1#2], functions=[sum(1.00)])
 +- Exchange hashpartitioning(_1#2, 200), ENSURE_REQUIREMENTS, 
[plan_id=65]
+- HashAggregate(keys=[_1#2], 
functions=[partial_sum(1.00)])
   +- LocalTableScan [_1#2]

scala> sql("select version()").show(false)
+--+
|version() |
+--+
|3.4.1 6b1ff22dde1ead51cbf370be6e48a802daae58b6|
+--+

scala> 
{noformat}
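
Regarding the reporter's workaround question, the full name of the ANSI flag is {{spark.sql.ansi.enabled}}; a minimal sketch for flipping it within the same session (reusing the view {{t}} above):
{noformat}
scala> spark.conf.set("spark.sql.ansi.enabled", "true")

scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()

scala> spark.conf.set("spark.sql.ansi.enabled", "false")
{noformat}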

> Unexpected nulls when casting decimal values in specific cases
> --
>
> Key: SPARK-47134
> URL: https://issues.apache.org/jira/browse/SPARK-47134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Dylan Walker
>Priority: Major
> Attachments: 321queryplan.txt, 341queryplan.txt
>
>
> In specific cases, casting decimal values can result in `null` values where 
> no overflow exists.
> The cases appear very specific, and I don't have the depth of knowledge to 
> generalize this issue, so here is a simple spark-shell reproduction:
> *Setup:*
> {code:scala}
> scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", 
> x)).toDS
> ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
> scala> ds.createOrReplaceTempView("t")
> {code}
>  
> *Spark 3.2.1 behaviour (correct):*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct 
> FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> ++
> |  ct|
> ++
> | 9508.00|
> |13879.00|
> ++
> {code}
> *Spark 3.4.1 / Spark 3.5.0 behaviour:*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct 
> FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> +---+
> | ct|
> +---+
> |   null|
> |9508.00|
> +---+
> {code}
> This is fairly delicate:
>  - removing the {{ORDER BY}} clause produces the correct result
>  - removing the {{CAST}} produces the correct result
>  - changing the number of 0s in the argument to {{SUM}} produces the correct 
> result
>  - setting {{spark.ansi.enabled}} to {{true}} produces the correct result 
> (and does not throw an error)
> Also, removing the {{ORDER BY}}, but writing {{ds}} to a parquet will also 
> result in the unexpected nulls.
> Please let me know if you need additional information.
> We are also interested in understanding whether setting 
> {{spark.ansi.enabled}} can be considered a reliable workaround to this issue 
> prior to a fix being released, if possible.
> Text files that include {{explain()}} output attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-21 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-47104:
--
Affects Version/s: 3.5.0
   3.4.2

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.4.2, 3.5.0
>Reporter: Chhavi Bansal
>Priority: Major
>
> I am trying to run a very simple SQL query involving a join and an order by 
> clause and then using the UUID() function in the outermost select statement. The 
> query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> Dataset is a normal csv file with the following columns
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at 
> hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala) {code}
> Note:
> # If I remove the order by clause, then it produces the correct output.
>  # This happens when I read the dataset from a csv file; it works fine if I make 
> the dataframe using Seq().toDF
>  # The query fails if I use spark.sql("query").show() but succeeds when I 
> simply write it to a csv file
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Please can someone look into why this happens just when using `show()` since 
> this is failing queries in production for me.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-20 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818934#comment-17818934
 ] 

Bruce Robbins commented on SPARK-47104:
---

It's not a CSV-specific issue. You can reproduce it with a cached view. The 
following fails on the master branch, when using {{spark-sql}}:
{noformat}
create or replace temp view v1(id, name) as values
(1, "fred"),
(2, "bob");

cache table v1;

select name, uuid() as _iid from (
  select s.name
  from v1 s
  join v1 t
  on s.name = t.name
  order by name
)
limit 20;
{noformat}
The exception is:
{noformat}
java.lang.NullPointerException: Cannot invoke 
"org.apache.spark.sql.catalyst.util.RandomUUIDGenerator.getNextUUIDUTF8String()"
 because "this.randomGen_0" is null
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$6(limit.scala:297)
at scala.collection.ArrayOps$.map$extension(ArrayOps.scala:934)
at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$1(limit.scala:297)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:286)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:390)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:418)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:390)
{noformat}
It seems that non-deterministic expressions are not getting initialized before 
being used in the unsafe projection. I can take a look.
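
For illustration, a minimal sketch of the initialization contract involved (not the actual patch): non-deterministic expressions such as {{Uuid}} keep per-partition state that is only set up by {{initialize}}, which matches the null {{randomGen_0}} in the stack trace above.
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Uuid, UnsafeProjection}

// Uuid keeps a per-partition random generator that is only created inside
// initialize(); a fixed seed is passed here just so the expression is resolved.
val proj = UnsafeProjection.create(Seq(Uuid(Some(42L))))
proj.initialize(0)  // skipping this leaves the generated random-generator field null
val row = proj(InternalRow())
{code}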

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Chhavi Bansal
>Priority: Major
>
> I am trying to run a very simple SQL query involving a join and an order by 
> clause and then using the UUID() function in the outermost select statement. The 
> query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> Dataset is a normal csv file with the following columns
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> 

[jira] [Commented] (SPARK-47034) join between cached temp tables result in missing entries

2024-02-13 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817123#comment-17817123
 ] 

Bruce Robbins commented on SPARK-47034:
---

I wonder if this is SPARK-45592 (and, relatedly, SPARK-45282), which existed as 
a bug in 3.5.0 but is fixed on master and branch-3.5.
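
In the meantime, a quick sketch for testing that hypothesis on 3.5.0 (an assumption, based on the ticket's own observation that disabling AQE makes the join work):
{noformat}
-- SQL:
set spark.sql.adaptive.enabled=false;

// Scala:
spark.conf.set("spark.sql.adaptive.enabled", "false")
{noformat}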

> join between cached temp tables result in missing entries
> -
>
> Key: SPARK-47034
> URL: https://issues.apache.org/jira/browse/SPARK-47034
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 3.5.0
>Reporter: shurik mermelshtein
>Priority: Major
>
> We create several temp tables (views) by loading several delta tables and 
> joining between them. 
> Those views are used for calculation of different metrics. Each metric 
> requires different views to be used. Some of the more popular views are 
> cached for better performance. 
> We have noticed that once we upgraded from Spark 3.4.2 to Spark 3.5.0, some 
> of the joins started to fail.
> We can reproduce a case where we have 2 data frames (views) (these are not the 
> real names / values we use; this is just for the example):
>  # users with the column user_id, campaign_id, user_name.
> we make sure it has a single entry
> '11', '2', 'Jhon Doe'
>  # actions with the column user_id, campaign_id, action_id, action count
> we make sure it has a single entry
> '11', '2', 'clicks', 5
>  
>  # users view can be filtered for user_id = '11' or/and campaign_id = 
> '2' and it will find the existing single row
>  # actions view can be filtered for user_id = '11' or/and campaign_id = 
> '2' and it will find the existing single row
>  # users and actions can be inner joined by user_id *OR* campaign_id and the 
> join will be successful. 
>  # users and actions can *not* be inner joined by user_id *AND* campaign_id. 
> The join results in no entries.
>  # if we write both of the views to S3 and read them back to new data frames, 
> suddenly the join is working.
>  # if we disable AQE the join is working
>  # running checkpoint on the views does not make join #4 work



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (SPARK-47019) AQE dynamic cache partitioning causes SortMergeJoin to result in data loss

2024-02-10 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816321#comment-17816321
 ] 

Bruce Robbins commented on SPARK-47019:
---

I can reproduce on my laptop using Spark 3.5.0 and {{--master 
"local-cluster[3,1,1024]"}}. However, I can not reproduce on the latest 
branch-3.5 or master.

So it seems to have been fixed, probably by SPARK-45592.
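
For deployments stuck on 3.5.0, a possible stop-gap sketch (an assumption, echoing the circumvention noted in the ticket itself) is to disable cached-plan output repartitioning:
{noformat}
spark-submit \
  --conf spark.sql.optimizer.canChangeCachedPlanOutputPartitioning=false \
  ... (remaining arguments as in the ticket's spark-submit)
{noformat}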


> AQE dynamic cache partitioning causes SortMergeJoin to result in data loss
> --
>
> Key: SPARK-47019
> URL: https://issues.apache.org/jira/browse/SPARK-47019
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 3.5.0
> Environment: Tested in 3.5.0
> Reproduced on, so far:
>  * kubernetes deployment
>  * docker cluster deployment
> Local Cluster:
>  * master
>  * worker1 (2/2G)
>  * worker2 (1/1G)
>Reporter: Ridvan Appa Bugis
>Priority: Blocker
>  Labels: DAG, caching, correctness, data-loss, 
> dynamic_allocation, inconsistency, partitioning
> Attachments: Screenshot 2024-02-07 at 20.09.44.png, Screenshot 
> 2024-02-07 at 20.10.07.png, eventLogs-app-20240207175940-0023.zip, 
> testdata.zip
>
>
> It seems like we have encountered an issue with Spark AQE's dynamic cache 
> partitioning which causes incorrect *count* output values and data loss.
> A similar issue could not be found, so I am creating this ticket to raise 
> awareness.
>  
> Preconditions:
>  - Setup a cluster as per environment specification
>  - Prepare test data (or a data large enough to trigger read by both 
> executors)
> Steps to reproduce:
>  - Read parent
>  - Self join parent
>  - cache + materialize parent
>  - Join parent with child
>  
> Performing a self-join over a parentDF, then caching + materialising the DF, 
> and then joining it with a childDF results in *incorrect* count value and 
> {*}missing data{*}.
>  
> Performing a *repartition* seems to fix the issue, most probably due to 
> rearrangement of the underlying partitions and statistic update.
>  
> This behaviour is observed over a multi-worker cluster with a job running 2 
> executors (1 per worker), when reading a large enough data file by both 
> executors.
> Not reproducible in local mode.
>  
> Circumvention:
> So far, by disabling 
> _spark.sql.optimizer.canChangeCachedPlanOutputPartitioning_ or performing 
> repartition this can be alleviated, but it is not the fix of the root cause.
>  
> This issue is dangerous considering that data loss is occurring silently and, 
> in the absence of proper checks, can lead to wrong behaviour/results down the 
> line. So we have labeled it as a blocker.
>  
> There seems to be a file-size threshold after which data loss is observed 
> (possibly implying that it happens when both executors start reading the data 
> file)
>  
> Minimal example:
> {code:java}
> // Read parent
> val parentData = session.read.format("avro").load("/data/shared/test/parent")
> // Self join parent and cache + materialize
> val parent = parentData.join(parentData, Seq("PID")).cache()
> parent.count()
> // Read child
> val child = session.read.format("avro").load("/data/shared/test/child")
> // Basic join
> val resultBasic = child.join(
>   parent,
>   parent("PID") === child("PARENT_ID")
> )
> // Count: 16479 (Wrong)
> println(s"Count no repartition: ${resultBasic.count()}")
> // Repartition parent join
> val resultRepartition = child.join(
>   parent.repartition(),
>   parent("PID") === child("PARENT_ID")
> )
> // Count: 50094 (Correct)
> println(s"Count with repartition: ${resultRepartition.count()}") {code}
>  
> Invalid count-only DAG:
>   !Screenshot 2024-02-07 at 20.10.07.png|width=519,height=853!
> Valid repartition DAG:
> !Screenshot 2024-02-07 at 20.09.44.png|width=368,height=1219!  
>  
> Spark submit for this job:
> {code:java}
> spark-submit 
>   --class ExampleApp 
>   --packages org.apache.spark:spark-avro_2.12:3.5.0 
>   --deploy-mode cluster 
>   --master spark://spark-master:6066 
>   --conf spark.sql.autoBroadcastJoinThreshold=-1  
>   --conf spark.cores.max=3 
>   --driver-cores 1 
>   --driver-memory 1g 
>   --executor-cores 1 
>   --executor-memory 1g 
>   /path/to/test.jar
>  {code}
> The cluster should be set up as follows (worker1(m+e), worker2(e)) so as to 
> split the executors onto two workers.
> I have prepared a simple github repository which contains a compilable 
> version of the above example.
> [https://github.com/ridvanappabugis/spark-3.5-issue]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (SPARK-46779) Grouping by subquery with a cached relation can fail

2024-01-19 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-46779:
--
Description: 
Example:
{noformat}
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(3, 7),
(4, 5);

cache table data;

select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from 
data d2 group by all;
{noformat}
It fails with the following error:
{noformat}
[INTERNAL_ERROR] Couldn't find count(1)#163L in 
[c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L 
in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
{noformat}
If you don't cache the view, the query succeeds.

Note, in 3.4.2 and 3.5.0 the issue happens only with cached tables, not cached 
views. I think that's because cached views were not getting properly 
deduplicated in those versions.

  was:
Example:
{noformat}
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(3, 7),
(4, 5);

cache table data;

select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from 
data d2 group by all;
{noformat}
It fails with the following error:
{noformat}
[INTERNAL_ERROR] Couldn't find count(1)#163L in 
[c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L 
in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
{noformat}
If you don't cache the view, the query succeeds.


> Grouping by subquery with a cached relation can fail
> 
>
> Key: SPARK-46779
> URL: https://issues.apache.org/jira/browse/SPARK-46779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
>
> Example:
> {noformat}
> create or replace temp view data(c1, c2) as values
> (1, 2),
> (1, 3),
> (3, 7),
> (4, 5);
> cache table data;
> select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from 
> data d2 group by all;
> {noformat}
> It fails with the following error:
> {noformat}
> [INTERNAL_ERROR] Couldn't find count(1)#163L in 
> [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
> org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L 
> in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
> {noformat}
> If you don't cache the view, the query succeeds.
> Note, in 3.4.2 and 3.5.0 the issue happens only with cached tables, not 
> cached views. I think that's because cached views were not getting properly 
> deduplicated in those versions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (SPARK-46779) Grouping by subquery with a cached relation can fail

2024-01-19 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-46779:
--
Affects Version/s: 3.5.0
   3.4.2

> Grouping by subquery with a cached relation can fail
> 
>
> Key: SPARK-46779
> URL: https://issues.apache.org/jira/browse/SPARK-46779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
>
> Example:
> {noformat}
> create or replace temp view data(c1, c2) as values
> (1, 2),
> (1, 3),
> (3, 7),
> (4, 5);
> cache table data;
> select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from 
> data d2 group by all;
> {noformat}
> It fails with the following error:
> {noformat}
> [INTERNAL_ERROR] Couldn't find count(1)#163L in 
> [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
> org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L 
> in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
> {noformat}
> If you don't cache the view, the query succeeds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (SPARK-46779) Grouping by subquery with a cached relation can fail

2024-01-19 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-46779:
-

 Summary: Grouping by subquery with a cached relation can fail
 Key: SPARK-46779
 URL: https://issues.apache.org/jira/browse/SPARK-46779
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Bruce Robbins


Example:
{noformat}
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(3, 7),
(4, 5);

cache table data;

select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from 
data d2 group by all;
{noformat}
It fails with the following error:
{noformat}
[INTERNAL_ERROR] Couldn't find count(1)#163L in 
[c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L 
in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
{noformat}
If you don't cache the view, the query succeeds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (SPARK-46373) Create DataFrame Bug

2023-12-13 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17796385#comment-17796385
 ] 

Bruce Robbins commented on SPARK-46373:
---

Maybe due to this (from [the docs|https://spark.apache.org/docs/3.5.0/]):

{quote}Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.8+, and R 
3.5+.{quote}

Scala 3 is not listed as a supported version.
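
A possible Scala 3 workaround sketch (an assumption, not an officially supported path) is to sidestep the TypeTag-based encoder entirely by passing an explicit schema and Rows:
{code:scala}
import java.util.Arrays

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Build the DataFrame from an explicit schema and Rows, avoiding the
// scala.reflect TypeTag machinery that Scala 3 does not provide.
val schema = StructType(Seq(StructField("name", StringType, nullable = false)))
val rows = Arrays.asList(Row("A"), Row("B"), Row("C"))
val df = spark.createDataFrame(rows, schema)
df.show()
{code}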

> Create DataFrame Bug
> 
>
> Key: SPARK-46373
> URL: https://issues.apache.org/jira/browse/SPARK-46373
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Bleibtreu
>Priority: Major
>
> Scala version is 3.3.1
> Spark version is 3.5.0
> I am using spark-core 3.5.1. I am trying to create a DataFrame through the 
> reflection API, but "No TypeTag available for Person" appears. I have 
> tried for a long time, but I still don't quite understand why TypeTag cannot 
> recognize my Person case class. 
> {code:java}
>     import sparkSession.implicits._
>     import scala.reflect.runtime.universe._
>     case class Person(name: String)
>     val a = List(Person("A"), Person("B"), Person("C"))
>     val df = sparkSession.createDataFrame(a)
>     df.show(){code}
> !https://media.discordapp.net/attachments/839723072239566878/1183747749204725821/image.png?ex=65897600=65770100=4eeba8d8499499439590a34260f8b441c6594c572c545f5f61f8dc65beeb6a4b&==webp=lossless=1178=142!
> I tested it and it is indeed a problem unique to Scala 3.
> There is no problem on Scala 2.13.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46289) Exception when ordering by UDT in interpreted mode

2023-12-08 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-46289:
--
Priority: Minor  (was: Major)

> Exception when ordering by UDT in interpreted mode
> --
>
> Key: SPARK-46289
> URL: https://issues.apache.org/jira/browse/SPARK-46289
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.2, 3.5.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> In interpreted mode, ordering by a UDT will result in an exception. For 
> example:
> {noformat}
> import org.apache.spark.ml.linalg.{DenseVector, Vector}
> val df = Seq.tabulate(30) { x =>
>   (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + 
> 1)/100.0).toDouble, ((x + 3)/100.0).toDouble)))
> }.toDF("id", "c1", "c2", "c3")
> df.createOrReplaceTempView("df")
> // this works
> sql("select * from df order by c3").collect
> sql("set spark.sql.codegen.wholeStage=false")
> sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
> // this gets an error
> sql("select * from df order by c3").collect
> {noformat}
> The second {{collect}} action results in the following exception:
> {noformat}
> org.apache.spark.SparkIllegalArgumentException: Type 
> UninitializedPhysicalType does not support ordered operations.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348)
>   at 
> org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332)
>   at 
> org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254)
> {noformat}
> Note: You don't get an error if you use {{show}} rather than {{collect}}. 
> This is because {{show}} will implicitly add a {{limit}}, in which case the 
> ordering is performed by {{TakeOrderedAndProject}} rather than 
> {{UnsafeExternalRowSorter}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46289) Exception when ordering by UDT in interpreted mode

2023-12-06 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-46289:
--
Affects Version/s: 3.3.3

> Exception when ordering by UDT in interpreted mode
> --
>
> Key: SPARK-46289
> URL: https://issues.apache.org/jira/browse/SPARK-46289
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.2, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> In interpreted mode, ordering by a UDT will result in an exception. For 
> example:
> {noformat}
> import org.apache.spark.ml.linalg.{DenseVector, Vector}
> val df = Seq.tabulate(30) { x =>
>   (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + 
> 1)/100.0).toDouble, ((x + 3)/100.0).toDouble)))
> }.toDF("id", "c1", "c2", "c3")
> df.createOrReplaceTempView("df")
> // this works
> sql("select * from df order by c3").collect
> sql("set spark.sql.codegen.wholeStage=false")
> sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
> // this gets an error
> sql("select * from df order by c3").collect
> {noformat}
> The second {{collect}} action results in the following exception:
> {noformat}
> org.apache.spark.SparkIllegalArgumentException: Type 
> UninitializedPhysicalType does not support ordered operations.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348)
>   at 
> org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332)
>   at 
> org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254)
> {noformat}
> Note: You don't get an error if you use {{show}} rather than {{collect}}. 
> This is because {{show}} will implicitly add a {{limit}}, in which case the 
> ordering is performed by {{TakeOrderedAndProject}} rather than 
> {{UnsafeExternalRowSorter}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46289) Exception when ordering by UDT in interpreted mode

2023-12-06 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-46289:
-

 Summary: Exception when ordering by UDT in interpreted mode
 Key: SPARK-46289
 URL: https://issues.apache.org/jira/browse/SPARK-46289
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.2
Reporter: Bruce Robbins


In interpreted mode, ordering by a UDT will result in an exception. For example:
{noformat}
import org.apache.spark.ml.linalg.{DenseVector, Vector}

val df = Seq.tabulate(30) { x =>
  (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + 
1)/100.0).toDouble, ((x + 3)/100.0).toDouble)))
}.toDF("id", "c1", "c2", "c3")

df.createOrReplaceTempView("df")

// this works
sql("select * from df order by c3").collect

sql("set spark.sql.codegen.wholeStage=false")
sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

// this gets an error
sql("select * from df order by c3").collect
{noformat}
The second {{collect}} action results in the following exception:
{noformat}
org.apache.spark.SparkIllegalArgumentException: Type UninitializedPhysicalType 
does not support ordered operations.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348)
at 
org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332)
at 
org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254)
{noformat}
Note: You don't get an error if you use {{show}} rather than {{collect}}. This 
is because {{show}} will implicitly add a {{limit}}, in which case the ordering 
is performed by {{TakeOrderedAndProject}} rather than 
{{UnsafeExternalRowSorter}}.
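A spark-shell sketch of the limit-based variants implied by the note above (illustrative only; assumes the same {{df}} view and interpreted-mode settings as in the repro):
{noformat}
// Per the note above, an explicit limit routes the sort through
// TakeOrderedAndProject, so these variants are expected not to hit
// the InterpretedOrdering error even with codegen disabled.
sql("select * from df order by c3 limit 5").collect

// show() adds an implicit limit, which is why it does not fail either.
sql("select * from df order by c3").show()
{noformat}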



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-12-04 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792942#comment-17792942
 ] 

Bruce Robbins commented on SPARK-45644:
---

Even though this is the original issue, I closed it as a duplicate because the 
fix was applied under SPARK-45896.

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> I do not really know if this is a bug, but I am at the end of my knowledge.
> A Spark job ran successfully with Spark 3.2.x and 3.3.x. 
> But after upgrading to 3.4.1 (as well as with 3.5.0) running the same job 
> with the same data the following always occurs now:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> 

[jira] [Resolved] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-12-04 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-45644.
---
Resolution: Duplicate

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> I do not really know if this is a bug, but I am at the end of my knowledge.
> A Spark job ran successfully with Spark 3.2.x and 3.3.x. 
> But after upgrading to 3.4.1 (as well as with 3.5.0) running the same job 
> with the same data the following always occurs now:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> 

[jira] [Updated] (SPARK-46189) Various Pandas functions fail in interpreted mode

2023-11-30 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-46189:
--
Description: 
Various Pandas functions ({{kurt}}, {{var}}, {{skew}}, {{cov}}, and {{stddev}}) 
fail with an unboxing-related exception when run in interpreted mode.

Here are some reproduction cases for pyspark interactive mode:
{noformat}
spark.sql("set spark.sql.codegen.wholeStage=false")
spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

import numpy as np
import pandas as pd

import pyspark.pandas as ps

pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a")
psser = ps.from_pandas(pser)

# each of the following actions gets an unboxing error
psser.kurt()
psser.var()
psser.skew()

# set up for covariance test
pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"])
psdf = ps.from_pandas(pdf)

# this gets an unboxing error
psdf.cov()

# set up for stddev test
from pyspark.pandas.spark import functions as SF
from pyspark.sql.functions import col
from pyspark.sql import Row
df = spark.createDataFrame([Row(a=1), Row(a=2), Row(a=3), Row(a=7), Row(a=9), 
Row(a=8)])

# this gets an unboxing error
df.select(SF.stddev(col("a"), 1)).collect()
{noformat}
Exception from the first case ({{psser.kurt()}}) is
{noformat}
java.lang.ClassCastException: class java.lang.Integer cannot be cast to class 
java.lang.Double (java.lang.Integer and java.lang.Double are in module 
java.base of loader 'bootstrap')
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:112)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.compare(PhysicalDataType.scala:184)
at scala.math.Ordering.lt(Ordering.scala:98)
at scala.math.Ordering.lt$(Ordering.scala:98)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.lt(PhysicalDataType.scala:184)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.nullSafeEval(predicates.scala:1196)
{noformat}

  was:
Various Pandas functions ({{kurt}}, {{var}}, {{skew}}, {{cov}}, and {{stddev}}) 
fail with an unboxing-related exception when run in interpreted mode.

Here are some reproduction cases for pyspark interactive mode:
{noformat}
sql("set spark.sql.codegen.wholeStage=false")
spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

import numpy as np
import pandas as pd

import pyspark.pandas as ps

pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a")
psser = ps.from_pandas(pser)

# each of the following actions gets an unboxing error
psser.kurt()
psser.var()
psser.skew()

# set up for covariance test
pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"])
psdf = ps.from_pandas(pdf)

# this gets an unboxing error
psdf.cov()

# set up for stddev test
from pyspark.pandas.spark import functions as SF
from pyspark.sql.functions import col
from pyspark.sql import Row
df = spark.createDataFrame([Row(a=1), Row(a=2), Row(a=3), Row(a=7), Row(a=9), 
Row(a=8)])

# this gets an unboxing error
df.select(SF.stddev(col("a"), 1)).collect()
{noformat}
Exception from the first case ({{psser.kurt()}}) is
{noformat}
java.lang.ClassCastException: class java.lang.Integer cannot be cast to class 
java.lang.Double (java.lang.Integer and java.lang.Double are in module 
java.base of loader 'bootstrap')
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:112)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.compare(PhysicalDataType.scala:184)
at scala.math.Ordering.lt(Ordering.scala:98)
at scala.math.Ordering.lt$(Ordering.scala:98)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.lt(PhysicalDataType.scala:184)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.nullSafeEval(predicates.scala:1196)
{noformat}


> Various Pandas functions fail in interpreted mode
> -
>
> Key: SPARK-46189
> URL: https://issues.apache.org/jira/browse/SPARK-46189
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Various Pandas functions ({{kurt}}, {{var}}, {{skew}}, {{cov}}, and 
> {{stddev}}) fail with an unboxing-related exception when run in interpreted 
> mode.
> Here are some reproduction cases for pyspark interactive mode:
> {noformat}
> spark.sql("set spark.sql.codegen.wholeStage=false")
> spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
> import numpy as np
> import pandas as pd
> import pyspark.pandas as ps
> pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a")
> psser = ps.from_pandas(pser)
> # each of the following actions gets an unboxing error
> psser.kurt()
> psser.var()
> psser.skew()
> # set up for 

[jira] [Created] (SPARK-46189) Various Pandas functions fail in interpreted mode

2023-11-30 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-46189:
-

 Summary: Various Pandas functions fail in interpreted mode
 Key: SPARK-46189
 URL: https://issues.apache.org/jira/browse/SPARK-46189
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark, SQL
Affects Versions: 3.5.0, 3.4.1
Reporter: Bruce Robbins


Various Pandas functions ({{kurt}}, {{var}}, {{skew}}, {{cov}}, and {{stddev}}) 
fail with an unboxing-related exception when run in interpreted mode.

Here are some reproduction cases for pyspark interactive mode:
{noformat}
sql("set spark.sql.codegen.wholeStage=false")
spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

import numpy as np
import pandas as pd

import pyspark.pandas as ps

pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a")
psser = ps.from_pandas(pser)

# each of the following actions gets an unboxing error
psser.kurt()
psser.var()
psser.skew()

# set up for covariance test
pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"])
psdf = ps.from_pandas(pdf)

# this gets an unboxing error
psdf.cov()

# set up for stddev test
from pyspark.pandas.spark import functions as SF
from pyspark.sql.functions import col
from pyspark.sql import Row
df = spark.createDataFrame([Row(a=1), Row(a=2), Row(a=3), Row(a=7), Row(a=9), 
Row(a=8)])

# this gets an unboxing error
df.select(SF.stddev(col("a"), 1)).collect()
{noformat}
Exception from the first case ({{psser.kurt()}}) is
{noformat}
java.lang.ClassCastException: class java.lang.Integer cannot be cast to class 
java.lang.Double (java.lang.Integer and java.lang.Double are in module 
java.base of loader 'bootstrap')
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:112)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.compare(PhysicalDataType.scala:184)
at scala.math.Ordering.lt(Ordering.scala:98)
at scala.math.Ordering.lt$(Ordering.scala:98)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.lt(PhysicalDataType.scala:184)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.nullSafeEval(predicates.scala:1196)
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]

2023-11-11 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17785234#comment-17785234
 ] 

Bruce Robbins commented on SPARK-45896:
---

I think I have a handle on this and will make a PR shortly.

> Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
> --
>
> Key: SPARK-45896
> URL: https://issues.apache.org/jira/browse/SPARK-45896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following action fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -1), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
> scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
> AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> However, it succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
> {noformat}
> Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: 
> externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), 
> assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
> lambdavariable(ExternalMapToCatalyst_value, ObjectType(class 
> java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -3), 
> assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -3), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
> ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
> ObjectType(class scala.Option))), None), input[0, 
> scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> As with the first example, this succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
> {noformat}
> Other cases that fail on 3.4.1, 3.5.0, and master but work fine on 3.3.3:
> - {{Seq[Option[Timestamp]]}}
> - {{Map[Option[Timestamp]]}}
> - {{Seq[Option[Date]]}}
> - {{Map[Option[Date]]}}
> - {{Seq[Option[BigDecimal]]}}
> - {{Map[Option[BigDecimal]]}}
> However, the following work fine on 3.3.3, 3.4.1, 3.5.0, and master:
> - {{Seq[Option[Map]]}}
> - {{Map[Option[Map]]}}
> - {{Seq[Option[]]}}
> - {{Map[Option[]]}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]

2023-11-11 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45896:
--
Description: 
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: 
externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), 
assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), 
true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), 
assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
ObjectType(class scala.Option))), None), input[0, 
scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
As with the first example, this succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}
Other cases that fail on 3.4.1, 3.5.0, and master but work fine on 3.3.3:
- {{Seq[Option[Timestamp]]}}
- {{Map[Option[Timestamp]]}}
- {{Seq[Option[Date]]}}
- {{Map[Option[Date]]}}
- {{Seq[Option[BigDecimal]]}}
- {{Map[Option[BigDecimal]]}}

However, the following work fine on 3.3.3, 3.4.1, 3.5.0, and master:

- {{Seq[Option[Map]]}}
- {{Map[Option[Map]]}}
- {{Seq[Option[]]}}
- {{Map[Option[]]}}

  was:
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: 

[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]

2023-11-11 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45896:
--
Summary: Expression encoding fails for Seq/Map of 
Option[Seq/Date/Timestamp/BigDecimal]  (was: Expression encoding fails for 
Seq/Map of Option[Seq])

> Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
> --
>
> Key: SPARK-45896
> URL: https://issues.apache.org/jira/browse/SPARK-45896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following action fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -1), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
> scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
> AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> However, it succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
> {noformat}
> Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: 
> externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), 
> assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
> lambdavariable(ExternalMapToCatalyst_value, ObjectType(class 
> java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -3), 
> assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -3), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
> ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
> ObjectType(class scala.Option))), None), input[0, 
> scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> As with the first example, this succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq]

2023-11-11 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45896:
--
Description: 
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: 
externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), 
assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), 
true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), 
assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
ObjectType(class scala.Option))), None), input[0, 
scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
As with the first example, this succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}

  was:
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of option of sequence also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 

[jira] [Created] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq]

2023-11-11 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-45896:
-

 Summary: Expression encoding fails for Seq/Map of Option[Seq]
 Key: SPARK-45896
 URL: https://issues.apache.org/jira/browse/SPARK-45896
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.1
Reporter: Bruce Robbins


The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of option of sequence also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: 
externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), 
assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), 
true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), 
assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
ObjectType(class scala.Option))), None), input[0, 
scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
As with the first example, this succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}
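The title and description were later broadened (see the updates above) to cover Option[Date], Option[Timestamp], and Option[BigDecimal] elements as well. A spark-shell sketch of those variants, derived from that list (the values below are arbitrary; each {{toDF}} call is expected to fail the same way on 3.4.1, 3.5.0, and master):
{noformat}
import java.sql.{Date, Timestamp}

// Additional failing cases per the updated description.
Seq(Seq(Some(Date.valueOf("2023-11-11")))).toDF("a")
Seq(Seq(Some(Timestamp.valueOf("2023-11-11 00:00:00")))).toDF("a")
Seq(Seq(Some(BigDecimal(1)))).toDF("a")
Seq(Map(0 -> Some(Date.valueOf("2023-11-11")))).toDF("a")
{noformat}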



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45797) Discrepancies in PySpark DataFrame Results When Using Window Functions and Filters

2023-11-05 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783015#comment-17783015
 ] 

Bruce Robbins commented on SPARK-45797:
---

I wonder if this is the same as SPARK-45543, which had two window specs and 
then produced wrong answers when filtered on rank = 1.
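For context, a generic Scala sketch of the pattern that comment refers to (two window specs over the same partition, then a filter on rank = 1); it mirrors the PySpark repro quoted below and is illustrative only, not taken from SPARK-45543 itself:
{noformat}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumes `df` is the base DataFrame from the repro below (id, date_start, date_end, status).
val part = Window.partitionBy("id").orderBy("date_start", "date_end")
val full = part.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val a = df
  .withColumn("date_end_of_last_close",
    max(when(col("status") === "close", col("date_end"))).over(full))
  .withColumn("rank", row_number().over(part))

// The reported discrepancy appears after filtering on the row_number column.
a.filter(col("rank") === 1).drop("rank").show()
{noformat}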

> Discrepancies in PySpark DataFrame Results When Using Window Functions and 
> Filters
> --
>
> Key: SPARK-45797
> URL: https://issues.apache.org/jira/browse/SPARK-45797
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.5.0
> Environment: Python 3.10
> Pyspark 3.5.0
> Ubuntu 22.04.3 LTS
>Reporter: Daniel Diego Horcajuelo
>Priority: Major
> Fix For: 3.5.0
>
>
> When doing certain types of transformations on a dataframe which involve 
> window functions with filters I am getting the wrong results. Here is a 
> minimal example of the results I get with my code:
>  
> {code:java}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> from pyspark.sql.window import Window as w
> from datetime import datetime, date
> spark = SparkSession.builder.config("spark.sql.repl.eagerEval.enabled", 
> True).getOrCreate()
> # Base dataframe
> df = spark.createDataFrame(
> [
> (1, date(2023, 10, 1), date(2023, 10, 2), "open"),
> (1, date(2023, 10, 2), date(2023, 10, 3), "close"),
> (2, date(2023, 10, 1), date(2023, 10, 2), "close"),
> (2, date(2023, 10, 2), date(2023, 10, 4), "close"),
> (3, date(2023, 10, 2), date(2023, 10, 4), "open"),
> (3, date(2023, 10, 3), date(2023, 10, 6), "open"),
> ],
> schema="id integer, date_start date, date_end date, status string"
> )
> # We define two partition functions
> partition = w.partitionBy("id").orderBy("date_start", 
> "date_end").rowsBetween(w.unboundedPreceding, w.unboundedFollowing)
> partition2 = w.partitionBy("id").orderBy("date_start", "date_end")
> # Define dataframe A
> A = df.withColumn(
> "date_end_of_last_close",
> f.max(f.when(f.col("status") == "close", 
> f.col("date_end"))).over(partition)
> ).withColumn(
> "rank",
> f.row_number().over(partition2)
> )
> display(A)
> | id | date_start | date_end   | status | date_end_of_last_close | rank |
> |----|------------|------------|--------|------------------------|------|
> | 1  | 2023-10-01 | 2023-10-02 | open   | 2023-10-03 | 1|
> | 1  | 2023-10-02 | 2023-10-03 | close  | 2023-10-03 | 2|
> | 2  | 2023-10-01 | 2023-10-02 | close  | 2023-10-04 | 1|
> | 2  | 2023-10-02 | 2023-10-04 | close  | 2023-10-04 | 2|
> | 3  | 2023-10-02 | 2023-10-04 | open   | NULL   | 1|
> | 3  | 2023-10-03 | 2023-10-06 | open   | NULL   | 2|
> # When filtering by rank = 1, I get this weird result
> A_result = A.filter(f.col("rank") == 1).drop("rank")
> display(A_result)
> | id | date_start | date_end   | status | date_end_of_last_close |
> |----|------------|------------|--------|------------------------|
> | 1  | 2023-10-01 | 2023-10-02 | open   | NULL   |
> | 2  | 2023-10-01 | 2023-10-02 | close  | 2023-10-02 |
> | 3  | 2023-10-02 | 2023-10-04 | open   | NULL   | {code}
> I think the Spark engine might be managing the internal partitions incorrectly. If 
> the dataframe is created from scratch (without transformations), the filtering 
> operation returns the right result. In pyspark 3.4.0 this error doesn't 
> happen.
>  
> For more details, please check out this same question in stackoverflow: 
> [stackoverflow 
> question|https://stackoverflow.com/questions/77396807/discrepancies-in-pyspark-dataframe-results-when-using-window-functions-and-filte?noredirect=1#comment136446225_77396807]
>  
> I'll mark this issue as important because it affects some basic operations 
> which are used daily.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-10-31 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781531#comment-17781531
 ] 

Bruce Robbins commented on SPARK-45644:
---

I will look into it and try to submit a fix. If I can't, I will ping someone 
who can.

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> I do not really know if this is a bug, but I am at the end of my knowledge.
> A Spark job ran successfully with Spark 3.2.x and 3.3.x. 
> But after upgrading to 3.4.1 (as well as with 3.5.0) running the same job 
> with the same data the following always occurs now:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> 

[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-10-31 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781494#comment-17781494
 ] 

Bruce Robbins commented on SPARK-45644:
---

OK, I can reproduce. I will take a look. I will also try to get my reproduction 
example down to a minimal case and will post here later.

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> I do not really know if this is a bug, but I am at the end of my knowledge.
> A Spark job ran successfully with Spark 3.2.x and 3.3.x. 
> But after upgrading to 3.4.1 (as well as to 3.5.0), running the same job 
> with the same data now always produces the following:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> 

[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-10-30 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781091#comment-17781091
 ] 

Bruce Robbins commented on SPARK-45644:
---

You can turn on display of the generated code by adding the following to your 
log4j conf:
{noformat}
logger.codegen.name = 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator
logger.codegen.level = debug
{noformat}
Do you have any application code you can share? It looks like the error happens 
at the start of the job (task 0 stage 0).

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> I do not really know if this is a bug, but I am at the end of my knowledge.
> A Spark job ran successfully with Spark 3.2.x and 3.3.x. 
> But after upgrading to 3.4.1 (as well as to 3.5.0), running the same job 
> with the same data now always produces the following:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.RuntimeException: 

[jira] [Updated] (SPARK-45580) Subquery changes the output schema of outer query

2023-10-21 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45580:
--
Summary: Subquery changes the output schema of outer query  (was: 
RewritePredicateSubquery unexpectedly changes the output schema of certain 
queries)

> Subquery changes the output schema of outer query
> -
>
> Key: SPARK-45580
> URL: https://issues.apache.org/jira/browse/SPARK-45580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query has a superfluous boolean 
> column:
> {noformat}
> select *
> from t1
> where exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> 1 false
> 2 false
> 3 true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When executed via the {{Dataset}} API, you don't see the incorrect result, 
> because the Dataset API truncates the right-side of the rows based on the 
> analyzed plan's schema (it's the optimized plan's schema that goes wrong).
> However, even with the {{Dataset}} API, this query goes wrong:
> {noformat}
> select (
>   select *
>   from t1
>   where exists (
> select c1
> from t2
> where a = c1
> or a in (select col1 from t3)
>   )
>   limit 1
> )
> from range(1);
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
> something went wrong in analysis
>   at scala.Predef$.assert(Predef.scala:279)
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
> ...
> {noformat}
> Other queries that have the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>   select c1
>   from t2
>   where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45580) Subquery changes the output schema of the outer query

2023-10-21 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45580:
--
Summary: Subquery changes the output schema of the outer query  (was: 
Subquery changes the output schema of outer query)

> Subquery changes the output schema of the outer query
> -
>
> Key: SPARK-45580
> URL: https://issues.apache.org/jira/browse/SPARK-45580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query has a superfluous boolean 
> column:
> {noformat}
> select *
> from t1
> where exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> 1 false
> 2 false
> 3 true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When executed via the {{Dataset}} API, you don't see the incorrect result, 
> because the Dataset API truncates the right-side of the rows based on the 
> analyzed plan's schema (it's the optimized plan's schema that goes wrong).
> However, even with the {{Dataset}} API, this query goes wrong:
> {noformat}
> select (
>   select *
>   from t1
>   where exists (
> select c1
> from t2
> where a = c1
> or a in (select col1 from t3)
>   )
>   limit 1
> )
> from range(1);
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
> something went wrong in analysis
>   at scala.Predef$.assert(Predef.scala:279)
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
> ...
> {noformat}
> Other queries that have the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>   select c1
>   from t2
>   where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.

2023-10-20 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-45583.
---
Resolution: Fixed

> Spark SQL returning incorrect values for full outer join on keys with the 
> same name.
> 
>
> Key: SPARK-45583
> URL: https://issues.apache.org/jira/browse/SPARK-45583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Huw
>Priority: Major
> Fix For: 3.5.0
>
>
> The following query gives the wrong results:
> {noformat}
> WITH people as (
>   SELECT * FROM (VALUES
>     (1, 'Peter'),
>     (2, 'Homer'),
>     (3, 'Ned'),
>     (3, 'Jenny')
>   ) AS Idiots(id, FirstName)
> ), location as (
>   SELECT * FROM (VALUES
>     (1, 'sample0'),
>     (1, 'sample1'),
>     (2, 'sample2')
>   ) as Locations(id, address)
> )
> SELECT
>   *
> FROM
>   people
> FULL OUTER JOIN
>   location
> ON
>   people.id = location.id
> {noformat}
> We find the following table:
> ||id: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |null|Ned|null|null|
> |null|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|
> But clearly the first `id` column is wrong, the nulls should be 3.
> If we rename the id column in (only) the person table to pid we get the 
> correct results:
> ||pid: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |3|Ned|null|null|
> |3|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45601) stackoverflow when executing rule ExtractWindowExpressions

2023-10-19 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1304#comment-1304
 ] 

Bruce Robbins commented on SPARK-45601:
---

Possibly SPARK-38666

> stackoverflow when executing rule ExtractWindowExpressions
> --
>
> Key: SPARK-45601
> URL: https://issues.apache.org/jira/browse/SPARK-45601
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: JacobZheng
>Priority: Major
>
> I am encountering stack overflow errors while executing the following test 
> case. I looked at the source code: ExtractWindowExpressions does not extract 
> the window expression correctly and gets stuck in an infinite loop in 
> resolveOperatorsDownWithPruning, which causes the error.
> {code:scala}
>  test("agg filter contains window") {
> val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3")
>   .withColumn("test",
> expr("count(col1) filter (where min(col1) over(partition by col2 
> order by col3)>1)"))
> src.show()
>   }
> {code}
> Now my question is: is this kind of window function inside an aggregate's 
> filter clause valid usage? Or should I add a check, as Spark SQL does for 
> WHERE clauses, and throw an error like "It is not allowed to use window 
> functions inside WHERE clause"?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.

2023-10-18 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776783#comment-17776783
 ] 

Bruce Robbins commented on SPARK-45583:
---

Strangely, I cannot reproduce. Is some setting required?
{noformat}
sql("select version()").show(false)
+----------------------------------------------+
|version()                                     |
+----------------------------------------------+
|3.5.0 ce5ddad990373636e94071e7cef2f31021add07b|
+----------------------------------------------+

scala> sql("""WITH people as (
  SELECT * FROM (VALUES 
(1, 'Peter'), 
(2, 'Homer'), 
(3, 'Ned'),
(3, 'Jenny')
  ) AS Idiots(id, FirstName)
), location as (
  SELECT * FROM (VALUES
(1, 'sample0'),
(1, 'sample1'),
(2, 'sample2')  
  ) as Locations(id, address)
)SELECT
  *
FROM
  people
FULL OUTER JOIN
  location
ON
  people.id = location.id""").show(false)
+---+---------+----+-------+
|id |FirstName|id  |address|
+---+---------+----+-------+
|1  |Peter    |1   |sample0|
|1  |Peter    |1   |sample1|
|2  |Homer    |2   |sample2|
|3  |Ned      |NULL|NULL   |
|3  |Jenny    |NULL|NULL   |
+---+---------+----+-------+

scala> 
{noformat}

> Spark SQL returning incorrect values for full outer join on keys with the 
> same name.
> 
>
> Key: SPARK-45583
> URL: https://issues.apache.org/jira/browse/SPARK-45583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Huw
>Priority: Major
>
> The following query gives the wrong results:
> {noformat}
> WITH people as (
>   SELECT * FROM (VALUES
>     (1, 'Peter'),
>     (2, 'Homer'),
>     (3, 'Ned'),
>     (3, 'Jenny')
>   ) AS Idiots(id, FirstName)
> ), location as (
>   SELECT * FROM (VALUES
>     (1, 'sample0'),
>     (1, 'sample1'),
>     (2, 'sample2')
>   ) as Locations(id, address)
> )
> SELECT
>   *
> FROM
>   people
> FULL OUTER JOIN
>   location
> ON
>   people.id = location.id
> {noformat}
> We find the following table:
> ||id: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |null|Ned|null|null|
> |null|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|
> But clearly the first `id` column is wrong, the nulls should be 3.
> If we rename the id column in (only) the person table to pid we get the 
> correct results:
> ||pid: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |3|Ned|null|null|
> |3|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries

2023-10-17 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776401#comment-17776401
 ] 

Bruce Robbins commented on SPARK-45580:
---

I'll make a PR in the coming days.

> RewritePredicateSubquery unexpectedly changes the output schema of certain 
> queries
> --
>
> Key: SPARK-45580
> URL: https://issues.apache.org/jira/browse/SPARK-45580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query has a superfluous boolean 
> column:
> {noformat}
> select *
> from t1
> where exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> 1 false
> 2 false
> 3 true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When executed via the {{Dataset}} API, you don't see the incorrect result, 
> because the Dataset API truncates the right-side of the rows based on the 
> analyzed plan's schema (it's the optimized plan's schema that goes wrong).
> However, even with the {{Dataset}} API, this query goes wrong:
> {noformat}
> select (
>   select *
>   from t1
>   where exists (
> select c1
> from t2
> where a = c1
> or a in (select col1 from t3)
>   )
>   limit 1
> )
> from range(1);
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
> something went wrong in analysis
>   at scala.Predef$.assert(Predef.scala:279)
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
> ...
> {noformat}
> Other queries that have the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>   select c1
>   from t2
>   where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries

2023-10-17 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45580:
--
Description: 
A query can have an incorrect output schema because of a subquery.

Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1   false
2   false
3   true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When executed via the {{Dataset}} API, you don't see the incorrect result, 
because the Dataset API truncates the right-side of the rows based on the 
analyzed plan's schema (it's the optimized plan's schema that goes wrong).
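A quick way to observe the mismatch from a Scala shell (just a sketch, assuming a spark-shell session where the three views above have already been created and cached):
{code:scala}
// Compare the schema of the analyzed plan (what the Dataset API reports)
// with the schema of the optimized plan (what the executed rows contain).
// On affected versions the optimized plan picks up the superfluous boolean column.
val df = spark.sql("""
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )""")
println(df.queryExecution.analyzed.schema.treeString)      // a single field: a
println(df.queryExecution.optimizedPlan.schema.treeString) // two fields when the bug is present
{code}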

However, even with the {{Dataset}} API, this query goes wrong:
{noformat}
select (
  select *
  from t1
  where exists (
select c1
from t2
where a = c1
or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
at scala.Predef$.assert(Predef.scala:279)
at 
org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
...
{noformat}
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}


  was:
A query can have an incorrect output schema because of a subquery.

Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1   false
2   false
3   true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When executed via the {{Dataset}} API, you don't see this result, because the 
Dataset API truncates the right-side of the rows based on the analyzed plan's 
schema (it's the optimized plan's schema that goes wrong).

However, even with the {{Dataset}} API, this query goes wrong:
{noformat}
select (
  select *
  from t1
  where exists (
select c1
from t2
where a = c1
or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
at scala.Predef$.assert(Predef.scala:279)
at 
org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
...
{noformat}
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}



> RewritePredicateSubquery unexpectedly changes the output schema of certain 
> queries
> --
>
> Key: SPARK-45580
> URL: https://issues.apache.org/jira/browse/SPARK-45580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or 

[jira] [Created] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries

2023-10-17 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-45580:
-

 Summary: RewritePredicateSubquery unexpectedly changes the output 
schema of certain queries
 Key: SPARK-45580
 URL: https://issues.apache.org/jira/browse/SPARK-45580
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.1, 3.3.3
Reporter: Bruce Robbins


A query can have an incorrect output schema because of a subquery.

Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1   false
2   false
3   true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When executed via the {{Dataset}} API, you don't see this result, because the 
Dataset API truncates the right-side of the rows based on the analyzed plan's 
schema (it's the optimized plan's schema that goes wrong).

However, even with the {{Dataset}} API, this query goes wrong:
{noformat}
select (
  select *
  from t1
  where exists (
select c1
from t2
where a = c1
or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
at scala.Predef$.assert(Predef.scala:279)
at 
org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
...
{noformat}
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45440) Incorrect summary counts from a CSV file

2023-10-06 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772724#comment-17772724
 ] 

Bruce Robbins commented on SPARK-45440:
---

I added {{inferSchema=true}} as a datasource option in your example and I got 
the expected answer. Otherwise it's doing a max and min on a string (not a 
number).
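A tiny illustration of the difference (a sketch with made-up data, not the AAPL file; assumes a spark-shell style session with {{spark}} in scope):
{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._

// Without inferSchema every CSV column is a string, so min/max compare lexicographically.
val asStrings = Seq("9", "10", "100").toDF("vwap")
val asDoubles = asStrings.withColumn("vwap", col("vwap").cast("double"))

asStrings.agg(min("vwap"), max("vwap")).show()  // min=10,  max=9     (string comparison)
asDoubles.agg(min("vwap"), max("vwap")).show()  // min=9.0, max=100.0 (numeric comparison)
{code}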

> Incorrect summary counts from a CSV file
> 
>
> Key: SPARK-45440
> URL: https://issues.apache.org/jira/browse/SPARK-45440
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.5.0
> Environment: Pyspark version 3.5.0 
>Reporter: Evan Volgas
>Priority: Major
>  Labels: aggregation, bug, pyspark
>
> I am using pip-installed Pyspark version 3.5.0 inside the context of an 
> IPython shell. The task is straightforward: take [this CSV 
> file|https://gist.githubusercontent.com/evanvolgas/e5cb082673ec947239658291f2251de4/raw/a9c5e9866ac662a816f9f3828a2d184032f604f0/AAPL.csv]
>  of AAPL stock prices and compute the minimum and maximum volume weighted 
> average price for the entire file. 
> My code is 
> [here|https://gist.github.com/evanvolgas/e4aa75fec4179bb7075a5283867f127c]. 
> I've also performed the same computation in DuckDB because I noticed that the 
> results of the Spark code are wrong. 
> Literally, the exact same SQL in DuckDB and in Spark yields different results, 
> and Spark's are wrong. 
> I have never seen this behavior in a Spark release before. I'm very confused 
> by it, and curious if anyone else can replicate this behavior. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45171) GenerateExec fails to initialize non-deterministic expressions before use

2023-09-14 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-45171:
-

 Summary: GenerateExec fails to initialize non-deterministic 
expressions before use
 Key: SPARK-45171
 URL: https://issues.apache.org/jira/browse/SPARK-45171
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


The following query fails:
{noformat}
select *
from explode(
  transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22)
);
{noformat}
The error is:
{noformat}
23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: requirement failed: Nondeterministic 
expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized 
before eval.
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497)
at 
org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495)
at 
org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35)
at 
org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543)
at 
org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
at 
org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062)
at 
org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275)
at 
org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274)
at 
org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308)
at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375)
at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
...
{noformat}
However, this query succeeds:
{noformat}
select *
from explode(
  sequence(0, cast(rand()*1000 as int) + 1)
);
{noformat}
The difference is that {{transform}} turns off whole-stage codegen, which 
exposes a bug in {{GenerateExec}} where the non-deterministic expression passed 
to the generator function is not initialized before being used.

An even simpler repro case is:
{noformat}
set spark.sql.codegen.wholeStage=false;

select explode(array(rand()));
{noformat}
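For reference, the kind of per-partition initialization that is missing (a rough sketch against Catalyst internals, not the actual patch):
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, Nondeterministic}

// Before evaluating the generator interpretively, every non-deterministic
// subexpression (e.g. rand()) must be initialized for the current partition,
// similar to how other physical operators initialize their interpreted
// projections and predicates per partition.
def initNondeterministic(expr: Expression, partitionIndex: Int): Unit = {
  expr.foreach {
    case nd: Nondeterministic => nd.initialize(partitionIndex)
    case _ =>
  }
}
{code}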




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44912) Spark 3.4 multi-column sum slows with many columns

2023-09-10 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763455#comment-17763455
 ] 

Bruce Robbins commented on SPARK-44912:
---

It looks like this was fixed with SPARK-45071. Your issue was reported earlier, 
but missed somehow.

> Spark 3.4 multi-column sum slows with many columns
> --
>
> Key: SPARK-44912
> URL: https://issues.apache.org/jira/browse/SPARK-44912
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Brady Bickel
>Priority: Major
>
> The code below is a minimal reproducible example of an issue I discovered 
> with Pyspark 3.4.x. I want to sum the values of multiple columns and put the 
> sum of those columns (per row) into a new column. This code works and returns 
> in a reasonable amount of time in Pyspark 3.3.x, but is extremely slow in 
> Pyspark 3.4.x when the number of columns grows. See below for execution 
> timing summary as N varies.
> {code:java}
> import pyspark.sql.functions as F
> import random
> import string
> from functools import reduce
> from operator import add
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> # generate a dataframe N columns by M rows with random 8 digit column 
> # names and random integers in [-5,10]
> N = 30
> M = 100
> columns = [''.join(random.choices(string.ascii_uppercase +
>   string.digits, k=8))
>for _ in range(N)]
> data = [tuple([random.randint(-5,10) for _ in range(N)])
> for _ in range(M)]
> df = spark.sparkContext.parallelize(data).toDF(columns)
> # 3 ways to add a sum column, all of them slow for high N in spark 3.4
> df = df.withColumn("col_sum1", sum(df[col] for col in columns))
> df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
> df = df.withColumn("col_sum3", F.expr("+".join(columns))) {code}
> Timing results for Spark 3.3:
> ||N||Exe Time (s)||
> |5|0.514|
> |10|0.248|
> |15|0.327|
> |20|0.403|
> |25|0.279|
> |30|0.322|
> |50|0.430|
> Timing results for Spark 3.4:
> ||N||Exe Time (s)||
> |5|0.379|
> |10|0.318|
> |15|0.405|
> |20|1.32|
> |25|28.8|
> |30|448|
> |50|>1 (did not finish)|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45106) percentile_cont gets internal error when user input fails runtime replacement's input type check

2023-09-08 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45106:
--
Affects Version/s: 3.3.2

>  percentile_cont gets internal error when user input fails runtime 
> replacement's input type check
> -
>
> Key: SPARK-45106
> URL: https://issues.apache.org/jira/browse/SPARK-45106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
>
> This query throws an internal error rather than producing a useful error 
> message:
> {noformat}
> select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x 
> from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b);
> [INTERNAL_ERROR] Cannot resolve the runtime replaceable expression 
> "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
> org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime 
> replaceable expression "percentile_cont(a, b)". The replacement is 
> unresolved: "percentile(a, b, 1)".
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:92)
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:96)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277)
> ...
> {noformat}
> It should instead inform the user that the input expression must be foldable.
> {{PercentileCont}} does not check the user's input. If the runtime 
> replacement (an instance of {{Percentile}}) rejects the user's input, the 
> runtime replacement ends up unresolved.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45106) percentile_cont gets internal error when user input fails runtime replacement's input type check

2023-09-08 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-45106:
-

 Summary:  percentile_cont gets internal error when user input 
fails runtime replacement's input type check
 Key: SPARK-45106
 URL: https://issues.apache.org/jira/browse/SPARK-45106
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1, 3.5.0, 4.0.0
Reporter: Bruce Robbins


This query throws an internal error rather than producing a useful error 
message:
{noformat}
select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x 
from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b);

[INTERNAL_ERROR] Cannot resolve the runtime replaceable expression 
"percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime 
replaceable expression "percentile_cont(a, b)". The replacement is unresolved: 
"percentile(a, b, 1)".
at 
org.apache.spark.SparkException$.internalError(SparkException.scala:92)
at 
org.apache.spark.SparkException$.internalError(SparkException.scala:96)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277)
...
{noformat}
It should instead inform the user that the input expression must be foldable.

{{PercentileCont}} does not check the user's input. If the runtime replacement 
(an instance of {{Percentile}}) rejects the user's input, the runtime 
replacement ends up unresolved.
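For comparison, the expected usage with a foldable percentage resolves fine (a sketch, run via spark.sql from a Scala shell):
{code:scala}
// Passing a literal (foldable) percentage instead of the column `b`
// lets the runtime replacement resolve (here to percentile(a, 0.25, 1)).
spark.sql("""
  select percentile_cont(0.25) within group (order by a desc) as x
  from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b)
""").show()
{code}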




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-09-07 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44805:
--
Affects Version/s: 3.4.1

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.4.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>  Labels: correctness
>
> When union-ing two DataFrames read from parquet containing nested structures 
> (2 fields of array types where one is double and second is integer) data from 
> the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-09-07 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762792#comment-17762792
 ] 

Bruce Robbins commented on SPARK-44805:
---

PR here: https://github.com/apache/spark/pull/42850

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>  Labels: correctness
>
> When union-ing two DataFrames read from parquet containing nested structures 
> (2 fields of array types where one is double and second is integer) data from 
> the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-09-05 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762234#comment-17762234
 ] 

Bruce Robbins commented on SPARK-44805:
---

I looked at this yesterday and I think I have a handle on what's going on. I 
will make a PR in the coming days.
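In the meantime, a possible workaround (untested sketch) is to disable the nested vectorized Parquet reader that the report identifies as the trigger:
{code:scala}
// Turn off the nested-column vectorized reader for this session before
// reading the Parquet files; per the report, the non-vectorized path
// returns the correct data.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "false")
{code}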

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>  Labels: correctness
>
> When union-ing two DataFrames read from parquet containing nested structures 
> (2 fields of array types where one is double and second is integer) data from 
> the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-09-04 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44805:
--
Labels: correctness  (was: )

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>  Labels: correctness
>
> When union-ing two DataFrames read from parquet containing nested structures 
> (2 fields of array types where one is double and second is integer) data from 
> the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-08-14 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754344#comment-17754344
 ] 

Bruce Robbins edited comment on SPARK-44805 at 8/15/23 12:26 AM:
-

[~sunchao] 

It seems to be some weird interaction between Parquet nested vectorization and 
the {{Cast}} expression:
{noformat}
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select value from t1;
{"f1":[1,2,3],"f2":[1,1,2]} <== this is expected
Time taken: 0.126 seconds, Fetched 1 row(s)

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}   <== this is not expected
Time taken: 0.102 seconds, Fetched 1 row(s)

set spark.sql.parquet.enableNestedColumnVectorizedReader=false;

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[1,1,2]}   <== now has expected value
Time taken: 0.244 seconds, Fetched 1 row(s)
{noformat}
The union operation adds this {{Cast}} expression because {{value}} has 
different datatypes between your two dataframes.
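
As a possible mitigation while this is investigated, the nested vectorized reader can also be disabled from code, with the same effect as the {{SET}} above. A minimal Scala sketch, assuming an active {{SparkSession}} named {{spark}} (as in {{spark-shell}}) and hypothetical paths for the two Parquet directories:
{noformat}
// Turn off the nested column vectorized reader (config key from this report),
// then redo the union; f2 values should then survive, as in the SQL session above.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "false")

val parquet1 = spark.read.parquet("/path/to/data1")  // hypothetical path
val parquet2 = spark.read.parquet("/path/to/data2")  // hypothetical path
val out = parquet1.union(parquet2)
out.select("value.f2").distinct().show()
{noformat}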


was (Author: bersprockets):
It seems to be some weird interaction between Parquet and the {{Cast}} 
expression:
{noformat}
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select value from t1;
{"f1":[1,2,3],"f2":[1,1,2]} <== this is expected
Time taken: 0.126 seconds, Fetched 1 row(s)

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}   <== this is not expected
Time taken: 0.102 seconds, Fetched 1 row(s)
{noformat}
The union operation adds this {{Cast}} expression because {{value}} has 
different datatypes between your two dataframes.

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>
> When union-ing two DataFrames read from Parquet that contain nested structures 
> (a struct with two array fields, one of doubles and one of integers), data from 
> the second field seems to be lost (zeros appear instead). 
> This seems to happen only if the nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best 

[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-08-14 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754344#comment-17754344
 ] 

Bruce Robbins commented on SPARK-44805:
---

It seems to be some weird interaction between Parquet and the {{Cast}} 
expression:
{noformat}
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select value from t1;
{"f1":[1,2,3],"f2":[1,1,2]} <== this is expected
Time taken: 0.126 seconds, Fetched 1 row(s)

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}   <== this is not expected
Time taken: 0.102 seconds, Fetched 1 row(s)
{noformat}
The union operation adds this {{Cast}} expression because {{value}} has 
different datatypes between your two dataframes.

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>
> When union-ing two DataFrames read from Parquet that contain nested structures 
> (a struct with two array fields, one of doubles and one of integers), data from 
> the second field seems to be lost (zeros appear instead). 
> This seems to happen only if the nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44477) CheckAnalysis uses error subclass as an error class

2023-07-18 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744314#comment-17744314
 ] 

Bruce Robbins commented on SPARK-44477:
---

PR here: https://github.com/apache/spark/pull/42064

> CheckAnalysis uses error subclass as an error class
> ---
>
> Key: SPARK-44477
> URL: https://issues.apache.org/jira/browse/SPARK-44477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> {{CheckAnalysis}} treats {{TYPE_CHECK_FAILURE_WITH_HINT}} as an error class, 
> but it is instead an error subclass of {{{}DATATYPE_MISMATCH{}}}.
> {noformat}
> spark-sql (default)> select bitmap_count(12);
> [INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT'
> org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot find main error 
> class 'TYPE_CHECK_FAILURE_WITH_HINT'
> at org.apache.spark.SparkException$.internalError(SparkException.scala:83)
> at org.apache.spark.SparkException$.internalError(SparkException.scala:87)
> at 
> org.apache.spark.ErrorClassesJsonReader.$anonfun$getMessageTemplate$1(ErrorClassesJSONReader.scala:68)
> at scala.collection.immutable.HashMap$HashMap1.getOrElse0(HashMap.scala:361)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:594)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:589)
> at scala.collection.immutable.HashMap.getOrElse(HashMap.scala:73)
> {noformat}
> This issue only occurs when an expression uses 
> {{TypeCheckResult.TypeCheckFailure}} to indicate input type check failure. 
> {{TypeCheckResult.TypeCheckFailure}} appears to be deprecated in favor of 
> {{{}TypeCheckResult.DataTypeMismatch{}}}, but recently two expressions were 
> added that use {{{}TypeCheckResult.TypeCheckFailure{}}}: {{BitmapCount}} and 
> {{{}BitmapOrAgg{}}}.
> {{BitmapCount}} and {{BitmapOrAgg}} should probably be fixed to use 
> {{{}TypeCheckResult.DataTypeMismatch{}}}. Regardless, the code in 
> {{CheckAnalysis}} that handles {{TypeCheckResult.TypeCheckFailure}} should be 
> corrected (or removed).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44477) CheckAnalysis uses error subclass as an error class

2023-07-18 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-44477:
-

 Summary: CheckAnalysis uses error subclass as an error class
 Key: SPARK-44477
 URL: https://issues.apache.org/jira/browse/SPARK-44477
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


{{CheckAnalysis}} treats {{TYPE_CHECK_FAILURE_WITH_HINT}} as an error class, 
but it is instead an error subclass of {{{}DATATYPE_MISMATCH{}}}.
{noformat}
spark-sql (default)> select bitmap_count(12);
[INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT'
org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot find main error class 
'TYPE_CHECK_FAILURE_WITH_HINT'
at org.apache.spark.SparkException$.internalError(SparkException.scala:83)
at org.apache.spark.SparkException$.internalError(SparkException.scala:87)
at 
org.apache.spark.ErrorClassesJsonReader.$anonfun$getMessageTemplate$1(ErrorClassesJSONReader.scala:68)
at scala.collection.immutable.HashMap$HashMap1.getOrElse0(HashMap.scala:361)
at scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:594)
at scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:589)
at scala.collection.immutable.HashMap.getOrElse(HashMap.scala:73)
{noformat}
This issue only occurs when an expression uses 
{{TypeCheckResult.TypeCheckFailure}} to indicate input type check failure. 
{{TypeCheckResult.TypeCheckFailure}} appears to be deprecated in favor of 
{{{}TypeCheckResult.DataTypeMismatch{}}}, but recently two expressions were 
added that use {{{}TypeCheckResult.TypeCheckFailure{}}}: {{BitmapCount}} and 
{{{}BitmapOrAgg{}}}.

{{BitmapCount}} and {{BitmapOrAgg}} should probably be fixed to use 
{{{}TypeCheckResult.DataTypeMismatch{}}}. Regardless, the code in 
{{CheckAnalysis}} that handles {{TypeCheckResult.TypeCheckFailure}} should be 
corrected (or removed).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-07-01 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44251:
--
Labels: correctness  (was: )

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-30 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44251:
--
Affects Version/s: 3.3.2

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-30 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44251:
--
Affects Version/s: 3.4.1

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-30 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17739180#comment-17739180
 ] 

Bruce Robbins commented on SPARK-44251:
---

PR can be found here: https://github.com/apache/spark/pull/41809

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-29 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738762#comment-17738762
 ] 

Bruce Robbins commented on SPARK-44251:
---

This is similar to, but not quite the same as SPARK-43718, and the fix will be 
similar too.

I will make a PR shortly.
 

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-29 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44251:
--
Summary: Potential for incorrect results or NPE when full outer USING join 
has null key value  (was: Potentially incorrect results or NPE when full outer 
USING join has null key value)

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44251) Potentially incorrect results or NPE when full outer USING join has null key value

2023-06-29 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-44251:
-

 Summary: Potentially incorrect results or NPE when full outer 
USING join has null key value
 Key: SPARK-44251
 URL: https://issues.apache.org/jira/browse/SPARK-44251
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


The following query produces incorrect results:
{noformat}
create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
create or replace temp view v2 as values (2, 3) as (c1, c2);

select explode(array(c1)) as x
from v1
full outer join v2
using (c1);

-1   <== should be null
1
2
{noformat}
The following query fails with a {{NullPointerException}}:
{noformat}
create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
create or replace temp view v2 as values ('2', 3) as (c1, c2);

select explode(array(c1)) as x
from v1
full outer join v2
using (c1);

23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
...
{noformat}
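
For anyone hitting this before a fix lands, an explicit {{ON}} join with a hand-written {{coalesce}} may sidestep the hidden key column that the {{USING}} syntax introduces. A minimal, untested Scala sketch of that rewrite (same data as the repro above, assuming an active {{SparkSession}} named {{spark}}):
{noformat}
// Same values as the repro above, but the join key is coalesced explicitly
// instead of relying on the USING-generated output column.
spark.sql("create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2)")
spark.sql("create or replace temp view v2 as values (2, 3) as (c1, c2)")

spark.sql("""
  select explode(array(coalesce(v1.c1, v2.c1))) as x
  from v1
  full outer join v2
  on v1.c1 = v2.c1
""").show()
{noformat}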




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735976#comment-17735976
 ] 

Bruce Robbins commented on SPARK-44132:
---

[~steven.aerts] Go for it!

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying java bean encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> will generate invalid code in the code generator and, depending on the data 
> used, can produce stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  //< null 
> check for wrong/left parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes 
> NPE on right parameter here{code}
> It is as if nesting two full outer joins confuses the code generator, causing 
> it to generate invalid code.
> There is one other strange thing.  We found this issue when using data sets 
> which were using the java bean encoder.  We tried to reproduce this in the 
> spark shell or using scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stacktrace 
> above) on the spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] to this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735944#comment-17735944
 ] 

Bruce Robbins edited comment on SPARK-44132 at 6/22/23 1:51 AM:


You may have this figured out already, but in case not, here's a clue.

You can replicate the NPE in {{spark-shell}} as follows:
{noformat}
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), 
"full_outer");
joined.collectAsList
{noformat}

I think it's because the join column sequence {{idSeq}} (in your unit test) is 
provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream:
{noformat}
scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter(
Collections.singletonList("id")
).asScala.toSeq;
 |  | res2: Seq[String] = Stream(id, ?)

scala> 
{noformat}
This seems to be a bug in the handling of the join columns, but only in the case 
where they are provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, 
SPARK-38221, SPARK-26680).
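
If that is indeed the cause, forcing a strict collection before handing the join columns to {{join}} should avoid the problem. A minimal Scala sketch, reusing the {{dsA}}/{{dsB}}/{{dsC}} defined above (the {{.toList}} is the only change from the converter-based code path):
{noformat}
// Materialize the join-column sequence as a List instead of a lazy Stream.
val idSeq: Seq[String] = scala.collection.JavaConverters
  .collectionAsScalaIterableConverter(java.util.Collections.singletonList("id"))
  .asScala.toList

val joined = dsA.join(dsB, idSeq, "full_outer").join(dsC, idSeq, "full_outer")
joined.collectAsList
{noformat}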


was (Author: bersprockets):
You may have this figured out already, but in case not, here's a clue.

You can replicate the NPE in {{spark-shell}} as follows:
{noformat}
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), 
"full_outer");
joined.collectAsList
{noformat}

I think it's because the join column sequence {{idSeq}} (in your unit test) is 
provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream:
{noformat}
scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter(
Collections.singletonList("id")
).asScala.toSeq;
 |  | res2: Seq[String] = Stream(id, ?)

scala> 
{noformat}
This seems to be a bug in the handling of the join columns, but only in the case 
where they are provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, 
SPARK-38221).

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying java bean encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> will generate invalid code in the code generator and, depending on the data 
> used, can produce stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  //< null 
> check for wrong/left parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes 
> NPE on right parameter here{code}
> It is as if nesting two full outer joins confuses the code generator, causing 
> it to generate invalid code.
> There is one other strange thing.  We found this issue when using data sets 
> which were using the java bean encoder.  We tried to reproduce this in the 
> spark 

[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735944#comment-17735944
 ] 

Bruce Robbins commented on SPARK-44132:
---

You may have this figured out already, but in case not, here's a clue.

You can replicate the NPE in {{spark-shell}} as follows:
{noformat}
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), 
"full_outer");
joined.collectAsList
{noformat}

I think it's because the join column sequence {{idSeq}} (in your unit test) is 
provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream:
{noformat}
scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter(
Collections.singletonList("id")
).asScala.toSeq;
 |  | res2: Seq[String] = Stream(id, ?)

scala> 
{noformat}
This seems to be a bug in the handling of the join columns, but only in the case 
where they are provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, 
SPARK-38221).

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying java bean encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> will generate invalid code in the code generator and, depending on the data 
> used, can produce stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  //< null 
> check for wrong/left parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes 
> NPE on right parameter here{code}
> It is as if nesting two full outer joins confuses the code generator, causing 
> it to generate invalid code.
> There is one other strange thing.  We found this issue when using data sets 
> which were using the java bean encoder.  We tried to reproduce this in the 
> spark shell or using scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stacktrace 
> above) on the spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] to this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44040) Incorrect result after count distinct

2023-06-13 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732163#comment-17732163
 ] 

Bruce Robbins commented on SPARK-44040:
---

It seems this can be reproduced in {{spark-sql}} as well.

Interestingly, turning off AQE seems to fix the issue (for both the above 
dataframe version and the below SQL version):
{noformat}
spark-sql (default)> create or replace temp view v1 as
select 1 as c1 limit 0;
Time taken: 0.959 seconds
spark-sql (default)> create or replace temp view agg1 as
select sum(c1) as c1, "agg1" as name
from v1;
Time taken: 0.16 seconds
spark-sql (default)> create or replace temp view agg2 as
select sum(c1) as c1, "agg2" as name
from v1;
Time taken: 0.035 seconds
spark-sql (default)> create or replace temp view union1 as
select * from agg1
union
select * from agg2;
Time taken: 0.088 seconds
spark-sql (default)> -- the following incorrectly produces 2 rows
select distinct c1 from union1;
NULL
NULL
Time taken: 1.649 seconds, Fetched 2 row(s)
spark-sql (default)> set spark.sql.adaptive.enabled=false;
spark.sql.adaptive.enabled  false
Time taken: 0.019 seconds, Fetched 1 row(s)
spark-sql (default)> -- the following correctly produces 1 row
select distinct c1 from union1;
NULL
Time taken: 1.372 seconds, Fetched 1 row(s)
spark-sql (default)> 
{noformat}
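
The same toggle can be applied from the dataframe repro in the description. A minimal Scala sketch, assuming the {{unionDF}} built above and an active {{SparkSession}} named {{spark}}:
{noformat}
// With AQE disabled, the distinct count should match the single NULL row shown
// by show(), as in the spark-sql session above.
spark.conf.set("spark.sql.adaptive.enabled", "false")
unionDF.select("money").distinct.count  // expected: 1
{noformat}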

> Incorrect result after count distinct
> -
>
> Key: SPARK-44040
> URL: https://issues.apache.org/jira/browse/SPARK-44040
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Aleksandr Aleksandrov
>Priority: Critical
>
> When I try to call count after the distinct function on a null Decimal field, 
> Spark returns an incorrect result, starting from Spark 3.4.0.
> A minimal example to reproduce:
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.\{Column, DataFrame, Dataset, Row, SparkSession}
> import org.apache.spark.sql.types.\{StringType, StructField, StructType}
> val schema = StructType( Array(
> StructField("money", DecimalType(38,6), true),
> StructField("reference_id", StringType, true)
> ))
> val payDf = spark.createDataFrame(sc.emptyRDD[Row], schema)
> val aggDf = payDf.agg(sum("money").as("money")).withColumn("name", lit("df1"))
> val aggDf1 = payDf.agg(sum("money").as("money")).withColumn("name", 
> lit("df2"))
> val unionDF: DataFrame = aggDf.union(aggDf1)
> unionDF.select("money").distinct.show // return correct result
> unionDF.select("money").distinct.count // return 2 instead of 1
> unionDF.select("money").distinct.count == 1 // return false
> This block of code raises an assertion error and then returns an incorrect 
> count (in Spark 3.2.1 everything works fine and I get the correct result of 1):
> *scala> unionDF.select("money").distinct.show // return correct result*
> java.lang.AssertionError: assertion failed:
> Decimal$DecimalIsFractional
> while compiling: 
> during phase: globalPhase=terminal, enteringPhase=jvm
> library version: version 2.12.17
> compiler version: version 2.12.17
> reconstructed args: -classpath 
> /Users/aleksandrov/.ivy2/jars/org.apache.spark_spark-connect_2.12-3.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-core_2.12-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-storage-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar:/Users/aleksandrov/.ivy2/jars/org.antlr_antlr4-runtime-4.9.3.jar
>  -Yrepl-class-based -Yrepl-outdir 
> /private/var/folders/qj/_dn4xbp14jn37qmdk7ylyfwcgr/T/spark-f37bb154-75f3-4db7-aea8-3c4363377bd8/repl-350f37a1-1df1-4816-bd62-97929c60a6c1
> last tree to typer: TypeTree(class Byte)
> tree position: line 6 of 
> tree tpe: Byte
> symbol: (final abstract) class Byte in package scala
> symbol definition: final abstract class Byte extends (a ClassSymbol)
> symbol package: scala
> symbol owners: class Byte
> call site: constructor $eval in object $eval in package $line19
> == Source file context for tree position ==
> 3
> 4object $eval {
> 5lazyval $result = 
> $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0
> 6lazyval $print: {_}root{_}.java.lang.String = {
> 7 $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw
> 8
> 9""
> at 
> scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185)
> at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525)
> at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514)
> at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353)
> at 
> scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346)
> at 
> scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348)
> at 
> scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487)
> at 
> 

[jira] [Resolved] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError

2023-05-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-43843.
---
Resolution: Invalid

> Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError
> ---
>
> Key: SPARK-43843
> URL: https://issues.apache.org/jira/browse/SPARK-43843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, 
> Java 11.0.12)
>Reporter: Bruce Robbins
>Priority: Major
>
> I launched spark-shell as so:
> {noformat}
> bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | 
> grep -v test | head -1`
> {noformat}
> I got the below error trying to create an AVRO file:
> {noformat}
> scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> df.write.mode("overwrite").format("avro").save("avro_file")
> df.write.mode("overwrite").format("avro").save("avro_file")
> java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps
>   at 
> org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120)
> ...
> scala> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError

2023-05-28 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17726988#comment-17726988
 ] 

Bruce Robbins commented on SPARK-43843:
---

Never mind, I had an old {{spark-avro_2.12-3.5.0-SNAPSHOT.jar}} lying around in 
my {{work}} directory, which the {{find}} in my {{--jars}} value picked up first.

> Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError
> ---
>
> Key: SPARK-43843
> URL: https://issues.apache.org/jira/browse/SPARK-43843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, 
> Java 11.0.12)
>Reporter: Bruce Robbins
>Priority: Major
>
> I launched spark-shell as so:
> {noformat}
> bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | 
> grep -v test | head -1`
> {noformat}
> I got the below error trying to create an AVRO file:
> {noformat}
> scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> df.write.mode("overwrite").format("avro").save("avro_file")
> df.write.mode("overwrite").format("avro").save("avro_file")
> java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps
>   at 
> org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120)
> ...
> scala> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43841) Non-existent column in projection of full outer join with USING results in StringIndexOutOfBoundsException

2023-05-28 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17726980#comment-17726980
 ] 

Bruce Robbins commented on SPARK-43841:
---

PR at https://github.com/apache/spark/pull/41353

> Non-existent column in projection of full outer join with USING results in 
> StringIndexOutOfBoundsException
> --
>
> Key: SPARK-43841
> URL: https://issues.apache.org/jira/browse/SPARK-43841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> The following query throws a {{StringIndexOutOfBoundsException}}:
> {noformat}
> with v1 as (
>  select * from values (1, 2) as (c1, c2)
> ),
> v2 as (
>   select * from values (2, 3) as (c1, c2)
> )
> select v1.c1, v1.c2, v2.c1, v2.c2, b
> from v1
> full outer join v2
> using (c1);
> {noformat}
> The query should fail anyway, since {{b}} refers to a non-existent column. 
> But it should fail with a helpful error message, not with a 
> {{StringIndexOutOfBoundsException}}.
> The issue seems to be in 
> {{StringUtils#orderSuggestedIdentifiersBySimilarity}}. 
> {{orderSuggestedIdentifiersBySimilarity}} assumes that a list of candidate 
> attributes with a mix of prefixes will never have an attribute name with an 
> empty prefix. But in this case it does ({{c1}} from the {{coalesce}} has no 
> prefix, since it is not associated with any relation or subquery):
> {noformat}
> +- 'Project [c1#5, c2#6, c1#7, c2#8, 'b]
>+- Project [coalesce(c1#5, c1#7) AS c1#9, c2#6, c2#8] <== c1#9 has no 
> prefix, unlike c2#6 (v1.c2) or c2#8 (v2.c2)
>   +- Join FullOuter, (c1#5 = c1#7)
>  :- SubqueryAlias v1
>  :  +- CTERelationRef 0, true, [c1#5, c2#6]
>  +- SubqueryAlias v2
> +- CTERelationRef 1, true, [c1#7, c2#8]
> {noformat}
> Because of this, {{orderSuggestedIdentifiersBySimilarity}} returns a sorted 
> list of suggestions like this:
> {noformat}
> ArrayBuffer(.c1, v1.c2, v2.c2)
> {noformat}
> {{UnresolvedAttribute.parseAttributeName}} chokes on an attribute name that 
> starts with a namespace separator ('.').



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError

2023-05-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-43843:
--
Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, Java 
11.0.12)

> Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError
> ---
>
> Key: SPARK-43843
> URL: https://issues.apache.org/jira/browse/SPARK-43843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, 
> Java 11.0.12)
>Reporter: Bruce Robbins
>Priority: Major
>
> I launched spark-shell as so:
> {noformat}
> bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | 
> grep -v test | head -1`
> {noformat}
> I got the below error trying to create an AVRO file:
> {noformat}
> scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> df.write.mode("overwrite").format("avro").save("avro_file")
> df.write.mode("overwrite").format("avro").save("avro_file")
> java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps
>   at 
> org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120)
> ...
> scala> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError

2023-05-28 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-43843:
-

 Summary: Saving an AVRO file with Scala 2.13 results in 
NoClassDefFoundError
 Key: SPARK-43843
 URL: https://issues.apache.org/jira/browse/SPARK-43843
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


I launched spark-shell as follows:
{noformat}
bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | 
grep -v test | head -1`
{noformat}
I got the following error when trying to create an AVRO file:
{noformat}
scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
val df = Seq((1, 2), (3, 4)).toDF("a", "b")
val df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> df.write.mode("overwrite").format("avro").save("avro_file")
df.write.mode("overwrite").format("avro").save("avro_file")
java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps
  at 
org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160)
  at 
org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75)
  at 
org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74)
  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
  at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
  at 
org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120)
...
scala> 
{noformat}
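A side note, not taken from the ticket: in Scala 2.12 this class lives at scala.collection.immutable.StringOps, while in 2.13 it moved to scala.collection.StringOps, so the error is consistent with 2.12-compiled bytecode being loaded on a 2.13 build. A quick Scala probe of the running classpath, usable from spark-shell:
{noformat}
// Returns true if the named class can be loaded from the current classpath
def hasClass(name: String): Boolean =
  try { Class.forName(name); true }
  catch { case _: ClassNotFoundException => false }

// On a Scala 2.13 build this prints false, then true
println(hasClass("scala.collection.immutable.StringOps"))
println(hasClass("scala.collection.StringOps"))
{noformat}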



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43841) Non-existent column in projection of full outer join with USING results in StringIndexOutOfBoundsException

2023-05-28 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-43841:
-

 Summary: Non-existent column in projection of full outer join with 
USING results in StringIndexOutOfBoundsException
 Key: SPARK-43841
 URL: https://issues.apache.org/jira/browse/SPARK-43841
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


The following query throws a {{StringIndexOutOfBoundsException}}:
{noformat}
with v1 as (
 select * from values (1, 2) as (c1, c2)
),
v2 as (
  select * from values (2, 3) as (c1, c2)
)
select v1.c1, v1.c2, v2.c1, v2.c2, b
from v1
full outer join v2
using (c1);
{noformat}
The query should fail anyway, since {{b}} refers to a non-existent column. But 
it should fail with a helpful error message, not with a 
{{StringIndexOutOfBoundsException}}.

The issue seems to be in {{StringUtils#orderSuggestedIdentifiersBySimilarity}}. 
{{orderSuggestedIdentifiersBySimilarity}} assumes that a list of candidate 
attributes with a mix of prefixes will never have an attribute name with an 
empty prefix. But in this case it does ({{c1}} from the {{coalesce}} has no 
prefix, since it is not associated with any relation or subquery):
{noformat}
+- 'Project [c1#5, c2#6, c1#7, c2#8, 'b]
   +- Project [coalesce(c1#5, c1#7) AS c1#9, c2#6, c2#8] <== c1#9 has no 
prefix, unlike c2#6 (v1.c2) or c2#8 (v2.c2)
  +- Join FullOuter, (c1#5 = c1#7)
 :- SubqueryAlias v1
 :  +- CTERelationRef 0, true, [c1#5, c2#6]
 +- SubqueryAlias v2
+- CTERelationRef 1, true, [c1#7, c2#8]
{noformat}
Because of this, {{orderSuggestedIdentifiersBySimilarity}} returns a sorted 
list of suggestions like this:
{noformat}
ArrayBuffer(.c1, v1.c2, v2.c2)
{noformat}
{{UnresolvedAttribute.parseAttributeName}} chokes on an attribute name that 
starts with a namespace separator ('.').
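To make the failure mode concrete, here is a small self-contained Scala sketch (not Spark's actual implementation; the object and names are made up for illustration) showing how a candidate with an empty qualifier turns into a suggestion that starts with '.', which a dot-delimited identifier parser cannot handle:
{noformat}
object SuggestionSketch {
  // Hypothetical candidates as (qualifier, columnName) pairs; the first has an
  // empty qualifier, like the coalesce output column c1#9 in the plan above.
  val candidates = Seq(("", "c1"), ("v1", "c2"), ("v2", "c2"))

  def main(args: Array[String]): Unit = {
    // Joining qualifier and name with '.' yields ".c1" for the empty qualifier
    val suggestions = candidates.map { case (q, n) => s"$q.$n" }
    println(suggestions) // List(.c1, v1.c2, v2.c2)

    // A naive dot-delimited parser then sees an empty first name part and fails,
    // analogous to the exception reported above
    suggestions.foreach { s =>
      val parts = s.split('.')
      if (parts.exists(_.isEmpty))
        println(s"cannot parse suggested identifier: $s")
    }
  }
}
{noformat}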




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability

2023-05-22 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725143#comment-17725143
 ] 

Bruce Robbins commented on SPARK-43718:
---

PR here: https://github.com/apache/spark/pull/41267

> References to a specific side's key in a USING join can have wrong nullability
> --
>
> Key: SPARK-43718
> URL: https://issues.apache.org/jira/browse/SPARK-43718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view t1 as values (1), (2), (3) as (c1);
> create or replace temp view t2 as values (2), (3), (4) as (c1);
> {noformat}
> The following query produces incorrect results:
> {noformat}
> spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
> from t1
> full outer join t2
> using (c1);
> 1
> -1  <== should be null
> 2
> 2
> 3
> 3
> -1  <== should be null
> 4
> Time taken: 0.663 seconds, Fetched 8 row(s)
> spark-sql (default)> 
> {noformat}
> Similar issues occur with right outer join and left outer join.
> {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
> resolved, so the array's {{containsNull}} value is incorrect.
> Queries that don't use arrays also can get wrong results. Assume this data:
> {noformat}
> create or replace temp view t1 as values (0), (1), (2) as (c1);
> create or replace temp view t2 as values (1), (2), (3) as (c1);
> create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b);
> {noformat}
> The following query produces incorrect results:
> {noformat}
> select t1.c1 as t1_c1, t2.c1 as t2_c1, b
> from t1
> full outer join t2
> using (c1),
> lateral (
>   select b
>   from t3
>   where a = coalesce(t2.c1, 1)
> ) lt3;
> 1     1     2
> NULL  3     4
> Time taken: 2.395 seconds, Fetched 2 row(s)
> spark-sql (default)> 
> {noformat}
> The result should be the following:
> {noformat}
> 0     NULL  2
> 1     1     2
> NULL  3     4
> {noformat}
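A spark-shell sketch (Scala, assuming a running session named {{spark}}) that makes the reported nullability problem visible by printing the schema the analyzer assigns to the array; for a full outer join, the element type should report containsNull = true:
{noformat}
spark.sql("create or replace temp view t1 as values (1), (2), (3) as (c1)")
spark.sql("create or replace temp view t2 as values (2), (3), (4) as (c1)")

val df = spark.sql(
  """select array(t1.c1, t2.c1) as arr
    |from t1
    |full outer join t2
    |using (c1)""".stripMargin)

// Both sides of a full outer join can produce NULL keys, so a correctly
// resolved plan should report containsNull = true for the array element type
df.schema.foreach(f => println(s"${f.name}: ${f.dataType}"))
{noformat}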



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability

2023-05-22 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-43718:
--
Description: 
Assume this data:
{noformat}
create or replace temp view t1 as values (1), (2), (3) as (c1);
create or replace temp view t2 as values (2), (3), (4) as (c1);
{noformat}
The following query produces incorrect results:
{noformat}
spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
from t1
full outer join t2
using (c1);
1
-1  <== should be null
2
2
3
3
-1  <== should be null
4
Time taken: 0.663 seconds, Fetched 8 row(s)
spark-sql (default)> 
{noformat}
Similar issues occur with right outer join and left outer join.

{{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
resolved, so the array's {{containsNull}} value is incorrect.

Queries that don't use arrays also can get wrong results. Assume this data:
{noformat}
create or replace temp view t1 as values (0), (1), (2) as (c1);
create or replace temp view t2 as values (1), (2), (3) as (c1);
create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b);
{noformat}
The following query produces incorrect results:
{noformat}
select t1.c1 as t1_c1, t2.c1 as t2_c1, b
from t1
full outer join t2
using (c1),
lateral (
  select b
  from t3
  where a = coalesce(t2.c1, 1)
) lt3;
1     1     2
NULL  3     4
Time taken: 2.395 seconds, Fetched 2 row(s)
spark-sql (default)> 
{noformat}
The result should be the following:
{noformat}
0     NULL  2
1     1     2
NULL  3     4
{noformat}



  was:
Assume this data:
{noformat}
create or replace temp view t1 as values (1), (2), (3) as (c1);
create or replace temp view t2 as values (2), (3), (4) as (c1);
{noformat}
The following query produces incorrect results:
{noformat}
spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
from t1
full outer join t2
using (c1);
1
-1  <== should be null
2
2
3
3
-1  <== should be null
4
Time taken: 0.663 seconds, Fetched 8 row(s)
spark-sql (default)> 
{noformat}
Similar issues occur with right outer join and left outer join.

{{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
resolved, so the array's {{containsNull}} value is incorrect.


> References to a specific side's key in a USING join can have wrong nullability
> --
>
> Key: SPARK-43718
> URL: https://issues.apache.org/jira/browse/SPARK-43718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view t1 as values (1), (2), (3) as (c1);
> create or replace temp view t2 as values (2), (3), (4) as (c1);
> {noformat}
> The following query produces incorrect results:
> {noformat}
> spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
> from t1
> full outer join t2
> using (c1);
> 1
> -1  <== should be null
> 2
> 2
> 3
> 3
> -1  <== should be null
> 4
> Time taken: 0.663 seconds, Fetched 8 row(s)
> spark-sql (default)> 
> {noformat}
> Similar issues occur with right outer join and left outer join.
> {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
> resolved, so the array's {{containsNull}} value is incorrect.
> Queries that don't use arrays also can get wrong results. Assume this data:
> {noformat}
> create or replace temp view t1 as values (0), (1), (2) as (c1);
> create or replace temp view t2 as values (1), (2), (3) as (c1);
> create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b);
> {noformat}
> The following query produces incorrect results:
> {noformat}
> select t1.c1 as t1_c1, t2.c1 as t2_c1, b
> from t1
> full outer join t2
> using (c1),
> lateral (
>   select b
>   from t3
>   where a = coalesce(t2.c1, 1)
> ) lt3;
> 1     1     2
> NULL  3     4
> Time taken: 2.395 seconds, Fetched 2 row(s)
> spark-sql (default)> 
> {noformat}
> The result should be the following:
> {noformat}
> 0     NULL  2
> 1     1     2
> NULL  3     4
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability

2023-05-22 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-43718:
--
Affects Version/s: 3.3.2

> References to a specific side's key in a USING join can have wrong nullability
> --
>
> Key: SPARK-43718
> URL: https://issues.apache.org/jira/browse/SPARK-43718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view t1 as values (1), (2), (3) as (c1);
> create or replace temp view t2 as values (2), (3), (4) as (c1);
> {noformat}
> The following query produces incorrect results:
> {noformat}
> spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
> from t1
> full outer join t2
> using (c1);
> 1
> -1  <== should be null
> 2
> 2
> 3
> 3
> -1  <== should be null
> 4
> Time taken: 0.663 seconds, Fetched 8 row(s)
> spark-sql (default)> 
> {noformat}
> Similar issues occur with right outer join and left outer join.
> {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
> resolved, so the array's {{containsNull}} value is incorrect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability

2023-05-22 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-43718:
--
Affects Version/s: 3.4.0

> References to a specific side's key in a USING join can have wrong nullability
> --
>
> Key: SPARK-43718
> URL: https://issues.apache.org/jira/browse/SPARK-43718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view t1 as values (1), (2), (3) as (c1);
> create or replace temp view t2 as values (2), (3), (4) as (c1);
> {noformat}
> The following query produces incorrect results:
> {noformat}
> spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
> from t1
> full outer join t2
> using (c1);
> 1
> -1  <== should be null
> 2
> 2
> 3
> 3
> -1  <== should be null
> 4
> Time taken: 0.663 seconds, Fetched 8 row(s)
> spark-sql (default)> 
> {noformat}
> Similar issues occur with right outer join and left outer join.
> {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
> resolved, so the array's {{containsNull}} value is incorrect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability

2023-05-22 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725122#comment-17725122
 ] 

Bruce Robbins commented on SPARK-43718:
---

I think I have a handle on this. I will submit a PR in the coming days.

> References to a specific side's key in a USING join can have wrong nullability
> --
>
> Key: SPARK-43718
> URL: https://issues.apache.org/jira/browse/SPARK-43718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view t1 as values (1), (2), (3) as (c1);
> create or replace temp view t2 as values (2), (3), (4) as (c1);
> {noformat}
> The following query produces incorrect results:
> {noformat}
> spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
> from t1
> full outer join t2
> using (c1);
> 1
> -1  <== should be null
> 2
> 2
> 3
> 3
> -1  <== should be null
> 4
> Time taken: 0.663 seconds, Fetched 8 row(s)
> spark-sql (default)> 
> {noformat}
> Similar issues occur with right outer join and left outer join.
> {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
> resolved, so the array's {{containsNull}} value is incorrect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability

2023-05-22 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-43718:
--
Labels: correctness  (was: )

> References to a specific side's key in a USING join can have wrong nullability
> --
>
> Key: SPARK-43718
> URL: https://issues.apache.org/jira/browse/SPARK-43718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view t1 as values (1), (2), (3) as (c1);
> create or replace temp view t2 as values (2), (3), (4) as (c1);
> {noformat}
> The following query produces incorrect results:
> {noformat}
> spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
> from t1
> full outer join t2
> using (c1);
> 1
> -1  <== should be null
> 2
> 2
> 3
> 3
> -1  <== should be null
> 4
> Time taken: 0.663 seconds, Fetched 8 row(s)
> spark-sql (default)> 
> {noformat}
> Similar issues occur with right outer join and left outer join.
> {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
> resolved, so the array's {{containsNull}} value is incorrect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability

2023-05-22 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-43718:
--
Description: 
Assume this data:
{noformat}
create or replace temp view t1 as values (1), (2), (3) as (c1);
create or replace temp view t2 as values (2), (3), (4) as (c1);
{noformat}
The following query produces incorrect results:
{noformat}
spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
from t1
full outer join t2
using (c1);
1
-1  <== should be null
2
2
3
3
-1  <== should be null
4
Time taken: 0.663 seconds, Fetched 8 row(s)
spark-sql (default)> 
{noformat}
Similar issues occur with right outer join and left outer join.

{{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
resolved, so the array's {{containsNull}} value is incorrect.

  was:
Assume this data:
{noformat}
create or replace temp view t1 as values (1), (2), (3) as (c1);
create or replace temp view t2 as values (2), (3), (4) as (c1);
{noformat}
The following query produces the wrong result:
{noformat}
spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
from t1
full outer join t2
using (c1);
1
-1  <== should be null
2
2
3
3
-1  <== should be null
4
Time taken: 0.663 seconds, Fetched 8 row(s)
spark-sql (default)> 
{noformat}
Similar issues occur with right outer join and left outer join.

{{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
resolved, so the array's {{containsNull}} value is incorrect.


> References to a specific side's key in a USING join can have wrong nullability
> --
>
> Key: SPARK-43718
> URL: https://issues.apache.org/jira/browse/SPARK-43718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Assume this data:
> {noformat}
> create or replace temp view t1 as values (1), (2), (3) as (c1);
> create or replace temp view t2 as values (2), (3), (4) as (c1);
> {noformat}
> The following query produces incorrect results:
> {noformat}
> spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
> from t1
> full outer join t2
> using (c1);
> 1
> -1  <== should be null
> 2
> 2
> 3
> 3
> -1  <== should be null
> 4
> Time taken: 0.663 seconds, Fetched 8 row(s)
> spark-sql (default)> 
> {noformat}
> Similar issues occur with right outer join and left outer join.
> {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
> resolved, so the array's {{containsNull}} value is incorrect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability

2023-05-22 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-43718:
--
Description: 
Assume this data:
{noformat}
create or replace temp view t1 as values (1), (2), (3) as (c1);
create or replace temp view t2 as values (2), (3), (4) as (c1);
{noformat}
The following query produces the wrong result:
{noformat}
spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
from t1
full outer join t2
using (c1);
1
-1  <== should be null
2
2
3
3
-1  <== should be null
4
Time taken: 0.663 seconds, Fetched 8 row(s)
spark-sql (default)> 
{noformat}
Similar issues occur with right outer join and left outer join.

{{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
resolved, so the array's {{containsNull}} value is incorrect.

  was:
Assume this data:
{noformat}
create or replace temp view t1 as values (1), (2), (3) as (c1);
create or replace temp view t2 as values (2), (3), (4) as (c1);
{noformat}
The following query produces the wrong result:
{noformat}
spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
from t1
full outer join t2
using (c1);
1
-1  <== should be null
2
2
3
3
-1  <== should be null
4
Time taken: 0.663 seconds, Fetched 8 row(s)
spark-sql (default)> 
{noformat}
Similar issues occur with right outer join and left outer join.



> References to a specific side's key in a USING join can have wrong nullability
> --
>
> Key: SPARK-43718
> URL: https://issues.apache.org/jira/browse/SPARK-43718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Assume this data:
> {noformat}
> create or replace temp view t1 as values (1), (2), (3) as (c1);
> create or replace temp view t2 as values (2), (3), (4) as (c1);
> {noformat}
> The following query produces the wrong result:
> {noformat}
> spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
> from t1
> full outer join t2
> using (c1);
> 1
> -1  <== should be null
> 2
> 2
> 3
> 3
> -1  <== should be null
> 4
> Time taken: 0.663 seconds, Fetched 8 row(s)
> spark-sql (default)> 
> {noformat}
> Similar issues occur with right outer join and left outer join.
> {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
> resolved, so the array's {{containsNull}} value is incorrect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability

2023-05-22 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-43718:
-

 Summary: References to a specific side's key in a USING join can 
have wrong nullability
 Key: SPARK-43718
 URL: https://issues.apache.org/jira/browse/SPARK-43718
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


Assume this data:
{noformat}
create or replace temp view t1 as values (1), (2), (3) as (c1);
create or replace temp view t2 as values (2), (3), (4) as (c1);
{noformat}
The following query produces the wrong result:
{noformat}
spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
from t1
full outer join t2
using (c1);
1
-1  <== should be null
2
2
3
3
-1  <== should be null
4
Time taken: 0.663 seconds, Fetched 8 row(s)
spark-sql (default)> 
{noformat}
Similar issues occur with right outer join and left outer join.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43149) When CREATE USING fails to store metadata in metastore, data gets left around

2023-04-14 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-43149:
-

 Summary: When CREATE USING fails to store metadata in metastore, 
data gets left around
 Key: SPARK-43149
 URL: https://issues.apache.org/jira/browse/SPARK-43149
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


For example:
{noformat}
drop table if exists parquet_ds1;

-- try creating table with invalid column name
-- use 'using parquet' to designate the data source
create table parquet_ds1 using parquet as
select id, date'2018-01-01' + make_dt_interval(0, id)
from range(0, 10);

Cannot create a table having a column whose name contains commas in Hive 
metastore. Table: `spark_catalog`.`default`.`parquet_ds1`; Column: DATE 
'2018-01-01' + make_dt_interval(0, id, 0, 0.00)

-- show that table did not get created
show tables;


-- try again with valid column name
-- spark will complain that directory already exists
create table parquet_ds1 using parquet as
select id, date'2018-01-01' + make_dt_interval(0, id) as ts
from range(0, 10);

[LOCATION_ALREADY_EXISTS] Cannot name the managed table as 
`spark_catalog`.`default`.`parquet_ds1`, as its associated location 
'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already 
exists. Please pick a different table name, or remove the existing location 
first.
org.apache.spark.SparkRuntimeException: [LOCATION_ALREADY_EXISTS] Cannot name 
the managed table as `spark_catalog`.`default`.`parquet_ds1`, as its associated 
location 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' 
already exists. Please pick a different table name, or remove the existing 
location first.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.locationAlreadyExists(QueryExecutionErrors.scala:2804)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:414)
at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
...
{noformat}
One must manually remove the directory {{spark-warehouse/parquet_ds1}} before 
the {{create table}} command will succeed.

It seems that datasource table creation runs the data-creation job first, then 
stores the metadata into the metastore.

When using Spark to create Hive tables, the issue does not happen:
{noformat}
drop table if exists parquet_hive1;

-- try creating table with invalid column name,
-- but use 'stored as parquet' instead of 'using'
create table parquet_hive1 stored as parquet as
select id, date'2018-01-01' + make_dt_interval(0, id)
from range(0, 10);

Cannot create a table having a column whose name contains commas in Hive 
metastore. Table: `spark_catalog`.`default`.`parquet_hive1`; Column: DATE 
'2018-01-01' + make_dt_interval(0, id, 0, 0.00)

-- try again with valid column name. This will succeed;
create table parquet_hive1 stored as parquet as
select id, date'2018-01-01' + make_dt_interval(0, id) as ts
from range(0, 10);
{noformat}

It seems that Hive table creation stores metadata into the metastore first, 
then runs the data-creation job.
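A possible workaround sketch, not part of the ticket: check for and remove the orphaned directory from spark-shell (Scala) using the Hadoop FileSystem API. The location is an assumption, the default warehouse directory plus the table name:
{noformat}
import org.apache.hadoop.fs.Path

val warehouseDir = spark.conf.get("spark.sql.warehouse.dir")
val leftover = new Path(warehouseDir, "parquet_ds1")
val fs = leftover.getFileSystem(spark.sparkContext.hadoopConfiguration)

if (fs.exists(leftover)) {
  println(s"orphaned table directory still present: $leftover")
  fs.delete(leftover, true) // recursive delete so a retried CTAS can succeed
}
{noformat}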




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43149) When CTAS with USING fails to store metadata in metastore, data gets left around

2023-04-14 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-43149:
--
Summary: When CTAS with USING fails to store metadata in metastore, data 
gets left around  (was: When CREATE USING fails to store metadata in metastore, 
data gets left around)

> When CTAS with USING fails to store metadata in metastore, data gets left 
> around
> 
>
> Key: SPARK-43149
> URL: https://issues.apache.org/jira/browse/SPARK-43149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> For example:
> {noformat}
> drop table if exists parquet_ds1;
> -- try creating table with invalid column name
> -- use 'using parquet' to designate the data source
> create table parquet_ds1 using parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id)
> from range(0, 10);
> Cannot create a table having a column whose name contains commas in Hive 
> metastore. Table: `spark_catalog`.`default`.`parquet_ds1`; Column: DATE 
> '2018-01-01' + make_dt_interval(0, id, 0, 0.00)
> -- show that table did not get created
> show tables;
> -- try again with valid column name
> -- spark will complain that directory already exists
> create table parquet_ds1 using parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id) as ts
> from range(0, 10);
> [LOCATION_ALREADY_EXISTS] Cannot name the managed table as 
> `spark_catalog`.`default`.`parquet_ds1`, as its associated location 
> 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already 
> exists. Please pick a different table name, or remove the existing location 
> first.
> org.apache.spark.SparkRuntimeException: [LOCATION_ALREADY_EXISTS] Cannot name 
> the managed table as `spark_catalog`.`default`.`parquet_ds1`, as its 
> associated location 
> 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already 
> exists. Please pick a different table name, or remove the existing location 
> first.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.locationAlreadyExists(QueryExecutionErrors.scala:2804)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:414)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
> ...
> {noformat}
> One must manually remove the directory {{spark-warehouse/parquet_ds1}} before 
> the {{create table}} command will succeed.
> It seems that datasource table creation runs the data-creation job first, 
> then stores the metadata into the metastore.
> When using Spark to create Hive tables, the issue does not happen:
> {noformat}
> drop table if exists parquet_hive1;
> -- try creating table with invalid column name,
> -- but use 'stored as parquet' instead of 'using'
> create table parquet_hive1 stored as parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id)
> from range(0, 10);
> Cannot create a table having a column whose name contains commas in Hive 
> metastore. Table: `spark_catalog`.`default`.`parquet_hive1`; Column: DATE 
> '2018-01-01' + make_dt_interval(0, id, 0, 0.00)
> -- try again with valid column name. This will succeed;
> create table parquet_hive1 stored as parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id) as ts
> from range(0, 10);
> {noformat}
> It seems that Hive table creation stores metadata into the metastore first, 
> then runs the data-creation job.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43113) Codegen error when full outer join's bound condition has multiple references to the same stream-side column

2023-04-14 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711614#comment-17711614
 ] 

Bruce Robbins edited comment on SPARK-43113 at 4/14/23 6:02 AM:


PR here: https://github.com/apache/spark/pull/40766


was (Author: bersprockets):
PR here: https://github.com/apache/spark/pull/40766/files

> Codegen error when full outer join's bound condition has multiple references 
> to the same stream-side column
> ---
>
> Key: SPARK-43113
> URL: https://issues.apache.org/jira/browse/SPARK-43113
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Example #1 (sort merge join):
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1),
> (2, 2),
> (3, 1)
> as v1(key, value);
> create or replace temp view v2 as
> select * from values
> (1, 22, 22),
> (3, -1, -1),
> (7, null, null)
> as v2(a, b, c);
> select *
> from v1
> full outer join v2
> on key = a
> and value > b
> and value > c;
> {noformat}
> The join's generated code causes the following compilation error:
> {noformat}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 277, Column 9: Redefinition of local variable "smj_isNull_7"
> {noformat}
> Example #2 (shuffle hash join):
> {noformat}
> select /*+ SHUFFLE_HASH(v2) */ *
> from v1
> full outer join v2
> on key = a
> and value > b
> and value > c;
> {noformat}
> The shuffle hash join's generated code causes the following compilation error:
> {noformat}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 174, Column 5: Redefinition of local variable "shj_value_1" 
> {noformat}
> With default configuration, both queries end up succeeding, since Spark falls 
> back to running each query with whole-stage codegen disabled.
> The issue happens only when the join's bound condition refers to the same 
> stream-side column more than once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43113) Codegen error when full outer join's bound condition has multiple references to the same stream-side column

2023-04-12 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711614#comment-17711614
 ] 

Bruce Robbins commented on SPARK-43113:
---

PR here: https://github.com/apache/spark/pull/40766/files

> Codegen error when full outer join's bound condition has multiple references 
> to the same stream-side column
> ---
>
> Key: SPARK-43113
> URL: https://issues.apache.org/jira/browse/SPARK-43113
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Example #1 (sort merge join):
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1),
> (2, 2),
> (3, 1)
> as v1(key, value);
> create or replace temp view v2 as
> select * from values
> (1, 22, 22),
> (3, -1, -1),
> (7, null, null)
> as v2(a, b, c);
> select *
> from v1
> full outer join v2
> on key = a
> and value > b
> and value > c;
> {noformat}
> The join's generated code causes the following compilation error:
> {noformat}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 277, Column 9: Redefinition of local variable "smj_isNull_7"
> {noformat}
> Example #2 (shuffle hash join):
> {noformat}
> select /*+ SHUFFLE_HASH(v2) */ *
> from v1
> full outer join v2
> on key = a
> and value > b
> and value > c;
> {noformat}
> The shuffle hash join's generated code causes the following compilation error:
> {noformat}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 174, Column 5: Redefinition of local variable "shj_value_1" 
> {noformat}
> With default configuration, both queries end up succeeding, since Spark falls 
> back to running each query with whole-stage codegen disabled.
> The issue happens only when the join's bound condition refers to the same 
> stream-side column more than once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43113) Codegen error when full outer join's bound condition has multiple references to the same stream-side column

2023-04-12 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-43113:
-

 Summary: Codegen error when full outer join's bound condition has 
multiple references to the same stream-side column
 Key: SPARK-43113
 URL: https://issues.apache.org/jira/browse/SPARK-43113
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.2, 3.4.0, 3.5.0
Reporter: Bruce Robbins


Example #1 (sort merge join):
{noformat}
create or replace temp view v1 as
select * from values
(1, 1),
(2, 2),
(3, 1)
as v1(key, value);

create or replace temp view v2 as
select * from values
(1, 22, 22),
(3, -1, -1),
(7, null, null)
as v2(a, b, c);

select *
from v1
full outer join v2
on key = a
and value > b
and value > c;
{noformat}
The join's generated code causes the following compilation error:
{noformat}
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
277, Column 9: Redefinition of local variable "smj_isNull_7"
{noformat}
Example #2 (shuffle hash join):
{noformat}
select /*+ SHUFFLE_HASH(v2) */ *
from v1
full outer join v2
on key = a
and value > b
and value > c;
{noformat}
The shuffle hash join's generated code causes the following compilation error:
{noformat}
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
174, Column 5: Redefinition of local variable "shj_value_1" 
{noformat}
With default configuration, both queries end up succeeding, since Spark falls 
back to running each query with whole-stage codegen disabled.

The issue happens only when the join's bound condition refers to the same 
stream-side column more than once.
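For anyone reproducing this, the following spark-shell sketch (Scala, assuming the v1/v2 views above are already defined in the session) dumps the generated Java source so the duplicated local variable can be inspected:
{noformat}
import org.apache.spark.sql.execution.debug._

val q = spark.sql(
  """select *
    |from v1
    |full outer join v2
    |on key = a
    |and value > b
    |and value > c""".stripMargin)

// Prints the whole-stage generated Java source, where the duplicated local
// variable (e.g. smj_isNull_7) can be inspected
q.debugCodegen()

// The query still returns correct rows because Spark falls back to
// non-codegen execution after the compilation error
q.show()
{noformat}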



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled

2023-03-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705702#comment-17705702
 ] 

Bruce Robbins commented on SPARK-42937:
---

PR at https://github.com/apache/spark/pull/40569

> Join with subquery in condition can fail with wholestage codegen and adaptive 
> execution disabled
> 
>
> Key: SPARK-42937
> URL: https://issues.apache.org/jira/browse/SPARK-42937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The below left outer join gets an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
> (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
> (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
> as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, 
> value9, value10);
> create or replace temp view v2 as
> select * from values
> (1, 2),
> (3, 8),
> (7, 9)
> as v2(a, b);
> create or replace temp view v3 as
> select * from values
> (3),
> (8)
> as v3(col1);
> set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
> set spark.sql.adaptive.enabled=false;
> select *
> from v1
> left outer join v2
> on key = a
> and key in (select col1 from v3);
> {noformat}
> The join fails during predicate codegen:
> {noformat}
> 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to 
> interpreter mode
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
>   at scala.collection.immutable.List.map(List.scala:293)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
> {noformat}
> It fails again after fallback to interpreter mode:
> {noformat}
> 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
> {noformat}
> Both the predicate codegen and the evaluation fail for the same reason: 
> {{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. 
> The driver waits for the subquery to finish, but it's the executor that uses 
> the results of the subquery (for predicate codegen or evaluation). Because 
> {{shouldBroadcast}} is set to false, the result is stored in a transient field 
> ({{InSubqueryExec#result}}), so the result of the subquery is not serialized 
> when the {{InSubqueryExec}} instance is sent to the executor.

[jira] [Updated] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled

2023-03-27 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-42937:
--
Affects Version/s: 3.4.0

> Join with subquery in condition can fail with wholestage codegen and adaptive 
> execution disabled
> 
>
> Key: SPARK-42937
> URL: https://issues.apache.org/jira/browse/SPARK-42937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The below left outer join gets an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
> (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
> (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
> as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, 
> value9, value10);
> create or replace temp view v2 as
> select * from values
> (1, 2),
> (3, 8),
> (7, 9)
> as v2(a, b);
> create or replace temp view v3 as
> select * from values
> (3),
> (8)
> as v3(col1);
> set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
> set spark.sql.adaptive.enabled=false;
> select *
> from v1
> left outer join v2
> on key = a
> and key in (select col1 from v3);
> {noformat}
> The join fails during predicate codegen:
> {noformat}
> 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to 
> interpreter mode
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
>   at scala.collection.immutable.List.map(List.scala:293)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
> {noformat}
> It fails again after fallback to interpreter mode:
> {noformat}
> 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
> {noformat}
> Both the predicate codegen and the evaluation fail for the same reason: 
> {{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. 
> The driver waits for the subquery to finish, but it's the executor that uses 
> the results of the subquery (for predicate codegen or evaluation). Because 
> {{shouldBroadcast}} is set to false, the result is stored in a transient field 
> ({{InSubqueryExec#result}}), so the result of the subquery is not serialized 
> when the {{InSubqueryExec}} instance is sent to the executor.

[jira] [Updated] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled

2023-03-27 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-42937:
--
Affects Version/s: 3.3.2

> Join with subquery in condition can fail with wholestage codegen and adaptive 
> execution disabled
> 
>
> Key: SPARK-42937
> URL: https://issues.apache.org/jira/browse/SPARK-42937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The below left outer join gets an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
> (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
> (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
> as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, 
> value9, value10);
> create or replace temp view v2 as
> select * from values
> (1, 2),
> (3, 8),
> (7, 9)
> as v2(a, b);
> create or replace temp view v3 as
> select * from values
> (3),
> (8)
> as v3(col1);
> set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
> set spark.sql.adaptive.enabled=false;
> select *
> from v1
> left outer join v2
> on key = a
> and key in (select col1 from v3);
> {noformat}
> The join fails during predicate codegen:
> {noformat}
> 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to 
> interpreter mode
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
>   at scala.collection.immutable.List.map(List.scala:293)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
> {noformat}
> It fails again after fallback to interpreter mode:
> {noformat}
> 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
> {noformat}
> Both the predicate codegen and the evaluation fail for the same reason: 
> {{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. 
> The driver waits for the subquery to finish, but it's the executor that uses 
> the results of the subquery (for predicate codegen or evaluation). Because 
> {{shouldBroadcast}} is set to false, the result is stored in a transient field 
> ({{InSubqueryExec#result}}), so the result of the subquery is not serialized 
> when the {{InSubqueryExec}} instance is sent to the executor.

[jira] [Created] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled

2023-03-27 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-42937:
-

 Summary: Join with subquery in condition can fail with wholestage 
codegen and adaptive execution disabled
 Key: SPARK-42937
 URL: https://issues.apache.org/jira/browse/SPARK-42937
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


The below left outer join gets an error:
{noformat}
create or replace temp view v1 as
select * from values
(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
(3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, 
value9, value10);

create or replace temp view v2 as
select * from values
(1, 2),
(3, 8),
(7, 9)
as v2(a, b);

create or replace temp view v3 as
select * from values
(3),
(8)
as v3(col1);

set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
set spark.sql.adaptive.enabled=false;

select *
from v1
left outer join v2
on key = a
and key in (select col1 from v3);
{noformat}
The join fails during predicate codegen:
{noformat}
23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to 
interpreter mode
java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN 
subquery#34 has not finished
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
at 
org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
at 
org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
at scala.collection.immutable.List.map(List.scala:293)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
at 
org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
at 
org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
at 
org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
at 
org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
at 
org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
at 
org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
{noformat}
It fails again after fallback to interpreter mode:
{noformat}
23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN 
subquery#34 has not finished
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
at 
org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
{noformat}
Both the predicate codegen and the evaluation fail for the same reason: 
{{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. 
The driver waits for the subquery to finish, but it's the executor that uses 
the results of the subquery (for predicate codegen or evaluation). Because 
{{shouldBroadcast}} is set to false, the result is stored in a transient field 
({{InSubqueryExec#result}}), so the result of the subquery is not serialized 
when the {{InSubqueryExec}} instance is sent to the executor.

When wholestage codegen is enabled, the predicate codegen happens on the 
driver, so the subquery's result is available. When adaptive execution is 
enabled, {{PlanAdaptiveSubqueries}} always sets {{shouldBroadcast=true}}, so 
the subquery result is broadcast and is therefore available on the executor, and 
the problem does not occur.
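
To illustrate why the transient field matters, here is a minimal, self-contained 
Scala sketch (a hypothetical {{ResultHolder}} class, not Spark's actual 
{{InSubqueryExec}}): a field marked {{@transient}} is dropped during Java 
serialization, so the executor-side copy never sees the result the driver waited for.
{noformat}
import java.io._

// Minimal sketch (not Spark's actual classes): a @transient field does not survive
// Java serialization, which is why an executor-side copy would see an unfinished
// subquery when shouldBroadcast=false.
case class ResultHolder(exprId: Long, @transient result: Array[Int])

val onDriver = ResultHolder(34L, Array(1, 3))

// Round-trip through Java serialization, roughly what happens when the physical
// plan fragment is sent from the driver to an executor.
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(onDriver)
oos.close()
val onExecutor = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
  .readObject().asInstanceOf[ResultHolder]

assert(onDriver.result != null)   // populated once the driver has waited for the subquery
assert(onExecutor.result == null) // transient field dropped: "has not finished" on the executor
{noformat}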

[jira] [Commented] (SPARK-42909) INSERT INTO with column list does not work

2023-03-23 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17704368#comment-17704368
 ] 

Bruce Robbins commented on SPARK-42909:
---

It looks like this capability landed in 3.4/3.5 with SPARK-42521.
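
For reference, a quick way to check this on a 3.4+ build is a spark-shell sketch 
along the following lines (a hypothetical parquet table in the built-in catalog, 
rather than the Delta table from the report):
{noformat}
// Hypothetical verification on Spark 3.4+ (SPARK-42521): columns omitted from an
// explicit INSERT column list should be filled with NULL (or the column default).
spark.sql("create table tvtest (col1 int, col2 int) using parquet")
spark.sql("insert into tvtest (col1) values (3)")
spark.sql("select * from tvtest").show()
// Expected on 3.4+: one row (3, null); on 3.3.x this INSERT fails as described below.
{noformat}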

> INSERT INTO with column list does not work
> --
>
> Key: SPARK-42909
> URL: https://issues.apache.org/jira/browse/SPARK-42909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
> Environment: Databricks DBR12.2 on Azure, running Spark 3.3.2
> Documentation: [INSERT - Azure Databricks - Databricks SQL | Microsoft 
> Learn|https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-syntax-dml-insert-into]
>Reporter: Tjomme Vergauwen
>Priority: Major
>  Labels: databricks, documentation, spark-sql, sql
>
> Hi,
> When performing an INSERT INTO with an explicit but incomplete column list, the 
> missing columns should get a NULL value. However, an error is thrown 
> indicating that the column is missing.
> *Case simulation:*
> drop table if exists default.TVTest;
> create table default.TVTest
> ( col1 int NOT NULL
> , col2 int
> );
> insert into default.TVTest select 1,2;
> insert into default.TVTest select 2,NULL; --> col2 can contain NULL values
> insert into default.TVTest (col1) select 3; -- Error in SQL statement: 
> DeltaAnalysisException: Column col2 is not specified in INSERT
> insert into default.TVTest (col1) VALUES (3); -- Error in SQL statement: 
> DeltaAnalysisException: Column col2 is not specified in INSERT
> select * from default.TVTest;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append

2023-02-14 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688759#comment-17688759
 ] 

Bruce Robbins commented on SPARK-42401:
---

There is another case:
{noformat}
spark-sql> select array_insert(array('1', '2', '3', '4'), -6, '5');
23/02/14 16:10:19 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{noformat}
{{array_insert}} might implicitly add nulls, and my fix does not cover that 
case. I will follow up.
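
For completeness, the additional case can be reproduced from spark-shell as well 
(a minimal sketch; on an affected build it is expected to fail the same way):
{noformat}
// Negative index beyond the array bounds: array_insert may implicitly add nulls here,
// which the pending fix does not yet cover, so affected builds hit the
// NullPointerException from UnsafeWriter.write shown above.
spark.sql("select array_insert(array('1', '2', '3', '4'), -6, '5')").show()
{noformat}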

> Incorrect results or NPE when inserting null value into array using 
> array_insert/array_append
> -
>
> Key: SPARK-42401
> URL: https://issues.apache.org/jira/browse/SPARK-42401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.4.0
>
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array(1, 2, 3, 4), 5, 5),
> (array(1, 2, 3, 4), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> This produces an incorrect result:
> {noformat}
> [1,2,3,4,5]
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> A more succinct example:
> {noformat}
> select array_insert(array(1, 2, 3, 4), 5, cast(null as int));
> {noformat}
> This also produces an incorrect result:
> {noformat}
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> Another example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array('1', '2', '3', '4'), 5, '5'),
> (array('1', '2', '3', '4'), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> The above query throws a {{NullPointerException}}:
> {noformat}
> 23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, 
> col2, col3) from v1]
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
> {noformat}
> {{array_append}} has the same issue:
> {noformat}
> spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int));
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> Time taken: 3.679 seconds, Fetched 1 row(s)
> spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as 
> string));
> 23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append

2023-02-12 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-42401:
--
Summary: Incorrect results or NPE when inserting null value into array 
using array_insert/array_append  (was: Incorrect results or NPE when inserting 
null value using array_insert/array_append)

> Incorrect results or NPE when inserting null value into array using 
> array_insert/array_append
> -
>
> Key: SPARK-42401
> URL: https://issues.apache.org/jira/browse/SPARK-42401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array(1, 2, 3, 4), 5, 5),
> (array(1, 2, 3, 4), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> This produces an incorrect result:
> {noformat}
> [1,2,3,4,5]
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> A more succinct example:
> {noformat}
> select array_insert(array(1, 2, 3, 4), 5, cast(null as int));
> {noformat}
> This also produces an incorrect result:
> {noformat}
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> Another example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array('1', '2', '3', '4'), 5, '5'),
> (array('1', '2', '3', '4'), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> The above query throws a {{NullPointerException}}:
> {noformat}
> 23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, 
> col2, col3) from v1]
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
> {noformat}
> {{array_append}} has the same issue:
> {noformat}
> spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int));
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> Time taken: 3.679 seconds, Fetched 1 row(s)
> spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as 
> string));
> 23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42401) Incorrect results or NPE when inserting null value using array_insert/array_append

2023-02-10 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-42401:
--
Labels: correctness  (was: )

> Incorrect results or NPE when inserting null value using 
> array_insert/array_append
> --
>
> Key: SPARK-42401
> URL: https://issues.apache.org/jira/browse/SPARK-42401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array(1, 2, 3, 4), 5, 5),
> (array(1, 2, 3, 4), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> This produces an incorrect result:
> {noformat}
> [1,2,3,4,5]
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> A more succinct example:
> {noformat}
> select array_insert(array(1, 2, 3, 4), 5, cast(null as int));
> {noformat}
> This also produces an incorrect result:
> {noformat}
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> Another example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array('1', '2', '3', '4'), 5, '5'),
> (array('1', '2', '3', '4'), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> The above query throws a {{NullPointerException}}:
> {noformat}
> 23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, 
> col2, col3) from v1]
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
> {noformat}
> {{array_append}} has the same issue:
> {noformat}
> spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int));
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> Time taken: 3.679 seconds, Fetched 1 row(s)
> spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as 
> string));
> 23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42401) Incorrect results or NPE when inserting null value using array_insert/array_append

2023-02-10 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-42401:
-

 Summary: Incorrect results or NPE when inserting null value using 
array_insert/array_append
 Key: SPARK-42401
 URL: https://issues.apache.org/jira/browse/SPARK-42401
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0, 3.5.0
Reporter: Bruce Robbins


Example:
{noformat}
create or replace temp view v1 as
select * from values
(array(1, 2, 3, 4), 5, 5),
(array(1, 2, 3, 4), 5, null)
as v1(col1,col2,col3);

select array_insert(col1, col2, col3) from v1;
{noformat}
This produces an incorrect result:
{noformat}
[1,2,3,4,5]
[1,2,3,4,0] <== should be [1,2,3,4,null]
{noformat}
A more succinct example:
{noformat}
select array_insert(array(1, 2, 3, 4), 5, cast(null as int));
{noformat}
This also produces an incorrect result:
{noformat}
[1,2,3,4,0] <== should be [1,2,3,4,null]
{noformat}
Another example:
{noformat}
create or replace temp view v1 as
select * from values
(array('1', '2', '3', '4'), 5, '5'),
(array('1', '2', '3', '4'), 5, null)
as v1(col1,col2,col3);

select array_insert(col1, col2, col3) from v1;
{noformat}
The above query throws a {{NullPointerException}}:
{noformat}
23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, 
col2, col3) from v1]
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
{noformat}
{{array_append}} has the same issue:
{noformat}
spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int));
[1,2,3,4,0] <== should be [1,2,3,4,null]
Time taken: 3.679 seconds, Fetched 1 row(s)
spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as string));
23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
{noformat}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42384) Mask function's generated code does not handle null input

2023-02-08 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-42384:
--
Affects Version/s: 3.4.0

> Mask function's generated code does not handle null input
> -
>
> Key: SPARK-42384
> URL: https://issues.apache.org/jira/browse/SPARK-42384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (null),
> ('AbCD123-@$#')
> as data(col1);
> cache table v1;
> select mask(col1) from v1;
> {noformat}
> This query results in a {{NullPointerException}}:
> {noformat}
> 23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
> {noformat}
> The generated code calls {{UnsafeWriter.write(0, value_0)}} regardless of 
> whether {{Mask.transformInput}} returns null or not. The 
> {{UnsafeWriter.write}} method for {{UTF8String}} does not expect a null 
> pointer.
> {noformat}
> /* 031 */ boolean isNull_1 = i.isNullAt(0);
> /* 032 */ UTF8String value_1 = isNull_1 ?
> /* 033 */ null : (i.getUTF8String(0));
> /* 034 */
> /* 035 */
> /* 036 */
> /* 037 */
> /* 038 */ UTF8String value_0 = null;
> /* 039 */ value_0 = 
> org.apache.spark.sql.catalyst.expressions.Mask.transformInput(value_1, 
> ((UTF8String) references[0] /* literal */), ((UTF8String) references[1] /* 
> literal */), ((UTF8String) references[2] /* literal */), ((UTF8String) 
> references[3] /* literal */));;
> /* 040 */ if (false) {
> /* 041 */   mutableStateArray_0[0].setNullAt(0);
> /* 042 */ } else {
> /* 043 */   mutableStateArray_0[0].write(0, value_0);
> /* 044 */ }
> /* 045 */ return (mutableStateArray_0[0].getRow());
> /* 046 */   }
> {noformat}
> The bug is not exercised by a literal null input value, since there appears 
> to be some optimization that simply replaces the entire function call with a 
> null literal:
> {noformat}
> spark-sql> explain SELECT mask(NULL);
> == Physical Plan ==
> *(1) Project [null AS mask(NULL, X, x, n, NULL)#47]
> +- *(1) Scan OneRowRelation[]
> Time taken: 0.026 seconds, Fetched 1 row(s)
> spark-sql> SELECT mask(NULL);
> NULL
> Time taken: 0.042 seconds, Fetched 1 row(s)
> spark-sql> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42384) Mask function's generated code does not handle null input

2023-02-08 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-42384:
-

 Summary: Mask function's generated code does not handle null input
 Key: SPARK-42384
 URL: https://issues.apache.org/jira/browse/SPARK-42384
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


Example:
{noformat}
create or replace temp view v1 as
select * from values
(null),
('AbCD123-@$#')
as data(col1);

cache table v1;

select mask(col1) from v1;
{noformat}
This query results in a {{NullPointerException}}:
{noformat}
23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
{noformat}
The generated code calls {{UnsafeWriter.write(0, value_0)}} regardless of 
whether {{Mask.transformInput}} returns null or not. The {{UnsafeWriter.write}} 
method for {{UTF8String}} does not expect a null pointer.
{noformat}
/* 031 */ boolean isNull_1 = i.isNullAt(0);
/* 032 */ UTF8String value_1 = isNull_1 ?
/* 033 */ null : (i.getUTF8String(0));
/* 034 */
/* 035 */
/* 036 */
/* 037 */
/* 038 */ UTF8String value_0 = null;
/* 039 */ value_0 = 
org.apache.spark.sql.catalyst.expressions.Mask.transformInput(value_1, 
((UTF8String) references[0] /* literal */), ((UTF8String) references[1] /* 
literal */), ((UTF8String) references[2] /* literal */), ((UTF8String) 
references[3] /* literal */));;
/* 040 */ if (false) {
/* 041 */   mutableStateArray_0[0].setNullAt(0);
/* 042 */ } else {
/* 043 */   mutableStateArray_0[0].write(0, value_0);
/* 044 */ }
/* 045 */ return (mutableStateArray_0[0].getRow());
/* 046 */   }
{noformat}

The bug is not exercised by a literal null input value, since there appears to 
be some optimization that simply replaces the entire function call with a null 
literal:
{noformat}
spark-sql> explain SELECT mask(NULL);
== Physical Plan ==
*(1) Project [null AS mask(NULL, X, x, n, NULL)#47]
+- *(1) Scan OneRowRelation[]

Time taken: 0.026 seconds, Fetched 1 row(s)
spark-sql> SELECT mask(NULL);
NULL
Time taken: 0.042 seconds, Fetched 1 row(s)
spark-sql> 
{noformat}
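
The same failure can be reproduced from spark-shell with a DataFrame-backed view 
(a sketch that mirrors the SQL reproduction above; assumes an affected 3.4/3.5 build):
{noformat}
import spark.implicits._

// Mirrors the SQL reproduction: one null and one non-null string, cached as above so
// the query hits the same wholestage-codegen path.
Seq(Option.empty[String], Some("AbCD123-@$#")).toDF("col1").createOrReplaceTempView("v1")
spark.sql("cache table v1")
spark.sql("select mask(col1) from v1").show()
// Expected on affected builds: the NullPointerException from UnsafeWriter.write shown above.
{noformat}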




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41991) Interpreted mode subexpression elimination can throw exception during insert

2023-01-11 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-41991:
--
Affects Version/s: 3.3.1

> Interpreted mode subexpression elimination can throw exception during insert
> 
>
> Key: SPARK-41991
> URL: https://issues.apache.org/jira/browse/SPARK-41991
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1, 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Example:
> {noformat}
> drop table if exists tbl1;
> create table tbl1 (a int, b int) using parquet;
> set spark.sql.codegen.wholeStage=false;
> set spark.sql.codegen.factoryMode=NO_CODEGEN;
> insert into tbl1
> select id as a, id as b
> from range(1, 5);
> {noformat}
> This results in the following exception:
> {noformat}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.ExpressionProxy cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.Cast
>   at 
> org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2514)
>   at 
> org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2512)
> {noformat}
> The query produces 2 bigint values, but the table's schema expects 2 int 
> values, so Spark wraps each output field with a {{Cast}}.
> Later, in {{InterpretedUnsafeProjection}}, {{prepareExpressions}} tries to 
> wrap the two {{Cast}} expressions with an {{ExpressionProxy}}. However, the 
> parent expression of each {{Cast}} is a {{CheckOverflowInTableInsert}} 
> expression, which does not accept {{ExpressionProxy}} as a child.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41991) Interpreted mode subexpression elimination can throw exception during insert

2023-01-11 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-41991:
-

 Summary: Interpreted mode subexpression elimination can throw 
exception during insert
 Key: SPARK-41991
 URL: https://issues.apache.org/jira/browse/SPARK-41991
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Bruce Robbins


Example:
{noformat}
drop table if exists tbl1;
create table tbl1 (a int, b int) using parquet;

set spark.sql.codegen.wholeStage=false;
set spark.sql.codegen.factoryMode=NO_CODEGEN;

insert into tbl1
select id as a, id as b
from range(1, 5);
{noformat}
This results in the following exception:
{noformat}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.ExpressionProxy cannot be cast to 
org.apache.spark.sql.catalyst.expressions.Cast
at 
org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2514)
at 
org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2512)
{noformat}
The query produces 2 bigint values, but the table's schema expects 2 int 
values, so Spark wraps each output field with a {{Cast}}.

Later, in {{InterpretedUnsafeProjection}}, {{prepareExpressions}} tries to wrap 
the two {{Cast}} expressions with an {{ExpressionProxy}}. However, the parent 
expression of each {{Cast}} is a {{CheckOverflowInTableInsert}} expression, 
which does not accept {{ExpressionProxy}} as a child.
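
A spark-shell equivalent of the reproduction follows, together with one possible 
mitigation; the mitigation is an assumption (disable interpreted subexpression 
elimination so {{ExpressionProxy}} is never introduced), not a confirmed fix:
{noformat}
// Same reproduction as the SQL above, driven from spark-shell.
spark.sql("set spark.sql.codegen.wholeStage=false")
spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

// Assumed mitigation (unverified): with subexpression elimination off, prepareExpressions
// should not wrap the Cast expressions in ExpressionProxy.
spark.sql("set spark.sql.subexpressionElimination.enabled=false")

spark.sql("drop table if exists tbl1")
spark.sql("create table tbl1 (a int, b int) using parquet")
spark.sql("insert into tbl1 select id as a, id as b from range(1, 5)")
spark.sql("select * from tbl1").show()
{noformat}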





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs

2022-12-31 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-41804:
--
Description: 
Reproduction steps:
{noformat}
// create a file of vector data
import org.apache.spark.ml.linalg.{DenseVector, Vector}

case class TestRow(varr: Array[Vector])
val values = Array(0.1d, 0.2d, 0.3d)
val dv = new DenseVector(values).asInstanceOf[Vector]

val ds = Seq(TestRow(Array(dv, dv))).toDS
ds.coalesce(1).write.mode("overwrite").format("parquet").save("vector_data")

// this works
spark.read.format("parquet").load("vector_data").collect

sql("set spark.sql.codegen.wholeStage=false")
sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

// this will get an error
spark.read.format("parquet").load("vector_data").collect
{noformat}
The error varies each time you run it, e.g.:
{noformat}
Sparse vectors require that the dimension of the indices match the dimension of 
the values.
You provided 2 indices and  6619240 values.
{noformat}
or
{noformat}
org.apache.spark.SparkRuntimeException: Error while decoding: 
java.lang.NegativeArraySizeException
{noformat}
or
{noformat}
java.lang.OutOfMemoryError: Java heap space
  at 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.toDoubleArray(UnsafeArrayData.java:414)
{noformat}
or
{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0xa) at pc=0x0001120c9d30, pid=64213, tid=0x1003
#
# JRE version: Java(TM) SE Runtime Environment (8.0_311-b11) (build 
1.8.0_311-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.311-b11 mixed mode bsd-amd64 
compressed oops)
# Problematic frame:
# V  [libjvm.dylib+0xc9d30]  acl_CopyRight+0x29
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# //hs_err_pid64213.log
Compiled method (nm)  582142 11318 n 0   sun.misc.Unsafe::copyMemory 
(native)
 total in heap  [0x00011efa8890,0x00011efa8be8] = 856
 relocation [0x00011efa89b8,0x00011efa89f8] = 64
 main code  [0x00011efa8a00,0x00011efa8be8] = 488
Compiled method (nm)  582142 11318 n 0   sun.misc.Unsafe::copyMemory 
(native)
 total in heap  [0x00011efa8890,0x00011efa8be8] = 856
 relocation [0x00011efa89b8,0x00011efa89f8] = 64
 main code  [0x00011efa8a00,0x00011efa8be8] = 488
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
{noformat}

  was:
Reproduction steps:
{noformat}
// create a file of vector data
import org.apache.spark.ml.linalg.{DenseMatrix, DenseVector, Matrix, Vector}

case class TestRow(varr: Array[Vector])
val values = Array(0.1d, 0.2d, 0.3d)
val dv = new DenseVector(values).asInstanceOf[Vector]

val ds = Seq(TestRow(Array(dv, dv))).toDS
ds.coalesce(1).write.mode("overwrite").format("parquet").save("vector_data")

// this works
spark.read.format("parquet").load("vector_data").collect

sql("set spark.sql.codegen.wholeStage=false")
sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

// this will get an error
spark.read.format("parquet").load("vector_data").collect
{noformat}
The error varies each time you run it, e.g.:
{noformat}
Sparse vectors require that the dimension of the indices match the dimension of 
the values.
You provided 2 indices and  6619240 values.
{noformat}
or
{noformat}
org.apache.spark.SparkRuntimeException: Error while decoding: 
java.lang.NegativeArraySizeException
{noformat}
or
{noformat}
java.lang.OutOfMemoryError: Java heap space
  at 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.toDoubleArray(UnsafeArrayData.java:414)
{noformat}
or
{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0xa) at pc=0x0001120c9d30, pid=64213, tid=0x1003
#
# JRE version: Java(TM) SE Runtime Environment (8.0_311-b11) (build 
1.8.0_311-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.311-b11 mixed mode bsd-amd64 
compressed oops)
# Problematic frame:
# V  [libjvm.dylib+0xc9d30]  acl_CopyRight+0x29
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# //hs_err_pid64213.log
Compiled method (nm)  582142 11318 n 0   sun.misc.Unsafe::copyMemory 
(native)
 total in heap  [0x00011efa8890,0x00011efa8be8] = 856
 relocation [0x00011efa89b8,0x00011efa89f8] = 64
 main code  [0x00011efa8a00,0x00011efa8be8] = 488
Compiled method (nm)  582142 11318 n 0   sun.misc.Unsafe::copyMemory 
(native)
 total in heap  [0x00011efa8890,0x00011efa8be8] = 856
 relocation [0x00011efa89b8,0x00011efa89f8] = 64
 main code  
