[jira] [Updated] (SPARK-47299) Use the same `versions.json` in the dropdown of different versions of PySpark documents

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47299:
---
Labels: pull-request-available  (was: )

> Use the same `versions.json` in the dropdown of different versions of 
> PySpark documents
> 
>
> Key: SPARK-47299
> URL: https://issues.apache.org/jira/browse/SPARK-47299
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47299) Use the same `versions.json` in the dropdown of different versions of PySpark documents

2024-03-05 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-47299:
---

 Summary: Use the same `versions.json` in the dropdown of 
different versions of PySpark documents
 Key: SPARK-47299
 URL: https://issues.apache.org/jira/browse/SPARK-47299
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 3.5.1, 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47298) Upgrade `mysql-connector-j` to `8.3.0` and `mariadb-java-client` to `2.7.12`

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47298:
---
Labels: pull-request-available  (was: )

> Upgrade `mysql-connector-j` to `8.3.0` and `mariadb-java-client` to `2.7.12`
> 
>
> Key: SPARK-47298
> URL: https://issues.apache.org/jira/browse/SPARK-47298
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47298) Upgrade `mysql-connector-j` to `8.3.0` and `mariadb-java-client` to `2.7.12`

2024-03-05 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-47298:
---

 Summary: Upgrade `mysql-connector-j` to `8.3.0` and 
`mariadb-java-client` to `2.7.12`
 Key: SPARK-47298
 URL: https://issues.apache.org/jira/browse/SPARK-47298
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47297) regular expressions

2024-03-05 Thread Jira
Uroš Bojanić created SPARK-47297:


 Summary: regular expressions
 Key: SPARK-47297
 URL: https://issues.apache.org/jira/browse/SPARK-47297
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47293) Build batchSchema with total sparkSchema instead of append one by one

2024-03-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47293.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45396
[https://github.com/apache/spark/pull/45396]

> Build batchSchema with total sparkSchema instead of append one by one
> -
>
> Key: SPARK-47293
> URL: https://issues.apache.org/jira/browse/SPARK-47293
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Binjie Yang
>Assignee: Binjie Yang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We can simply initialize batchSchema with the whole sparkSchema instead of 
> appending one by one.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47293) Build batchSchema with total sparkSchema instead of append one by one

2024-03-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47293:


Assignee: Binjie Yang

> Build batchSchema with total sparkSchema instead of append one by one
> -
>
> Key: SPARK-47293
> URL: https://issues.apache.org/jira/browse/SPARK-47293
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Binjie Yang
>Assignee: Binjie Yang
>Priority: Minor
>  Labels: pull-request-available
>
> We can simply initialize batchSchema with the whole sparkSchema instead of 
> appending one by one.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47296) fail all unsupported functions

2024-03-05 Thread Jira
Uroš Bojanić created SPARK-47296:


 Summary: fail all unsupported functions
 Key: SPARK-47296
 URL: https://issues.apache.org/jira/browse/SPARK-47296
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47295) startswith, endswith (non-binary collations)

2024-03-05 Thread Jira
Uroš Bojanić created SPARK-47295:


 Summary: startswith, endswith (non-binary collations)
 Key: SPARK-47295
 URL: https://issues.apache.org/jira/browse/SPARK-47295
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46835) Join support for strings with collation

2024-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46835.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45389
[https://github.com/apache/spark/pull/45389]

> Join support for strings with collation
> ---
>
> Key: SPARK-46835
> URL: https://issues.apache.org/jira/browse/SPARK-46835
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46835) Join support for strings with collation

2024-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46835:
---

Assignee: Aleksandar Tomic

> Join support for strings with collation
> ---
>
> Key: SPARK-46835
> URL: https://issues.apache.org/jira/browse/SPARK-46835
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47294) OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec)

2024-03-05 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-47294.
---
Resolution: Not A Problem

> OptimizeSkewInRebalanceRepartitions should support 
> ProjectExec(_,ShuffleQueryStageExec)
> ---
>
> Key: SPARK-47294
> URL: https://issues.apache.org/jira/browse/SPARK-47294
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: angerszhu
>Priority: Major
>  Labels: pull-request-available
>
> Currently, OptimizeSkewInRebalanceRepartitions only matches 
> ShuffleQueryStageExec, so it only works for plain SQL queries; it cannot handle 
> inserts, because there is a ProjectExec between the ShuffleQueryStageExec and 
> the insert command.
> {code:java}
> plan transformUp {
>   case p @ ProjectExec(_, stage: ShuffleQueryStageExec) if 
> isSupported(stage.shuffle) =>
> p.copy(child = tryOptimizeSkewedPartitions(stage))
>   case stage: ShuffleQueryStageExec if isSupported(stage.shuffle) =>
> tryOptimizeSkewedPartitions(stage)
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47294) OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec)

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47294:
---
Labels: pull-request-available  (was: )

> OptimizeSkewInRebalanceRepartitions should support 
> ProjectExec(_,ShuffleQueryStageExec)
> ---
>
> Key: SPARK-47294
> URL: https://issues.apache.org/jira/browse/SPARK-47294
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: angerszhu
>Priority: Major
>  Labels: pull-request-available
>
> Currently, OptimizeSkewInRebalanceRepartitions only matches 
> ShuffleQueryStageExec, so it only works for plain SQL queries; it cannot handle 
> inserts, because there is a ProjectExec between the ShuffleQueryStageExec and 
> the insert command.
> {code:java}
> plan transformUp {
>   case p @ ProjectExec(_, stage: ShuffleQueryStageExec) if 
> isSupported(stage.shuffle) =>
> p.copy(child = tryOptimizeSkewedPartitions(stage))
>   case stage: ShuffleQueryStageExec if isSupported(stage.shuffle) =>
> tryOptimizeSkewedPartitions(stage)
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47294) OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec)

2024-03-05 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-47294:
--
Description: 
Currently, OptimizeSkewInRebalanceRepartitions only matches ShuffleQueryStageExec, 
so it only works for plain SQL queries; it cannot handle inserts, because there is 
a ProjectExec between the ShuffleQueryStageExec and the insert command.
{code:java}
plan transformUp {
  case p @ ProjectExec(_, stage: ShuffleQueryStageExec) if 
isSupported(stage.shuffle) =>
p.copy(child = tryOptimizeSkewedPartitions(stage))
  case stage: ShuffleQueryStageExec if isSupported(stage.shuffle) =>
tryOptimizeSkewedPartitions(stage)
} {code}

> OptimizeSkewInRebalanceRepartitions should support 
> ProjectExec(_,ShuffleQueryStageExec)
> ---
>
> Key: SPARK-47294
> URL: https://issues.apache.org/jira/browse/SPARK-47294
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: angerszhu
>Priority: Major
>
> Currently, OptimizeSkewInRebalanceRepartitions only matches 
> ShuffleQueryStageExec, so it only works for plain SQL queries; it cannot handle 
> inserts, because there is a ProjectExec between the ShuffleQueryStageExec and 
> the insert command.
> {code:java}
> plan transformUp {
>   case p @ ProjectExec(_, stage: ShuffleQueryStageExec) if 
> isSupported(stage.shuffle) =>
> p.copy(child = tryOptimizeSkewedPartitions(stage))
>   case stage: ShuffleQueryStageExec if isSupported(stage.shuffle) =>
> tryOptimizeSkewedPartitions(stage)
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47294) OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec)

2024-03-05 Thread angerszhu (Jira)
angerszhu created SPARK-47294:
-

 Summary: OptimizeSkewInRebalanceRepartitions should support 
ProjectExec(_,ShuffleQueryStageExec)
 Key: SPARK-47294
 URL: https://issues.apache.org/jira/browse/SPARK-47294
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1, 4.0.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing write support

2024-03-05 Thread Shreyas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823851#comment-17823851
 ] 

Shreyas commented on SPARK-19256:
-

Don't really like spamming in the comments - but this is a much needed feature 
for big-data processing, and has been pending for a while now. Can this be 
given some love please? :D

> Hive bucketing write support
> 
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Tejas Patil
>Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress for Hive bucketing write support in Spark. 
> The goal is for Spark to write Hive bucketed table, to be compatible with 
> other compute engines (Hive and Presto).
>  
> Current status for Hive bucketed tables in Spark:
> No support for reading Hive bucketed tables: a bucketed table is read as a 
> non-bucketed table.
> Wrong behavior for writing Hive ORC and Parquet bucketed tables: an ORC/Parquet 
> bucketed table is written as a non-bucketed table (code path: 
> InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> Writing Hive non-ORC/Parquet bucketed tables is not allowed: an exception is 
> thrown by default when writing a non-ORC/Parquet bucketed table (code path: 
> InsertIntoHiveTable); the exception can be disabled by setting 
> `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, which will 
> write the table as non-bucketed.
>  
> Current status for Hive bucketed table in Hive:
> Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash 
> (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash.
> Hive on Tez: support zero and multiple files per bucket 
> (https://issues.apache.org/jira/browse/HIVE-14014). And more code pointer on 
> read path - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212]
>  .
>  
> Current status for Hive bucketed table in Presto (take presto-sql here):
> Support writing bucketed table with Hive murmur3hash and hivehash 
> ([https://github.com/prestosql/presto/pull/1697]).
> Support zero and multiple files per bucket 
> ([https://github.com/prestosql/presto/pull/822]).
>  
> TL;DR: the goal is to achieve Hive bucketed table compatibility across Spark, 
> Presto and Hive. With this JIRA, we need to add support for writing Hive 
> bucketed tables with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 
> 1.x.y and 2.x.y).
>  
> To allow Spark to efficiently read Hive bucketed tables, a more radical change 
> is needed, so we decided to wait until data source v2 supports bucketing and do 
> the read path on data source v2. The read path will not be covered by this JIRA.
>  
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]
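
A rough spark-shell sketch of the current workaround mentioned above for the InsertIntoHiveTable path (it assumes a Hive-serde, non-ORC/Parquet bucketed table named hive_bucketed_table and a compatible source_table already exist); with the two configs disabled, Spark writes the data as if the table were non-bucketed, which breaks Hive/Presto bucketing compatibility:
{code:java}
// Disable the enforcement checks described above.
spark.sql("SET hive.enforce.bucketing=false")
spark.sql("SET hive.enforce.sorting=false")

// The insert now succeeds, but the output is written without bucketing.
spark.table("source_table").write.mode("append").insertInto("hive_bucketed_table")
{code}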



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47293) Build batchSchema with total sparkSchema instead of append one by one

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47293:
---
Labels: pull-request-available  (was: )

> Build batchSchema with total sparkSchema instead of append one by one
> -
>
> Key: SPARK-47293
> URL: https://issues.apache.org/jira/browse/SPARK-47293
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Binjie Yang
>Priority: Minor
>  Labels: pull-request-available
>
> We can simply initialize batchSchema with the whole sparkSchema instead of 
> appending one by one.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47293) Build batchSchema with total sparkSchema instead of append one by one

2024-03-05 Thread Binjie Yang (Jira)
Binjie Yang created SPARK-47293:
---

 Summary: Build batchSchema with total sparkSchema instead of 
append one by one
 Key: SPARK-47293
 URL: https://issues.apache.org/jira/browse/SPARK-47293
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Binjie Yang


We can simply initialize batchSchema with the whole sparkSchema instead of 
appending one by one.
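
As a minimal illustration of the idea (not the actual reader code), the batch schema can be built in one shot from the full Spark schema instead of appending fields one at a time:
{code:java}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sparkSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))

// Append-one-by-one style this issue wants to avoid:
val appended = sparkSchema.fields.foldLeft(new StructType())((schema, field) => schema.add(field))

// Initialize the batch schema with the whole sparkSchema at once instead:
val batchSchema = StructType(sparkSchema.fields)
{code}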



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47280) Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE

2024-03-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47280.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45384
[https://github.com/apache/spark/pull/45384]

> Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE
> -
>
> Key: SPARK-47280
> URL: https://issues.apache.org/jira/browse/SPARK-47280
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47247) use smaller target size when coalescing partitions with exploding joins

2024-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47247.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45357
[https://github.com/apache/spark/pull/45357]

> use smaller target size when coalescing partitions with exploding joins
> ---
>
> Key: SPARK-47247
> URL: https://issues.apache.org/jira/browse/SPARK-47247
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47247) use smaller target size when coalescing partitions with exploding joins

2024-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47247:
---

Assignee: Wenchen Fan

> use smaller target size when coalescing partitions with exploding joins
> ---
>
> Key: SPARK-47247
> URL: https://issues.apache.org/jira/browse/SPARK-47247
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40763) Should expose driver service name to config for user features

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-40763:
---
Labels: pull-request-available  (was: )

> Should expose driver service name to config for user features
> -
>
> Key: SPARK-40763
> URL: https://issues.apache.org/jira/browse/SPARK-40763
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: binjie yang
>Priority: Minor
>  Labels: pull-request-available
>
> Currently on Kubernetes, a user's feature step, which builds the user's 
> Kubernetes resources while spark-submit creates the Spark pods, cannot see some 
> Spark resource information, such as the Spark driver service name.
>  
> Users may want to expose some Spark pod information to build their own custom 
> resources, such as an Ingress, etc.
>  
> We want a way to expose the Spark driver service name, which is currently 
> generated from a clock value and a UUID.
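
For context, a user feature step is a class like the sketch below, plugged in via `spark.kubernetes.driver.pod.featureSteps`; the `driverServiceName` constructor parameter is purely illustrative of the information the step cannot currently obtain, not an existing API:
{code:java}
import io.fabric8.kubernetes.api.model.HasMetadata
import org.apache.spark.deploy.k8s.SparkPod
import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep

// Hypothetical user feature step that would like to build an Ingress routing to
// the driver service; today the generated service name is not visible to it.
class DriverIngressFeatureStep(driverServiceName: String) extends KubernetesFeatureConfigStep {

  // No changes to the driver pod itself.
  override def configurePod(pod: SparkPod): SparkPod = pod

  override def getAdditionalKubernetesResources(): Seq[HasMetadata] = {
    // An Ingress pointing at `driverServiceName` would be constructed here.
    Seq.empty
  }
}
{code}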



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-05 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-47146:

Fix Version/s: 3.5.2
   3.4.3

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>
> I have a long-running Spark job and stumbled upon an executor taking up a lot 
> of threads, resulting in no threads being available on the server. Querying 
> thread details via jstack shows tons of threads named read-ahead. Checking the 
> code confirms that these threads are created by ReadAheadInputStream. This 
> class creates a single-threaded thread pool when it is initialized:
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> This thread pool is closed by ReadAheadInputStream#close(). 
> The call stack for the normal case close() method is
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed when the 
> data in the stream is read through.
> {code:java}
> @Override
> public void loadNext() throws IOException {
>   // Kill the task in case it has been marked as killed. This logic is from
>   // InterruptibleIterator, but we inline it here instead of wrapping the 
> iterator in order
>   // to avoid performance overhead. This check is ad

[jira] [Commented] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-05 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823832#comment-17823832
 ] 

Mridul Muralidharan commented on SPARK-47146:
-

Backported to 3.5 and 3.4 in PR: https://github.com/apache/spark/pull/45390

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>
> I have a long-running Spark job and stumbled upon an executor taking up a lot 
> of threads, resulting in no threads being available on the server. Querying 
> thread details via jstack shows tons of threads named read-ahead. Checking the 
> code confirms that these threads are created by ReadAheadInputStream. This 
> class creates a single-threaded thread pool when it is initialized:
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> This thread pool is closed by ReadAheadInputStream#close(). 
> The call stack for the normal case close() method is
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed when the 
> data in the stream is read through.
> {code:java}
> @Override
> public void loadNext() throws IOException {
>   // Kill the task in case it has been marked as killed. This logic is from
>   // InterruptibleIterator, but we inline it here instead of wrap

[jira] [Resolved] (SPARK-47285) AdaptiveSparkPlanExec should always use the context.session

2024-03-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47285.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45388
[https://github.com/apache/spark/pull/45388]

> AdaptiveSparkPlanExec should always use the context.session
> ---
>
> Key: SPARK-47285
> URL: https://issues.apache.org/jira/browse/SPARK-47285
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47285) AdaptiveSparkPlanExec should always use the context.session

2024-03-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47285:


Assignee: XiDuo You

> AdaptiveSparkPlanExec should always use the context.session
> ---
>
> Key: SPARK-47285
> URL: https://issues.apache.org/jira/browse/SPARK-47285
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47292) safeMapToJValue should consider when map is null

2024-03-05 Thread Wei Liu (Jira)
Wei Liu created SPARK-47292:
---

 Summary: safeMapToJValue should consider when map is null
 Key: SPARK-47292
 URL: https://issues.apache.org/jira/browse/SPARK-47292
 Project: Spark
  Issue Type: New Feature
  Components: Connect, SS
Affects Versions: 3.5.1, 4.0.0
Reporter: Wei Liu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44746) Improve the documentation for TABLE input arguments for UDTFs

2024-03-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44746.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45375
[https://github.com/apache/spark/pull/45375]

> Improve the documentation for TABLE input arguments for UDTFs
> -
>
> Key: SPARK-44746
> URL: https://issues.apache.org/jira/browse/SPARK-44746
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We should add more examples for using Python UDTFs with TABLE arguments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44746) Improve the documentation for TABLE input arguments for UDTFs

2024-03-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44746:


Assignee: Daniel

> Improve the documentation for TABLE input arguments for UDTFs
> -
>
> Key: SPARK-44746
> URL: https://issues.apache.org/jira/browse/SPARK-44746
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
>
> We should add more examples for using Python UDTFs with TABLE arguments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF

2024-03-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47277.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/45380

> PySpark util function assertDataFrameEqual should not support streaming DF
> --
>
> Key: SPARK-47277
> URL: https://issues.apache.org/jira/browse/SPARK-47277
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, SQL, Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47291) Add Parquet file reader metrics to scan

2024-03-05 Thread Parth Chandra (Jira)
Parth Chandra created SPARK-47291:
-

 Summary: Add Parquet file reader metrics to scan
 Key: SPARK-47291
 URL: https://issues.apache.org/jira/browse/SPARK-47291
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Parth Chandra


With the addition of external metrics support in Parquet 
([PARQUET-2374|https://issues.apache.org/jira/browse/PARQUET-2374]), it is now 
possible to gather file-level scan metrics and have them displayed in a Parquet 
Scan's metrics. This can be done for both DSV1 and DSV2 implementations by 
providing a ParquetMetricsCallback implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47290) Extend CustomTaskMetric to allow metric values from multiple sources

2024-03-05 Thread Parth Chandra (Jira)
Parth Chandra created SPARK-47290:
-

 Summary: Extend CustomTaskMetric to allow metric values from 
multiple sources
 Key: SPARK-47290
 URL: https://issues.apache.org/jira/browse/SPARK-47290
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Parth Chandra


Custom task metrics allow a DSV2 source to add metrics that can be displayed in 
the UI. However, for DSV2 file sources, the FilePartitionReader may have multiple 
file readers, and each of these may report its own metrics, which need to be 
aggregated and bubbled up to the Scan. There is currently no way to update a 
metric value.
A new interface that extends CustomTaskMetric and defines a way to allow 
updates would let a DSV2 file scan implementation overcome this.
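
A minimal sketch of what such an interface could look like, building on the existing org.apache.spark.sql.connector.metric.CustomTaskMetric contract (name()/value()); the trait name and update() method below are hypothetical, not an existing Spark API:
{code:java}
import org.apache.spark.sql.connector.metric.CustomTaskMetric

// Hypothetical updatable metric: each file reader inside a partition reader
// could call update(), and value() reports the aggregated total for the task.
trait UpdatableTaskMetric extends CustomTaskMetric {
  private var total: Long = 0L

  def update(delta: Long): Unit = synchronized { total += delta }

  override def value(): Long = synchronized { total }
}

// Example concrete metric a Parquet file reader might bump per row group read.
class ParquetRowGroupsReadMetric extends UpdatableTaskMetric {
  override def name(): String = "parquetRowGroupsRead"
}
{code}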



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2024-03-05 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823510#comment-17823510
 ] 

Asif edited comment on SPARK-33152 at 3/5/24 6:43 PM:
--

[~tedjenks] Unfortunately I am not a committer. As part of Workday, I had 
opened this Jira and opened a PR to fix this issue completely, which required 
different logic. The changes are extensive and they were never reviewed or 
discussed by the OS community. This PR has been in production for the past 3 
years at Workday.

As to why a check is not added, etc.:

That would be unclean and is not easy to implement in the current codebase, 
because it would result in various other issues, like new redundant filters 
being inferred and other messy bugs, since the constraint code is sensitive to 
the constraints coming from each node below and the constraints available at 
the current node when deciding whether to create new filters or not.

Constraints are created per operator node (project, filter, etc.), and 
arbitrarily putting a limit on constraints at a given operator will impact the 
new filters being created.


was (Author: ashahid7):
[~tedjenks] Unfortunately I am not a committer. As part of Workday, I had 
opened this Jira and opened a PR to fix this issue completely, which required 
different logic. The changes are extensive and they were never reviewed or 
discussed by the OS community. This PR has been in production for the past 3 
years at Workday.

As to why a check is not added, etc.:

That would be unclean and is not easy to implement in the current codebase, 
because it would result in various other issues, like new/wrong filters being 
inferred and other messy bugs, since the constraint code is sensitive to the 
constraints coming from each node below and the constraints available at the 
current node when deciding whether to create new filters or not.

Constraints are created per operator node (project, filter, etc.), and 
arbitrarily putting a limit on constraints at a given operator will impact the 
new filters being created.

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> ---
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing new algorithm to create, store and use constraints for removing 
> redundant filters & inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases( with certain use cases the compilation time can go into 
> hours), potential to cause OOM, may miss removing redundant filters in 
> different scenarios, may miss creating IsNotNull constraints in different 
> scenarios, does not push compound predicates in Join.
>  # This issue if not fixed can cause OutOfMemory issue or unacceptable query 
> compilation times.
> Have added a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change*
>  # It is more effective in filter pruning as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite 
> where current code is not able to identify the redundant filter in some cases.
>  # It is able to generate a better optimized plan for join queries as it can 
> push compound predicates.
>  # The current logic can miss a lot of possible cases of removing redundant 
> predicates, as it fails to take into account if same attribute or its aliases 
> are repeated multiple times in a complex expression.
>  # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove on the basis of constraint data. In some 
> cases the rule works, just by the virtue of previous rules helping it out to 
> cover the inaccuracy. That the ConstraintPropagation rule & its function of 
> removal of redundant filters & addition of new inferred filters is dependent 
> on the working of some of the other unrelated previous optimizer rules is 
> behaving, is indicative of issues.
>  # It does away with all the EqualNullSafe constraints as this logic does not 
> need those constraints to be created.
>  # There is at least one test in existing ConstraintPropagationSuite which is 
> missing a IsNotNull constraints beca

[jira] [Created] (SPARK-47289) Allow extensions to log extended information in explain plan

2024-03-05 Thread Parth Chandra (Jira)
Parth Chandra created SPARK-47289:
-

 Summary: Allow extensions to log extended information in explain 
plan
 Key: SPARK-47289
 URL: https://issues.apache.org/jira/browse/SPARK-47289
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Parth Chandra


With session extensions, Spark planning can be extended to apply additional 
rules and modify the execution plan. If an extension replaces a node in the 
plan, the new node will be displayed in the plan. However, it is sometimes 
useful for extensions to provide extended information to the end user explaining 
the impact of the extension. For instance, an extension may automatically 
enable/disable some feature that it provides and could surface this extended 
information in the plan.
The proposal is to optionally turn on extended plan information from 
extensions. Extensions can add additional planning information via a new 
interface that internally uses a new TreeNodeTag, say 'explainPlan'.
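
A minimal sketch of the tag-based approach described above; the helper object and method names are illustrative only, not an existing Spark API:
{code:java}
import org.apache.spark.sql.catalyst.trees.TreeNodeTag
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical helper an extension could use to attach extra explain text to a
// plan node through a TreeNodeTag named "explainPlan".
object ExtensionExplainInfo {
  val ExplainPlanTag: TreeNodeTag[String] = TreeNodeTag[String]("explainPlan")

  def attach(plan: SparkPlan, info: String): Unit = plan.setTagValue(ExplainPlanTag, info)

  def read(plan: SparkPlan): Option[String] = plan.getTagValue(ExplainPlanTag)
}
{code}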



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47033) EXECUTE IMMEDIATE USING does not recognize session variable names

2024-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47033.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45293
[https://github.com/apache/spark/pull/45293]

> EXECUTE IMMEDIATE USING does not recognize session variable names
> -
>
> Key: SPARK-47033
> URL: https://issues.apache.org/jira/browse/SPARK-47033
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {noformat}
> DECLARE parm = 'Hello';
> EXECUTE IMMEDIATE 'SELECT :parm' USING parm;
> [ALL_PARAMETERS_MUST_BE_NAMED] Using name parameterized queries requires all 
> parameters to be named. Parameters missing names: "parm". SQLSTATE: 07001
> EXECUTE IMMEDIATE 'SELECT :parm' USING parm AS parm;
> Hello
> {noformat}
> variables are like column references, they act as their own aliases and thus 
> should not be required to be named to associate with a named parameter with 
> the same name.
> Note that unlike for pySpark this should be case insensitive (haven't 
> verified).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43883) CTAS Command Nodes Prevent Some Optimizer Rules From Running

2024-03-05 Thread Ted Chester Jenks (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Chester Jenks resolved SPARK-43883.
---
Resolution: Won't Fix

> CTAS Command Nodes Prevent Some Optimizer Rules From Running
> 
>
> Key: SPARK-43883
> URL: https://issues.apache.org/jira/browse/SPARK-43883
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Ted Chester Jenks
>Priority: Major
> Attachments: Not Working - Create Table.png, Working - 3.2.0.png, 
> Working - No Create Table.png
>
>
> The changes introduced to resolve SPARK-41713 in 
> [https://github.com/apache/spark/pull/39220] modified the CTAS commands from 
> having a `DataWritingCommand` trait to a `LeafRunnableCommand` trait. The 
> `DataWritingCommand` trait extends `UnaryCommand`, and has children set to 
> the value of query in the CTAS command. This means that when `transform` is 
> called to traverse the tree with the CTAS command at the root, the entire 
> query is traversed. `LeafRunnableCommand` has a `LeafLike` trait which 
> explicitly sets the value of children to `Nil`. This means that when 
> `transform` is called on the command, no children are found and the query is 
> unaffected by the rule.
> In practice, this means that optimizer rules that rely on `transform` (such 
> as `BooleanSimplification`) to traverse the tree do not work with a CTAS. 
> This can be demonstrated with a simple query in spark-shell. Without the CTAS 
> we can run a command with an easily simplified boolean expression (`id == 9 
> && id == 9`) and see it gets optimized out:
> !Working - No Create Table.png|width=883,height=342!
> With a CTAS, the optimisation does not get applied (as we can see from the 
> `AND` still present in the optimized and physical plans):
> !Not Working - Create Table.png|width=885,height=524!
> This works in 3.2.0 which had the old CTAS implementation:
> !Working - 3.2.0.png|width=885,height=345!
>  
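
A rough spark-shell reproduction of the behaviour described above (a sketch only; it uses EXPLAIN so the table is not actually created, and assumes a temp view named src):
{code:java}
spark.range(10).createOrReplaceTempView("src")

// Without CTAS, BooleanSimplification removes the duplicated conjunct:
spark.sql("EXPLAIN EXTENDED SELECT * FROM src WHERE id = 9 AND id = 9").show(false)

// With CTAS on Spark 3.4.x, the redundant AND survives into the optimized and physical plans:
spark.sql(
  "EXPLAIN EXTENDED CREATE TABLE t USING parquet AS SELECT * FROM src WHERE id = 9 AND id = 9"
).show(false)
{code}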



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47288) DataType __repr__ change breaks datatype checking (anti-)pattern

2024-03-05 Thread Ted Chester Jenks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823717#comment-17823717
 ] 

Ted Chester Jenks commented on SPARK-47288:
---

[~gurwls223] I saw you on the original PR; curious to hear your thoughts.

> DataType __repr__ change breaks datatype checking (anti-)pattern
> 
>
> Key: SPARK-47288
> URL: https://issues.apache.org/jira/browse/SPARK-47288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Ted Chester Jenks
>Priority: Major
>
> This pr: [https://github.com/apache/spark/pull/34320]
> Made reprs for datatype eval-able. This is kind of nice, but we have a ton of 
> users doing stuff like:
>  
> {code:java}
> if str(data_type) == "StringType":
>    ...
> {code}
>  
> Which breaks.
>  
> What would people think of adding a __str__ to the base class that returns 
> the old behaviour, so we can have the best of both worlds?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47288) DataType __repr__ change breaks datatype checking (anti-)pattern

2024-03-05 Thread Ted Chester Jenks (Jira)
Ted Chester Jenks created SPARK-47288:
-

 Summary: DataType __repr__ change breaks datatype checking 
(anti-)pattern
 Key: SPARK-47288
 URL: https://issues.apache.org/jira/browse/SPARK-47288
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1
Reporter: Ted Chester Jenks


This pr: [https://github.com/apache/spark/pull/34320]

Made reprs for datatype eval-able. This is kind of nice, but we have a ton of 
users doing stuff like:

 
{code:java}
if str(data_type) == "StringType":
   ...
{code}
 

Which breaks.

 

What would people think of adding a __str__ to the base class that returns the 
old behaviour, so we can have the best of both worlds?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-03-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823713#comment-17823713
 ] 

Nicholas Chammas commented on SPARK-46810:
--

[~cloud_fan], [~LuciferYang], [~beliefer], and [~dongjoon] - Friendly ping.

Any thoughts on how to resolve the inconsistent error terminology?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>  **** ARRAY
>  **** MAP
>  **** STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It aligns most closely to the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a "class" to a "category" is low impact and 
> may not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms do not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [already 
> used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
>  in user-facing documentation.
> h1. Option 3: "Error Class" and "State Class"
>  * SQL state class: 42
>  * SQL state sub-class: K01
>  * SQL state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The chang

[jira] [Updated] (SPARK-47281) Update the `versions. json` file for the already released spark version

2024-03-05 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-47281:

Summary: Update the `versions. json` file for the already released spark 
version  (was: Update the `versions. json` file for the already released saprk 
version)

> Update the `versions. json` file for the already released spark version
> ---
>
> Key: SPARK-47281
> URL: https://issues.apache.org/jira/browse/SPARK-47281
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47287) Aggregate in not causes

2024-03-05 Thread Ted Chester Jenks (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Chester Jenks updated SPARK-47287:
--
Description: 
 

The snippet below is confirmed working with Spark 3.2.1 and broken with Spark 3.4.1. I believe this is a bug.
{code:java}
Dataset<Row> ds = dummyDataset
    .withColumn("flag",
        functions.not(functions.coalesce(functions.col("bool1"), functions.lit(false)).equalTo(true)))
    .groupBy("code")
    .agg(functions.max(functions.col("flag")).alias("flag"));
ds.show(); {code}
It fails with:
{code:java}
Caused by: java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185)
at scala.Option.map(Option.scala:230)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
at 
org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateAggregateFunc(V2ExpressionBuilder.scala:293)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:98)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
at 
org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.translate$1(DataSourceStrategy.scala:700){code}
 

 

  was:
 

The snippet below is confirmed working with Spark 3.2.1 and broken with Spark 3.4.1. I believe this is a bug.
{code:java}
Dataset<Row> ds = dummyDataset
    .withColumn("flag",
        functions.not(functions.coalesce(functions.col("bool1"), functions.lit(false)).equalTo(true)))
    .groupBy("code")
    .agg(functions.max(functions.col("flag")).alias("flag"));
ds.show(); {code}
It fails with:

 

 
{code:java}
Caused by: java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185)
at scala.Option.map(Option.scala:230)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
at 
org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateAggregateFunc(V2ExpressionBuilder.scala:293)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:98)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
at 
org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.translate$1(DataSourceStrategy.scala:700)
 {code}
 

 


> Aggregate in not causes 
> 
>
> Key: SPARK-47287
> URL: https://issues.apache.org/jira/browse/SPARK-47287
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Ted Chester Jenks
>Priority: Major
>
>  
> The snippet below is confirmed working with Spark 3.2.1 and broken with Spark 3.4.1. I believe this is a bug.
> {code:java}
> Dataset<Row> ds = dummyDataset
>     .withColumn("flag",
>         functions.not(functions.coalesce(functions.col("bool1"), functions.lit(false)).equalTo(true)))
>     .groupBy("code")
>     .agg(functions.max(functions.col("flag")).alias("flag"));
> ds.show(); {code}
> It fails with:
> {code:java}
> Caused by: java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:208)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
>   at 
> org.apache.spark.sql.

[jira] [Created] (SPARK-47287) Aggregate in not causes

2024-03-05 Thread Ted Chester Jenks (Jira)
Ted Chester Jenks created SPARK-47287:
-

 Summary: Aggregate in not causes 
 Key: SPARK-47287
 URL: https://issues.apache.org/jira/browse/SPARK-47287
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1
Reporter: Ted Chester Jenks


 

The snippet below is confirmed working with Spark 3.2.1 and broken with Spark 3.4.1. I believe this is a bug.
{code:java}
Dataset<Row> ds = dummyDataset
    .withColumn("flag",
        functions.not(functions.coalesce(functions.col("bool1"), functions.lit(false)).equalTo(true)))
    .groupBy("code")
    .agg(functions.max(functions.col("flag")).alias("flag"));
ds.show(); {code}
It fails with:

 

 
{code:java}
Caused by: java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185)
at scala.Option.map(Option.scala:230)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
at 
org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateAggregateFunc(V2ExpressionBuilder.scala:293)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:98)
at 
org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
at 
org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.translate$1(DataSourceStrategy.scala:700)
 {code}
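
For readers who want to try this locally, here is a minimal, self-contained Scala sketch of the same pattern. The sample data is made up for illustration, and the assertion above is raised inside V2ExpressionBuilder during aggregate push-down, so reproducing it likely requires reading from a DataSource V2 source that supports aggregate push-down rather than an in-memory DataFrame.
{code:scala}
// Hedged sketch of the failing pattern from the report. The sample data is
// hypothetical; the reported assertion fires during DSv2 aggregate push-down,
// so an in-memory DataFrame may optimize cleanly.
import org.apache.spark.sql.functions._
import spark.implicits._   // spark is the active SparkSession (e.g. in spark-shell)

val dummyDataset = Seq(("a", Some(true)), ("a", None), ("b", Some(false)))
  .toDF("code", "bool1")

val ds = dummyDataset
  .withColumn("flag", not(coalesce(col("bool1"), lit(false)) === true))
  .groupBy("code")
  .agg(max(col("flag")).alias("flag"))

ds.show()
{code}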
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47198) Is it possible to dynamically add backend service to ingress with Kubernetes?

2024-03-05 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823616#comment-17823616
 ] 

Kent Yao commented on SPARK-47198:
--

Hi [~melin], it would be better to ask your questions via the mailing list so 
that more people can see them and respond.

> Is it possible to dynamically add backend service to ingress with Kubernetes?
> -
>
> Key: SPARK-47198
> URL: https://issues.apache.org/jira/browse/SPARK-47198
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: melin
>Priority: Major
>
> Spark on K8s runs multiple Spark apps at the same time. A proxy/[sparkappid] 
> path forwards to a different Spark app UI console based on the sparkappid. Spark 
> apps are dynamically added and removed, so the ingress needs to dynamically add the corresponding Spark svc.
> [sparkappid]_svc == spark svc name
> [https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html]
> [~Qin Yao] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47198) Is it possible to dynamically add backend service to ingress with Kubernetes?

2024-03-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47198.
--
Resolution: Information Provided

> Is it possible to dynamically add backend service to ingress with Kubernetes?
> -
>
> Key: SPARK-47198
> URL: https://issues.apache.org/jira/browse/SPARK-47198
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: melin
>Priority: Major
>
> Spark on K8s runs multiple Spark apps at the same time. A proxy/[sparkappid] 
> path forwards to a different Spark app UI console based on the sparkappid. Spark 
> apps are dynamically added and removed, so the ingress needs to dynamically add the corresponding Spark svc.
> [sparkappid]_svc == spark svc name
> [https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html]
> [~Qin Yao] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47168) Disable parquet filter pushdown for non default collated strings

2024-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47168.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45262
[https://github.com/apache/spark/pull/45262]

> Disable parquet filter pushdown for non default collated strings
> 
>
> Key: SPARK-47168
> URL: https://issues.apache.org/jira/browse/SPARK-47168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

2024-03-05 Thread TianyiMa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TianyiMa updated SPARK-47279:
-
Description: 
We encountered the Spark driver hanging for about 11 hours before it was finally killed by 
the user. In the driver log there is an error: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

After detailed analysis, we found that the driver submitted task 0.0 to executor 4 at 
"16:40:50", executor 4 finished task 0.0 at "16:42:39", and executor 4 then sent the 
result back to the driver. But at the same time there was not sufficient memory on the 
server running the driver, so the driver was "unable to create new native thread" to 
handle the successful result of task 0.0. The driver therefore thinks task 0.0 has not 
finished and waits for the missing result forever.

 

driver submit task 0.0

!driver_submit_task.png!

 

executor 4 task 0.0

!executor_4.png!

 

oom-killer:

!oom-killer.png!

  was:
we encounter that spark driver hangs for about 11 hours,  and finall killed by 
user. In the driver log there is an error log: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

After detailed analysis, we found that, the driver submitted task 0.0 at 
"16:40:50" to executor 4, and executor 4 finished the task 0.0 at "16:42:39", 
then executor 4 sent results to the driver. But in the same time, there is not 
sufficient memory in the the server that running the driver, the driver "unable 
to create new native 

[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

2024-03-05 Thread TianyiMa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TianyiMa updated SPARK-47279:
-
Attachment: oom-killer.png

> spark driver process hangs due to "unable to create new native thread"
> --
>
> Key: SPARK-47279
> URL: https://issues.apache.org/jira/browse/SPARK-47279
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 3.1.1, 3.5.0
>Reporter: TianyiMa
>Priority: Major
>  Labels: pull-request-available
> Attachments: driver_submit_task.png, executor_4.png, oom-killer.png
>
>
> We encountered the Spark driver hanging for about 11 hours before it was finally 
> killed by the user. In the driver log there is an error: 
> {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
> happened while processing message in the inbox for CoarseGrainedScheduler
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:719)
>         at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>         at 
> org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
>         at 
> org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
>         at 
> org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
>         at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
>         at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>         at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> {quote}
>  
> After detailed analysis, we found that the driver submitted task 0.0 to executor 4 at 
> "16:40:50", executor 4 finished task 0.0 at "16:42:39", and executor 4 then sent the 
> result back to the driver. But at the same time there was not sufficient memory on the 
> server running the driver, so the driver was "unable to create new native thread" to 
> handle the successful result of task 0.0. The driver therefore thinks task 0.0 has not 
> finished and waits for the missing result forever.
>  
> driver submit task 0.0
> !driver_submit_task.png!
>  
> executor 4 task 0.0
> !executor_4.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag

2024-03-05 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-47102:


Assignee: Mihailo Milosevic

> Add COLLATION_ENABLED config flag
> -
>
> Key: SPARK-47102
> URL: https://issues.apache.org/jira/browse/SPARK-47102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>
> *What changes were proposed in this pull request?*
> This PR adds a COLLATION_ENABLED config to `SQLConf` and introduces a new error 
> class `COLLATION_SUPPORT_NOT_ENABLED` to report an appropriate error when this 
> under-development feature is used.
> *Why are the changes needed?*
> We want to make collations configurable via this flag. These changes disable 
> usage of the `collate` and `collation` functions, along with any `COLLATE` syntax, 
> when the flag is set to false. By default, the flag is set to false.
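
A rough sketch of how the flag is expected to behave, based only on the description above. The conf key name (`spark.sql.collation.enabled`) and the collation name are assumptions, not taken from the ticket:
{code:scala}
// Hedged sketch: the conf key "spark.sql.collation.enabled" and the collation
// name are assumptions; only COLLATION_ENABLED and COLLATION_SUPPORT_NOT_ENABLED
// are named in the ticket.
spark.conf.set("spark.sql.collation.enabled", "false")   // the described default

// With the flag off, any collation usage should fail with COLLATION_SUPPORT_NOT_ENABLED:
// spark.sql("SELECT collate('abc', 'ucs_basic_lcase')").show()
// spark.sql("SELECT 'abc' COLLATE 'ucs_basic_lcase'").show()

spark.conf.set("spark.sql.collation.enabled", "true")
spark.sql("SELECT collation(collate('abc', 'ucs_basic_lcase'))").show()
{code}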



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47102) Add COLLATION_ENABLED config flag

2024-03-05 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-47102.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45285
[https://github.com/apache/spark/pull/45285]

> Add COLLATION_ENABLED config flag
> -
>
> Key: SPARK-47102
> URL: https://issues.apache.org/jira/browse/SPARK-47102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> *What changes were proposed in this pull request?*
> This PR adds a COLLATION_ENABLED config to `SQLConf` and introduces a new error 
> class `COLLATION_SUPPORT_NOT_ENABLED` to report an appropriate error when this 
> under-development feature is used.
> *Why are the changes needed?*
> We want to make collations configurable via this flag. These changes disable 
> usage of the `collate` and `collation` functions, along with any `COLLATE` syntax, 
> when the flag is set to false. By default, the flag is set to false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46835) Join support for strings with collation

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46835:
---
Labels: pull-request-available  (was: )

> Join support for strings with collation
> ---
>
> Key: SPARK-46835
> URL: https://issues.apache.org/jira/browse/SPARK-46835
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47239) Support distinct window function

2024-03-05 Thread Mingliang Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingliang Zhu updated SPARK-47239:
--
Issue Type: New Feature  (was: Bug)

> Support distinct window function
> 
>
> Key: SPARK-47239
> URL: https://issues.apache.org/jira/browse/SPARK-47239
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Mingliang Zhu
>Priority: Major
>  Labels: pull-request-available
>
> Support distinct window function when window frame is entire partition frame 
> or
>  growing frame.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47286) IN operator support

2024-03-05 Thread Aleksandar Tomic (Jira)
Aleksandar Tomic created SPARK-47286:


 Summary: IN operator support
 Key: SPARK-47286
 URL: https://issues.apache.org/jira/browse/SPARK-47286
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Aleksandar Tomic


At this point the following query works fine:
```
sql("select * from t1 where ucs_basic_lcase in ('aaa' collate 'ucs_basic_lcase', 'bbb' collate 'ucs_basic_lcase')").show()
```

But if we omit the explicit collate, or even mix collations:
```
sql("select * from t1 where ucs_basic_lcase in ('aaa' collate 'ucs_basic_lcase', 'bbb')").show()
```

the query would still run and return invalid results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47285) AdaptiveSparkPlanExec should always use the context.session

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47285:
---
Labels: pull-request-available  (was: )

> AdaptiveSparkPlanExec should always use the context.session
> ---
>
> Key: SPARK-47285
> URL: https://issues.apache.org/jira/browse/SPARK-47285
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47285) AdaptiveSparkPlanExec should always use the context.session

2024-03-05 Thread XiDuo You (Jira)
XiDuo You created SPARK-47285:
-

 Summary: AdaptiveSparkPlanExec should always use the 
context.session
 Key: SPARK-47285
 URL: https://issues.apache.org/jira/browse/SPARK-47285
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45387) Partition key filter cannot be pushed down when using cast

2024-03-05 Thread TianyiMa (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823524#comment-17823524
 ] 

TianyiMa commented on SPARK-45387:
--

[~doki] the output execution plan is the final result, but the problem lies in 
the optimization process.

In your example, the partition key is StringType but is cast to int to filter 
partitions. The driver will fetch all of the partitions to evaluate this filter. If you 
have a Hive table with thousands of partitions, this process will be very slow and 
costly.
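
To make the behavior described above concrete, here is a small sketch. The table and data are hypothetical; only the dt = '123' vs dt = 123 contrast comes from the ticket.
{code:scala}
// Hedged sketch of the partition-pruning difference described above.
// table_pt is a hypothetical Hive-partitioned table with a string partition column.
spark.sql("CREATE TABLE table_pt (v INT) PARTITIONED BY (dt STRING) STORED AS PARQUET")

// Literal matches the partition column type: the predicate can be sent to the
// metastore, so only the matching partition's metadata is fetched.
spark.sql("SELECT * FROM table_pt WHERE dt = '123'").show()

// Integer literal wraps the partition column in a cast (cast(dt AS int) = 123):
// the filter no longer looks like a plain partition predicate, so the driver
// fetches every partition's metadata before filtering.
spark.sql("SELECT * FROM table_pt WHERE dt = 123").show()
{code}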

> Partition key filter cannot be pushed down when using cast
> --
>
> Key: SPARK-45387
> URL: https://issues.apache.org/jira/browse/SPARK-45387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0
>Reporter: TianyiMa
>Priority: Critical
> Attachments: PruneFileSourcePartitions.diff
>
>
> Suppose we have a partitioned table `table_pt` with partition column `dt` 
> which is StringType and whose metadata is managed by the Hive Metastore. If 
> we filter partitions by dt = '123', this filter can be pushed down to the data 
> source, but if the filter condition is a number, e.g. dt = 123, it cannot be 
> pushed down. This causes Spark to pull all of the table's partition metadata 
> to the client, which performs poorly if the table has thousands of partitions 
> and increases the risk of a Hive Metastore OOM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44259) Make `connect-jvm-client` module pass except arrow-related ones in Java 21

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44259:
---
Labels: pull-request-available  (was: )

> Make `connect-jvm-client` module pass except arrow-related ones in Java 21
> --
>
> Key: SPARK-44259
> URL: https://issues.apache.org/jira/browse/SPARK-44259
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2024-03-05 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823512#comment-17823512
 ] 

Asif commented on SPARK-33152:
--

Other than using my PR, the safe option would be to disable the constraint 
propagation rule via a SQL conf, though that would mean losing optimizations 
related to pushing new filters down to the other side of join legs, etc.
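
For anyone hitting this, a minimal sketch of that workaround is below; the conf key `spark.sql.constraintPropagation.enabled` is the long-standing flag for this rule, but verify it against the Spark version you run. The tables in the query are hypothetical.
{code:scala}
// Hedged sketch of the workaround: disable constraint propagation entirely.
// Trade-off: inferred IsNotNull filters and filters pushed to the other side
// of a join are lost, as noted above.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")

// t1 and t2 are hypothetical tables; with the rule disabled, the optimizer no
// longer enumerates constraint/alias combinations, keeping compile time bounded.
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id WHERE t1.id > 0").explain(true)
{code}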

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> ---
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing new algorithm to create, store and use constraints for removing 
> redundant filters & inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases( with certain use cases the compilation time can go into 
> hours), potential to cause OOM, may miss removing redundant filters in 
> different scenarios, may miss creating IsNotNull constraints in different 
> scenarios, does not push compound predicates in Join.
>  # This issue if not fixed can cause OutOfMemory issue or unacceptable query 
> compilation times.
> Have added a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change*
>  # It is more effective in filter pruning as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite 
> where current code is not able to identify the redundant filter in some cases.
>  # It is able to generate a better optimized plan for join queries as it can 
> push compound predicates.
>  # The current logic can miss a lot of possible cases of removing redundant 
> predicates, as it fails to take into account if same attribute or its aliases 
> are repeated multiple times in a complex expression.
>  # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove on the basis of constraint data. In some 
> cases the rule works, just by the virtue of previous rules helping it out to 
> cover the inaccuracy. That the ConstraintPropagation rule & its function of 
> removal of redundant filters & addition of new inferred filters is dependent 
> on the working of some of the other unrelated previous optimizer rules is 
> behaving, is indicative of issues.
>  # It does away with all the EqualNullSafe constraints as this logic does not 
> need those constraints to be created.
>  # There is at least one test in existing ConstraintPropagationSuite which is 
> missing a IsNotNull constraints because the code incorrectly generated a 
> EqualsNullSafeConstraint instead of EqualTo constraint, when using the 
> existing Constraints code. With these changes, the test correctly creates an 
> EqualTo constraint, resulting in an inferred IsNotNull constraint
>  # It does away with the current combinatorial logic of evaluation all the 
> constraints can cause compilation to run into hours or cause OOM. The number 
> of constraints stored is exactly the same as the number of filters encountered
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile time performance, but in some cases can benefit 
> run time characteristics too, like inferring IsNotNull filter or pushing down 
> compound predicates on the join, which currently may get missed/ does not 
> happen , respectively, by the present code.
> h2. Q3. How is it done today, and what are the limits of current practice?
> Current ConstraintsPropagation code, pessimistically tries to generates all 
> the possible combinations of constraints , based on the aliases ( even then 
> it may miss a lot of combinations if the expression is a complex expression 
> involving same attribute repeated multiple times within the expression and 
> there are many aliases to that column). There are query plans in our 
> production env, which can result in intermediate number of constraints going 
> into hundreds of thousands, causing OOM or taking time running into hours. 
> Also there are cases where it incorrectly generates an EqualNullSafe 
> constraint instead of EqualTo constraint , thus missing a possible IsNull 
> constraint on column. 
> Also it only pushes single column predic

[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2024-03-05 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823510#comment-17823510
 ] 

Asif commented on SPARK-33152:
--

[~tedjenks] Unfortunately I am not a committer. As part of Workday, I opened this 
Jira and a PR that fixes this issue completely, which required different logic. The 
changes are extensive and were never reviewed or discussed by the OSS community. The 
PR has been in production at Workday for the past 3 years.

As to why a check was not added, etc.:

That would be unclean and is not easy to implement in the current codebase either, 
because it would result in various other issues, such as new or wrong filters being 
inferred and other messy bugs; the constraint code is sensitive to the constraints 
coming from each node below and the constraints available at the current node when 
deciding whether to create new filters.

Constraints are created per operator node (project, filter, etc.), so arbitrarily 
putting a limit on the constraints at a given operator would impact which new 
filters get created.

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> ---
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing new algorithm to create, store and use constraints for removing 
> redundant filters & inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases( with certain use cases the compilation time can go into 
> hours), potential to cause OOM, may miss removing redundant filters in 
> different scenarios, may miss creating IsNotNull constraints in different 
> scenarios, does not push compound predicates in Join.
>  # This issue if not fixed can cause OutOfMemory issue or unacceptable query 
> compilation times.
> Have added a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change*
>  # It is more effective in filter pruning as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite 
> where current code is not able to identify the redundant filter in some cases.
>  # It is able to generate a better optimized plan for join queries as it can 
> push compound predicates.
>  # The current logic can miss a lot of possible cases of removing redundant 
> predicates, as it fails to take into account if same attribute or its aliases 
> are repeated multiple times in a complex expression.
>  # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove on the basis of constraint data. In some 
> cases the rule works, just by the virtue of previous rules helping it out to 
> cover the inaccuracy. That the ConstraintPropagation rule & its function of 
> removal of redundant filters & addition of new inferred filters is dependent 
> on the working of some of the other unrelated previous optimizer rules is 
> behaving, is indicative of issues.
>  # It does away with all the EqualNullSafe constraints as this logic does not 
> need those constraints to be created.
>  # There is at least one test in existing ConstraintPropagationSuite which is 
> missing a IsNotNull constraints because the code incorrectly generated a 
> EqualsNullSafeConstraint instead of EqualTo constraint, when using the 
> existing Constraints code. With these changes, the test correctly creates an 
> EqualTo constraint, resulting in an inferred IsNotNull constraint
>  # It does away with the current combinatorial logic of evaluation all the 
> constraints can cause compilation to run into hours or cause OOM. The number 
> of constraints stored is exactly the same as the number of filters encountered
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile time performance, but in some cases can benefit 
> run time characteristics too, like inferring IsNotNull filter or pushing down 
> compound predicates on the join, which currently may get missed/ does not 
> happen , respectively, by the present code.
> h2. Q3. How is it done today, and what are the limits of current practice?
> Current ConstraintsPropagation code, pessimistically tries to gene

[jira] [Updated] (SPARK-47284) We should ensure enough parallelism when ShuffleExchangeLike join with specs without shuffle

2024-03-05 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated SPARK-47284:
---
Description: 
The following case is introduced by 
https://issues.apache.org/jira/browse/SPARK-35703

// When choosing specs, we should consider those children with no 
`ShuffleExchangeLike` node
// first. For instance, if we have:
// A: (No_Exchange, 100) <---> B: (Exchange, 120)
// it's better to pick A and change B to (Exchange, 100) instead of picking B 
and insert a
// new shuffle for A.


*But we'd better improve it in some cases, for example:*
A: (No_Exchange, 2) <---> B: (Exchange, 100)

The current logic will change to:
A: (No_Exchange, 2) <---> B: (Exchange,2)

It actually does not ensure enough parallelism, and I think it will reduce 
performance.

  was:
The following case is introduced by 
https://issues.apache.org/jira/browse/SPARK-35703


// When choosing specs, we should consider those children with no 
`ShuffleExchangeLike` node
// first. For instance, if we have:
// A: (No_Exchange, 100) <---> B: (Exchange, 120)
// it's better to pick A and change B to (Exchange, 100) instead of picking B 
and insert a
// new shuffle for A.



But we'd better improve it in some cases, for example:
A: (No_Exchange, 2) <---> B: (Exchange, 100)


The current logic will change to:
A: (No_Exchange, 2) <---> B: (Exchange,2)

It actually does not ensure enough parallelism, and I think it will reduce 
performance.


> We should ensure enough parallelism when ShuffleExchangeLike join with specs 
> without shuffle
> 
>
> Key: SPARK-47284
> URL: https://issues.apache.org/jira/browse/SPARK-47284
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Qi Zhu
>Priority: Major
>
> The following case is introduced by 
> https://issues.apache.org/jira/browse/SPARK-35703
> // When choosing specs, we should consider those children with no 
> `ShuffleExchangeLike` node
> // first. For instance, if we have:
> // A: (No_Exchange, 100) <---> B: (Exchange, 120)
> // it's better to pick A and change B to (Exchange, 100) instead of picking B 
> and insert a
> // new shuffle for A.
> *But we'd better improve it in some cases, for example:*
> A: (No_Exchange, 2) <---> B: (Exchange, 100)
> The current logic will change to:
> A: (No_Exchange, 2) <---> B: (Exchange,2)
> It actually does not ensure enough parallelism, and I think it will reduce 
> performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47284) We should ensure enough parallelism when ShuffleExchangeLike join with specs without shuffle

2024-03-05 Thread Qi Zhu (Jira)
Qi Zhu created SPARK-47284:
--

 Summary: We should ensure enough parallelism when 
ShuffleExchangeLike join with specs without shuffle
 Key: SPARK-47284
 URL: https://issues.apache.org/jira/browse/SPARK-47284
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Qi Zhu


The following case is introduced by 
https://issues.apache.org/jira/browse/SPARK-35703


// When choosing specs, we should consider those children with no 
`ShuffleExchangeLike` node
// first. For instance, if we have:
// A: (No_Exchange, 100) <---> B: (Exchange, 120)
// it's better to pick A and change B to (Exchange, 100) instead of picking B 
and insert a
// new shuffle for A.



But we'd better improve it in some cases, for example:
A: (No_Exchange, 2) <---> B: (Exchange, 100)


The current logic will change to:
A: (No_Exchange, 2) <---> B: (Exchange,2)

It actually does not ensure enough parallelism, and I think it will reduce 
performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47277:
--

Assignee: (was: Apache Spark)

> PySpark util function assertDataFrameEqual should not support streaming DF
> --
>
> Key: SPARK-47277
> URL: https://issues.apache.org/jira/browse/SPARK-47277
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, SQL, Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47283) Remove Spark version drop down to the PySpark doc site

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47283:
--

Assignee: (was: Apache Spark)

> Remove Spark version drop down to the PySpark doc site
> --
>
> Key: SPARK-47283
> URL: https://issues.apache.org/jira/browse/SPARK-47283
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47283) Remove Spark version drop down to the PySpark doc site

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47283:
--

Assignee: Apache Spark

> Remove Spark version drop down to the PySpark doc site
> --
>
> Key: SPARK-47283
> URL: https://issues.apache.org/jira/browse/SPARK-47283
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47277:
--

Assignee: Apache Spark

> PySpark util function assertDataFrameEqual should not support streaming DF
> --
>
> Key: SPARK-47277
> URL: https://issues.apache.org/jira/browse/SPARK-47277
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, SQL, Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Liu
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47283) Remove Spark version drop down to the PySpark doc site

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47283:
---
Labels: pull-request-available  (was: )

> Remove Spark version drop down to the PySpark doc site
> --
>
> Key: SPARK-47283
> URL: https://issues.apache.org/jira/browse/SPARK-47283
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47277:
--

Assignee: (was: Apache Spark)

> PySpark util function assertDataFrameEqual should not support streaming DF
> --
>
> Key: SPARK-47277
> URL: https://issues.apache.org/jira/browse/SPARK-47277
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, SQL, Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47277:
--

Assignee: Apache Spark

> PySpark util function assertDataFrameEqual should not support streaming DF
> --
>
> Key: SPARK-47277
> URL: https://issues.apache.org/jira/browse/SPARK-47277
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, SQL, Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Liu
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2024-03-05 Thread Ted Chester Jenks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823506#comment-17823506
 ] 

Ted Chester Jenks commented on SPARK-33152:
---

[~ashahid7] I see. This is very painful for us because a bunch of builds that 
used to work are now hanging indefinitely. Is there any reason there was never 
a check added to see if the set of constraints was getting too large, or 
perhaps an optional config you could use to set a max number of constraints?

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> ---
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing new algorithm to create, store and use constraints for removing 
> redundant filters & inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases( with certain use cases the compilation time can go into 
> hours), potential to cause OOM, may miss removing redundant filters in 
> different scenarios, may miss creating IsNotNull constraints in different 
> scenarios, does not push compound predicates in Join.
>  # This issue if not fixed can cause OutOfMemory issue or unacceptable query 
> compilation times.
> Have added a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change*
>  # It is more effective in filter pruning as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite 
> where current code is not able to identify the redundant filter in some cases.
>  # It is able to generate a better optimized plan for join queries as it can 
> push compound predicates.
>  # The current logic can miss a lot of possible cases of removing redundant 
> predicates, as it fails to take into account if same attribute or its aliases 
> are repeated multiple times in a complex expression.
>  # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove on the basis of constraint data. In some 
> cases the rule works, just by the virtue of previous rules helping it out to 
> cover the inaccuracy. That the ConstraintPropagation rule & its function of 
> removal of redundant filters & addition of new inferred filters is dependent 
> on the working of some of the other unrelated previous optimizer rules is 
> behaving, is indicative of issues.
>  # It does away with all the EqualNullSafe constraints as this logic does not 
> need those constraints to be created.
>  # There is at least one test in existing ConstraintPropagationSuite which is 
> missing a IsNotNull constraints because the code incorrectly generated a 
> EqualsNullSafeConstraint instead of EqualTo constraint, when using the 
> existing Constraints code. With these changes, the test correctly creates an 
> EqualTo constraint, resulting in an inferred IsNotNull constraint
>  # It does away with the current combinatorial logic of evaluation all the 
> constraints can cause compilation to run into hours or cause OOM. The number 
> of constraints stored is exactly the same as the number of filters encountered
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile time performance, but in some cases can benefit 
> run time characteristics too, like inferring IsNotNull filter or pushing down 
> compound predicates on the join, which currently may get missed/ does not 
> happen , respectively, by the present code.
> h2. Q3. How is it done today, and what are the limits of current practice?
> Current ConstraintsPropagation code, pessimistically tries to generates all 
> the possible combinations of constraints , based on the aliases ( even then 
> it may miss a lot of combinations if the expression is a complex expression 
> involving same attribute repeated multiple times within the expression and 
> there are many aliases to that column). There are query plans in our 
> production env, which can result in intermediate number of constraints going 
> into hundreds of thousands, causing OOM or taking time running into hours. 
> Also there are cases where it incorrectly generates an EqualNullSafe 
> constraint instead of Equal

[jira] [Created] (SPARK-47283) Remove Spark version drop down to the PySpark doc site

2024-03-05 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-47283:
---

 Summary: Remove Spark version drop down to the PySpark doc site
 Key: SPARK-47283
 URL: https://issues.apache.org/jira/browse/SPARK-47283
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 3.5.1, 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47277:
--

Assignee: (was: Apache Spark)

> PySpark util function assertDataFrameEqual should not support streaming DF
> --
>
> Key: SPARK-47277
> URL: https://issues.apache.org/jira/browse/SPARK-47277
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, SQL, Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47277:
--

Assignee: Apache Spark

> PySpark util function assertDataFrameEqual should not support streaming DF
> --
>
> Key: SPARK-47277
> URL: https://issues.apache.org/jira/browse/SPARK-47277
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, SQL, Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Liu
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47277:
--

Assignee: (was: Apache Spark)

> PySpark util function assertDataFrameEqual should not support streaming DF
> --
>
> Key: SPARK-47277
> URL: https://issues.apache.org/jira/browse/SPARK-47277
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, SQL, Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47281) Update the `versions. json` file for the already released Spark version

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47281:
--

Assignee: (was: Apache Spark)

> Update the `versions. json` file for the already released Spark version
> ---
>
> Key: SPARK-47281
> URL: https://issues.apache.org/jira/browse/SPARK-47281
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47277:
--

Assignee: Apache Spark

> PySpark util function assertDataFrameEqual should not support streaming DF
> --
>
> Key: SPARK-47277
> URL: https://issues.apache.org/jira/browse/SPARK-47277
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, SQL, Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Liu
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47281) Update the `versions. json` file for the already released Spark version

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47281:
--

Assignee: Apache Spark

> Update the `versions. json` file for the already released Spark version
> ---
>
> Key: SPARK-47281
> URL: https://issues.apache.org/jira/browse/SPARK-47281
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47281) Update the `versions. json` file for the already released Spark version

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47281:
--

Assignee: Apache Spark

> Update the `versions. json` file for the already released Spark version
> ---
>
> Key: SPARK-47281
> URL: https://issues.apache.org/jira/browse/SPARK-47281
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47281) Update the `versions. json` file for the already released Spark version

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47281:
--

Assignee: (was: Apache Spark)

> Update the `versions. json` file for the already released Spark version
> ---
>
> Key: SPARK-47281
> URL: https://issues.apache.org/jira/browse/SPARK-47281
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47148) Avoid to materialize AQE ExchangeQueryStageExec on the cancellation

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47148:
--

Assignee: (was: Apache Spark)

> Avoid to materialize AQE ExchangeQueryStageExec on the cancellation
> ---
>
> Key: SPARK-47148
> URL: https://issues.apache.org/jira/browse/SPARK-47148
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL
>Affects Versions: 4.0.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: pull-request-available
>
> AQE can materialize both *ShuffleQueryStage* and *BroadcastQueryStage* on 
> cancellation. This causes unnecessary stage materialization by submitting a 
> shuffle job and a broadcast job. Under normal circumstances, if the stage is 
> not yet materialized (i.e. *ShuffleQueryStage.shuffleFuture* or 
> *{{BroadcastQueryStage.broadcastFuture}}* is not initialized yet), it should 
> just be skipped without materializing it.
> Please find a sample use case below.
> *1- Stage Materialization Steps:*
> When stage materialization fails:
> {code:java}
> 1.1- ShuffleQueryStage1 - is materialized successfully,
> 1.2- ShuffleQueryStage2 - materialization fails,
> 1.3- ShuffleQueryStage3 - not materialized yet, so 
> ShuffleQueryStage3.shuffleFuture is not initialized yet{code}
> *2- Stage Cancellation Steps:*
> {code:java}
> 2.1- ShuffleQueryStage1 - is canceled because it is already materialized,
> 2.2- ShuffleQueryStage2 - is the early-failed stage, so AQE currently skips it 
> by default because it could not be materialized,
> 2.3- ShuffleQueryStage3 - the problem is here: this stage is not materialized 
> yet, but it is currently also cancelled, and cancelling it requires 
> materializing it first.{code}
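> A rough sketch of the intended skip logic, using hypothetical names rather than 
> Spark's actual QueryStageExec internals:
> {code:scala}
> // Sketch only: cancel a stage only if its materialization was actually kicked
> // off; otherwise touching the lazy shuffle/broadcast future just to cancel it
> // would submit an unnecessary job.
> trait CancellableStage {
>   def materializationStarted: Boolean // hypothetical guard flag
>   def cancel(): Unit                  // cancels an in-flight shuffle/broadcast job
> }
> 
> def cancelStageSafely(stage: CancellableStage): Unit = {
>   if (stage.materializationStarted) {
>     stage.cancel()
>   }
>   // else: nothing was submitted, so there is nothing to cancel
> }
> {code}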



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47282) 'parseTableIdentifier' fails when a catalog name is provided

2024-03-05 Thread Denis Tarima (Jira)
Denis Tarima created SPARK-47282:


 Summary: 'parseTableIdentifier' fails when a catalog name is 
provided
 Key: SPARK-47282
 URL: https://issues.apache.org/jira/browse/SPARK-47282
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0, 4.0.0
Reporter: Denis Tarima


{code:scala}
spark.sessionState.sqlParser.parseTableIdentifier(
  "`my catalog`.`my database`.`my table`"
)
{code}
fails with
{code:scala}
org.apache.spark.sql.catalyst.parser.ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near '.'.(line 1, pos 26)

== SQL ==
`my catalog`.`my database`.`my table`
--^^^

at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:257)
at 
org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:98)
at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:54)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableIdentifier(AbstractSqlParser.scala:41)
{code}
 

Note: It works as expected on Databricks clusters (verified with Spark 3.3.2 
and 3.5.0).
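
As a point of comparison (not part of this report), {{parseMultipartIdentifier}} 
accepts catalog-qualified names; a minimal sketch, assuming an active SparkSession 
named `spark`:
{code:scala}
// Returns Seq("my catalog", "my database", "my table") instead of throwing.
val parts = spark.sessionState.sqlParser
  .parseMultipartIdentifier("`my catalog`.`my database`.`my table`")
{code}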



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string

2024-03-05 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-47177:
--
Fix Version/s: 3.4.3

> Cached SQL plan do not display final AQE plan in explain string
> ---
>
> Key: SPARK-47177
> URL: https://issues.apache.org/jira/browse/SPARK-47177
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2
>Reporter: Ziqi Liu
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>
> An AQE plan is expected to display the final plan after execution. This is not 
> true for a cached SQL plan: it will show the initial plan instead. This behavior 
> change was introduced in [https://github.com/apache/spark/pull/40812], which tried 
> to fix a concurrency issue with cached plans. 
> *In short, the plan used for execution and the plan used for explain are not the 
> same instance, which causes the inconsistency.*
>  
> I don't have a clear idea how to fix this yet:
>  * maybe we just take a coarse-granularity lock in explain?
>  * make innerChildren a function: clone the initial plan, and each time check 
> whether the original AQE plan is finalized (making the final flag atomic first, 
> of course); if not, return the cloned initial plan, and if it is finalized, clone 
> the final plan and return that one. This still won't reflect the AQE plan in real 
> time in a concurrent situation, but at least we have an initial version and a 
> final version (see the sketch after the repro below).
>  
> A simple repro:
> {code:java}
> d1 = spark.range(1000).withColumn("key", expr("id % 
> 100")).groupBy("key").agg({"key": "count"})
> cached_d2 = d1.cache()
> df = cached_d2.filter("key > 10")
> df.collect() {code}
> {code:java}
> >>> df.explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=true
> +- == Final Plan ==
>    *(1) Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- TableCacheQueryStage 0
>       +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), 
> (key#4L > 10)]
>             +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                   +- AdaptiveSparkPlan isFinalPlan=false
>                      +- HashAggregate(keys=[key#4L], 
> functions=[count(key#4L)])
>                         +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                            +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                               +- Project [(id#2L % 100) AS key#4L]
>                                  +- Range (0, 1000, step=1, splits=10)
> +- == Initial Plan ==
>    Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L 
> > 10)]
>          +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                +- AdaptiveSparkPlan isFinalPlan=false
>                   +- HashAggregate(keys=[key#4L], functions=[count(key#4L)])
>                      +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                         +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                            +- Project [(id#2L % 100) AS key#4L]
>                               +- Range (0, 1000, step=1, splits=10){code}
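> A minimal sketch of the second idea above (hypothetical names, not Spark's actual 
> AdaptiveSparkPlanExec code):
> {code:scala}
> import java.util.concurrent.atomic.AtomicReference
> 
> // Stand-in for an AQE plan holder: keeps the initial plan and, once execution
> // finishes, the final plan. With real plan nodes the returned plan would be a
> // clone, so explain never observes a plan that is still being mutated.
> final class AqePlanHolder(initialPlan: String) {
>   private val finalPlan = new AtomicReference[String](null)
> 
>   def markFinal(plan: String): Unit = finalPlan.set(plan)
> 
>   // What innerChildren (used by explain) could return: the final plan once it
>   // is available, otherwise the initial plan. Concurrent explain calls may
>   // still race with finalization, but each call sees a consistent snapshot.
>   def planForExplain: String = Option(finalPlan.get()).getOrElse(initialPlan)
> }
> {code}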



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47281) Update the `versions. json` file for the already released Spark version

2024-03-05 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-47281:

Affects Version/s: 3.5.1

> Update the `versions. json` file for the already released Spark version
> ---
>
> Key: SPARK-47281
> URL: https://issues.apache.org/jira/browse/SPARK-47281
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0, 3.5.1
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47281) Update the `versions. json` file for the already released Spark version

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47281:
---
Labels: pull-request-available  (was: )

> Update the `versions. json` file for the already released Spark version
> ---
>
> Key: SPARK-47281
> URL: https://issues.apache.org/jira/browse/SPARK-47281
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47279:
---
Labels: pull-request-available  (was: )

> spark driver process hangs due to "unable to create new native thread"
> --
>
> Key: SPARK-47279
> URL: https://issues.apache.org/jira/browse/SPARK-47279
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 3.1.1, 3.5.0
>Reporter: TianyiMa
>Priority: Major
>  Labels: pull-request-available
> Attachments: driver_submit_task.png, executor_4.png
>
>
> We encountered a case where the Spark driver hung for about 11 hours and was 
> finally killed by the user. The driver log contains this error: 
> {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
> happened while processing message in the inbox for CoarseGrainedScheduler
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:719)
>         at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>         at 
> org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
>         at 
> org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
>         at 
> org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
>         at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
>         at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>         at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> {quote}
>  
> After detailed analysis, we found that the driver submitted task 0.0 to executor 4 
> at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and executor 4 then 
> sent the result to the driver. But at that moment there was not enough memory on 
> the server running the driver, so the driver was "unable to create new native 
> thread" to handle the successful result of task 0.0; the driver therefore thinks 
> task 0.0 has not finished and waits for the missing result forever.
>  
> the driver submits task 0.0
> !driver_submit_task.png!
>  
> executor 4 task 0.0
> !executor_4.png!
>  
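> A generic illustration of this failure pattern (hypothetical helper, not the 
> actual Spark change): if submitting the result-handling work to a thread pool 
> fails, the status update must not be dropped silently, or the scheduler will wait 
> forever:
> {code:scala}
> import java.util.concurrent.ExecutorService
> 
> // Sketch only: escalate (or retry) instead of swallowing the error when the
> // pool cannot create a thread for the status update.
> def enqueueStatusUpdate(pool: ExecutorService, handleResult: Runnable): Unit = {
>   try {
>     pool.execute(handleResult)
>   } catch {
>     case oom: OutOfMemoryError =>
>       // "unable to create new native thread" lands here; losing this update
>       // leaves the task marked as running forever, so do not just log it.
>       throw oom
>   }
> }
> {code}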



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47281) Update the `versions. json` file for the already released Spark version

2024-03-05 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-47281:
---

 Summary: Update the `versions. json` file for the already released 
Spark version
 Key: SPARK-47281
 URL: https://issues.apache.org/jira/browse/SPARK-47281
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47280) Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47280:
---
Labels: pull-request-available  (was: )

> Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE
> -
>
> Key: SPARK-47280
> URL: https://issues.apache.org/jira/browse/SPARK-47280
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47210) Implicit casting on collated expressions

2024-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47210:
---
Labels: pull-request-available  (was: )

> Implicit casting on collated expressions
> 
>
> Key: SPARK-47210
> URL: https://issues.apache.org/jira/browse/SPARK-47210
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47280) Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE

2024-03-05 Thread Kent Yao (Jira)
Kent Yao created SPARK-47280:


 Summary: Remove timezone limitation for ORACLE TIMESTAMP WITH 
TIMEZONE
 Key: SPARK-47280
 URL: https://issues.apache.org/jira/browse/SPARK-47280
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org