[jira] [Updated] (SPARK-47299) Use the same `versions.json` in the dropdown of different versions of PySpark documents
[ https://issues.apache.org/jira/browse/SPARK-47299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47299: --- Labels: pull-request-available (was: ) > Use the same `versions.json` in the dropdown of different versions of > PySpark documents > > > Key: SPARK-47299 > URL: https://issues.apache.org/jira/browse/SPARK-47299 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47299) Use the same `versions.json` in the dropdown of different versions of PySpark documents
BingKun Pan created SPARK-47299: --- Summary: Use the same `versions.json` in the dropdown of different versions of PySpark documents Key: SPARK-47299 URL: https://issues.apache.org/jira/browse/SPARK-47299 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.5.1, 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47298) Upgrade `mysql-connector-j` to `8.3.0` and `mariadb-java-client` to `2.7.12`
[ https://issues.apache.org/jira/browse/SPARK-47298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47298: --- Labels: pull-request-available (was: ) > Upgrade `mysql-connector-j` to `8.3.0` and `mariadb-java-client` to `2.7.12` > > > Key: SPARK-47298 > URL: https://issues.apache.org/jira/browse/SPARK-47298 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47298) Upgrade `mysql-connector-j` to `8.3.0` and `mariadb-java-client` to `2.7.12`
BingKun Pan created SPARK-47298: --- Summary: Upgrade `mysql-connector-j` to `8.3.0` and `mariadb-java-client` to `2.7.12` Key: SPARK-47298 URL: https://issues.apache.org/jira/browse/SPARK-47298 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47297) regular expressions
Uroš Bojanić created SPARK-47297: Summary: regular expressions Key: SPARK-47297 URL: https://issues.apache.org/jira/browse/SPARK-47297 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Uroš Bojanić -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47293) Build batchSchema with total sparkSchema instead of appending one by one
[ https://issues.apache.org/jira/browse/SPARK-47293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-47293. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45396 [https://github.com/apache/spark/pull/45396] > Build batchSchema with total sparkSchema instead of appending one by one > - > > Key: SPARK-47293 > URL: https://issues.apache.org/jira/browse/SPARK-47293 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Binjie Yang >Assignee: Binjie Yang >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > We can simply init batchSchema with the whole sparkSchema instead of > appending one by one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47293) Build batchSchema with total sparkSchema instead of appending one by one
[ https://issues.apache.org/jira/browse/SPARK-47293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-47293: Assignee: Binjie Yang > Build batchSchema with total sparkSchema instead of appending one by one > - > > Key: SPARK-47293 > URL: https://issues.apache.org/jira/browse/SPARK-47293 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Binjie Yang >Assignee: Binjie Yang >Priority: Minor > Labels: pull-request-available > > We can simply init batchSchema with the whole sparkSchema instead of > appending one by one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47296) fail all unsupported functions
Uroš Bojanić created SPARK-47296: Summary: fail all unsupported functions Key: SPARK-47296 URL: https://issues.apache.org/jira/browse/SPARK-47296 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Uroš Bojanić -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47295) startswith, endswith (non-binary collations)
Uroš Bojanić created SPARK-47295: Summary: startswith, endswith (non-binary collations) Key: SPARK-47295 URL: https://issues.apache.org/jira/browse/SPARK-47295 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Uroš Bojanić -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46835) Join support for strings with collation
[ https://issues.apache.org/jira/browse/SPARK-46835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46835. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45389 [https://github.com/apache/spark/pull/45389] > Join support for strings with collation > --- > > Key: SPARK-46835 > URL: https://issues.apache.org/jira/browse/SPARK-46835 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46835) Join support for strings with collation
[ https://issues.apache.org/jira/browse/SPARK-46835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46835: --- Assignee: Aleksandar Tomic > Join support for strings with collation > --- > > Key: SPARK-46835 > URL: https://issues.apache.org/jira/browse/SPARK-46835 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47294) OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec)
[ https://issues.apache.org/jira/browse/SPARK-47294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-47294. --- Resolution: Not A Problem > OptimizeSkewInRebalanceRepartitions should support > ProjectExec(_,ShuffleQueryStageExec) > --- > > Key: SPARK-47294 > URL: https://issues.apache.org/jira/browse/SPARK-47294 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: angerszhu >Priority: Major > Labels: pull-request-available > > Currently, OptimizeSkewInRebalanceRepartitions only matches > ShuffleQueryStageExec, so it only works for SQL queries and cannot work for an > insert, since there is a project between the ShuffleQueryStageExec and the insert > command: > {code:java} > plan transformUp { > case p @ ProjectExec(_, stage: ShuffleQueryStageExec) if > isSupported(stage.shuffle) => > p.copy(child = tryOptimizeSkewedPartitions(stage)) > case stage: ShuffleQueryStageExec if isSupported(stage.shuffle) => > tryOptimizeSkewedPartitions(stage) > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47294) OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec)
[ https://issues.apache.org/jira/browse/SPARK-47294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47294: --- Labels: pull-request-available (was: ) > OptimizeSkewInRebalanceRepartitions should support > ProjectExec(_,ShuffleQueryStageExec) > --- > > Key: SPARK-47294 > URL: https://issues.apache.org/jira/browse/SPARK-47294 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: angerszhu >Priority: Major > Labels: pull-request-available > > Currently, OptimizeSkewInRebalanceRepartitions only matches > ShuffleQueryStageExec, so it only works for SQL queries and cannot work for an > insert, since there is a project between the ShuffleQueryStageExec and the insert > command: > {code:java} > plan transformUp { > case p @ ProjectExec(_, stage: ShuffleQueryStageExec) if > isSupported(stage.shuffle) => > p.copy(child = tryOptimizeSkewedPartitions(stage)) > case stage: ShuffleQueryStageExec if isSupported(stage.shuffle) => > tryOptimizeSkewedPartitions(stage) > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47294) OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec)
[ https://issues.apache.org/jira/browse/SPARK-47294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-47294: -- Description: Currently, OptimizeSkewInRebalanceRepartitions only matches ShuffleQueryStageExec, so it only works for SQL queries and cannot work for an insert, since there is a project between the ShuffleQueryStageExec and the insert command: {code:java} plan transformUp { case p @ ProjectExec(_, stage: ShuffleQueryStageExec) if isSupported(stage.shuffle) => p.copy(child = tryOptimizeSkewedPartitions(stage)) case stage: ShuffleQueryStageExec if isSupported(stage.shuffle) => tryOptimizeSkewedPartitions(stage) } {code} > OptimizeSkewInRebalanceRepartitions should support > ProjectExec(_,ShuffleQueryStageExec) > --- > > Key: SPARK-47294 > URL: https://issues.apache.org/jira/browse/SPARK-47294 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: angerszhu >Priority: Major > > Currently, OptimizeSkewInRebalanceRepartitions only matches > ShuffleQueryStageExec, so it only works for SQL queries and cannot work for an > insert, since there is a project between the ShuffleQueryStageExec and the insert > command: > {code:java} > plan transformUp { > case p @ ProjectExec(_, stage: ShuffleQueryStageExec) if > isSupported(stage.shuffle) => > p.copy(child = tryOptimizeSkewedPartitions(stage)) > case stage: ShuffleQueryStageExec if isSupported(stage.shuffle) => > tryOptimizeSkewedPartitions(stage) > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47294) OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec)
angerszhu created SPARK-47294: - Summary: OptimizeSkewInRebalanceRepartitions should support ProjectExec(_,ShuffleQueryStageExec) Key: SPARK-47294 URL: https://issues.apache.org/jira/browse/SPARK-47294 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1, 4.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19256) Hive bucketing write support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823851#comment-17823851 ] Shreyas commented on SPARK-19256: - Don't really like spamming in the comments - but this is a much-needed feature for big-data processing, and it has been pending for a while now. Can this be given some love please? :D > Hive bucketing write support > > > Key: SPARK-19256 > URL: https://issues.apache.org/jira/browse/SPARK-19256 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Tejas Patil >Priority: Minor > > Update (2020 by Cheng Su): > We use this JIRA to track progress for Hive bucketing write support in Spark. > The goal is for Spark to write Hive bucketed tables, to be compatible with > other compute engines (Hive and Presto). > > Current status for Hive bucketed tables in Spark: > No support for reading Hive bucketed tables: Spark reads a bucketed table as a > non-bucketed table. > Wrong behavior for writing Hive ORC and Parquet bucketed tables: Spark writes an > ORC/Parquet bucketed table as a non-bucketed table (code path: > InsertIntoHadoopFsRelationCommand -> FileFormatWriter). > Writing Hive non-ORC/Parquet bucketed tables is not allowed: Spark throws an exception > by default when writing a non-ORC/Parquet bucketed table (code path: > InsertIntoHiveTable); the exception can be disabled by setting the configs > `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, which will > write the table as non-bucketed. > > Current status for Hive bucketed tables in Hive: > Hive 3.0.0 and after: supports writing bucketed tables with Hive murmur3hash > (https://issues.apache.org/jira/browse/HIVE-18910). > Hive 1.x.y and 2.x.y: supports writing bucketed tables with Hive hivehash. > Hive on Tez: supports zero or multiple files per bucket > (https://issues.apache.org/jira/browse/HIVE-14014). More code pointers on the > read path - > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212] > . > > Current status for Hive bucketed tables in Presto (take presto-sql here): > Supports writing bucketed tables with Hive murmur3hash and hivehash > ([https://github.com/prestosql/presto/pull/1697]). > Supports zero or multiple files per bucket > ([https://github.com/prestosql/presto/pull/822]). > > TL;DR: the goal is to achieve Hive bucketed table compatibility across Spark, Presto and > Hive. With this JIRA, we need to add support for writing Hive bucketed tables > with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and > 2.x.y). > > To allow Spark to efficiently read Hive bucketed tables, a more radical change is > needed, so we decided to wait until data source v2 supports bucketing and implement > the read path on data source v2. The read path will not be covered by this JIRA. > > Original description (2017 by Tejas Patil): > JIRA to track design discussions and tasks related to Hive bucketing support > in Spark. > Proposal : > [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
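[Editor's note] For context, a minimal sketch of Spark's existing bucketed-write API; table and column names are illustrative. Today this produces Spark-native buckets (hashed with Spark's own Murmur3Hash), which Hive and Presto cannot read as Hive-bucketed tables; producing Hive murmur3hash/hivehash buckets is exactly the gap this JIRA tracks.

{code:scala}
import org.apache.spark.sql.SparkSession

// Spark's native bucketed write (names are made up for illustration).
val spark = SparkSession.builder()
  .appName("bucketing-sketch")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.range(1000).selectExpr("id", "id % 10 AS key")

df.write
  .bucketBy(8, "key")   // 8 buckets on "key", using Spark's Murmur3Hash
  .sortBy("key")        // rows sorted within each bucket
  .saveAsTable("spark_bucketed_t")
{code}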
[jira] [Updated] (SPARK-47293) Build batchSchema with total sparkSchema instead of appending one by one
[ https://issues.apache.org/jira/browse/SPARK-47293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47293: --- Labels: pull-request-available (was: ) > Build batchSchema with total sparkSchema instead of appending one by one > - > > Key: SPARK-47293 > URL: https://issues.apache.org/jira/browse/SPARK-47293 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Binjie Yang >Priority: Minor > Labels: pull-request-available > > We can simply init batchSchema with the whole sparkSchema instead of > appending one by one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47293) Build batchSchema with total sparkSchema instead of appending one by one
Binjie Yang created SPARK-47293: --- Summary: Build batchSchema with total sparkSchema instead of appending one by one Key: SPARK-47293 URL: https://issues.apache.org/jira/browse/SPARK-47293 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Binjie Yang We can simply init batchSchema with the whole sparkSchema instead of appending one by one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
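[Editor's note] A minimal sketch of the simplification described above; the schema and variable names are illustrative, not the actual code under change.

{code:scala}
import org.apache.spark.sql.types._

val sparkSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// Before: append fields one at a time; StructType is immutable, so every
// add() allocates a new schema object.
var batchSchema = new StructType()
for (field <- sparkSchema.fields) {
  batchSchema = batchSchema.add(field)
}

// After: initialize the batch schema from the whole Spark schema in one step.
val batchSchemaDirect = StructType(sparkSchema.fields)
{code}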
[jira] [Resolved] (SPARK-47280) Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE
[ https://issues.apache.org/jira/browse/SPARK-47280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-47280. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45384 [https://github.com/apache/spark/pull/45384] > Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE > - > > Key: SPARK-47280 > URL: https://issues.apache.org/jira/browse/SPARK-47280 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47247) use smaller target size when coalescing partitions with exploding joins
[ https://issues.apache.org/jira/browse/SPARK-47247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47247. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45357 [https://github.com/apache/spark/pull/45357] > use smaller target size when coalescing partitions with exploding joins > --- > > Key: SPARK-47247 > URL: https://issues.apache.org/jira/browse/SPARK-47247 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47247) use smaller target size when coalescing partitions with exploding joins
[ https://issues.apache.org/jira/browse/SPARK-47247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-47247: --- Assignee: Wenchen Fan > use smaller target size when coalescing partitions with exploding joins > --- > > Key: SPARK-47247 > URL: https://issues.apache.org/jira/browse/SPARK-47247 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40763) Should expose driver service name to config for user features
[ https://issues.apache.org/jira/browse/SPARK-40763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-40763: --- Labels: pull-request-available (was: ) > Should expose driver service name to config for user features > - > > Key: SPARK-40763 > URL: https://issues.apache.org/jira/browse/SPARK-40763 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: binjie yang >Priority: Minor > Labels: pull-request-available > > Currently on Kubernetes, a user's feature step, which builds the user's Kubernetes > resources while spark-submit creates the Spark pods, cannot perceive some Spark resource > info, such as the Spark driver service name. > > Users may want to expose some Spark pod info to build their custom resources, > such as an Ingress. > > We want a way to expose the Spark driver service name, which is currently generated > from a clock and a UUID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
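[Editor's note] A hypothetical sketch of what this issue asks for. The config key and class names below are made up for illustration; today the driver service name is generated internally and no such accessor or key exists.

{code:scala}
import io.fabric8.kubernetes.api.model.HasMetadata
import org.apache.spark.deploy.k8s.{KubernetesConf, SparkPod}
import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep

// A user feature step that wants to build an Ingress routing to the driver UI.
class IngressFeatureStep(conf: KubernetesConf) extends KubernetesFeatureConfigStep {
  override def configurePod(pod: SparkPod): SparkPod = pod

  override def getAdditionalKubernetesResources(): Seq[HasMetadata] = {
    // Hypothetical conf key: read the driver service name instead of it
    // being derived from a clock and a UUID out of the step's reach.
    val driverServiceName =
      conf.sparkConf.get("spark.kubernetes.driver.service.name", "")
    // ...build and return an Ingress backed by driverServiceName here...
    Seq.empty
  }
}
{code}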
[jira] [Updated] (SPARK-47146) Possible thread leak when doing sort merge join
[ https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-47146: Fix Version/s: 3.5.2 3.4.3 > Possible thread leak when doing sort merge join > --- > > Key: SPARK-47146 > URL: https://issues.apache.org/jira/browse/SPARK-47146 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: JacobZheng >Assignee: JacobZheng >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > > I have a long-running Spark job. I stumbled upon an executor taking up a lot of > threads, leaving no threads available on the server. Querying thread > details via jstack, there are tons of threads named read-ahead. Checking the > code confirms that these threads are created by ReadAheadInputStream. This > class creates a single-threaded thread pool when initialized: > {code:java} > private final ExecutorService executorService = > ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code} > This thread pool is closed by ReadAheadInputStream#close(). > The call stack for the normal-case close() path is: > {code:java} > ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 > in stage 71.0 (TID > 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230 > @org.apache.spark.io.ReadAheadInputStream.close() > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at > 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > at org.apache.spark.scheduler.Task.run(Task.scala:139) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:829) {code} > As shown in UnsafeSorterSpillReader#close, the stream is only closed when the > data in the stream is read through. > {code:java} > @Override > public void loadNext() throws IOException { > // Kill the task in case it has been marked as killed. This logic is from > // InterruptibleIterator, but we inline it here instead of wrapping the > iterator in order > // to avoid performance overhead. This check is ad
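[Editor's note] To make the leak mode concrete, a minimal sketch (stream contents and buffer sizes are made up): the read-ahead pool is created in the constructor and shut down only by close(), so a stream abandoned before EOF keeps its daemon thread alive.

{code:scala}
import java.io.ByteArrayInputStream
import org.apache.spark.io.ReadAheadInputStream

val in = new ReadAheadInputStream(
  new ByteArrayInputStream(new Array[Byte](1 << 20)), // 1 MiB of dummy data
  1 << 16)                                            // 64 KiB read-ahead buffer
try {
  // A partial read never reaches EOF, so the read-through auto-close in
  // UnsafeSorterSpillReader#loadNext never fires for this stream.
  in.read(new Array[Byte](1024))
} finally {
  in.close() // without this, the "read-ahead" thread outlives the task
}
{code}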
[jira] [Commented] (SPARK-47146) Possible thread leak when doing sort merge join
[ https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823832#comment-17823832 ] Mridul Muralidharan commented on SPARK-47146: - Backported to 3.5 and 3.4 in PR: https://github.com/apache/spark/pull/45390 > Possible thread leak when doing sort merge join > --- > > Key: SPARK-47146 > URL: https://issues.apache.org/jira/browse/SPARK-47146 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: JacobZheng >Assignee: JacobZheng >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > > I have a long-running Spark job. I stumbled upon an executor taking up a lot of > threads, leaving no threads available on the server. Querying thread > details via jstack, there are tons of threads named read-ahead. Checking the > code confirms that these threads are created by ReadAheadInputStream. This > class creates a single-threaded thread pool when initialized: > {code:java} > private final ExecutorService executorService = > ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code} > This thread pool is closed by ReadAheadInputStream#close(). > The call stack for the normal-case close() path is: > {code:java} > ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 > in stage 71.0 (TID > 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230 > @org.apache.spark.io.ReadAheadInputStream.close() > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > at org.apache.spark.scheduler.Task.run(Task.scala:139) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:829) {code} > As shown in UnsafeSorterSpillReader#close, the stream is only closed when the > data in the stream is read through. > {code:java} > @Override > public void loadNext() throws IOException { > // Kill the task in case it has been marked as killed. This logic is from > // InterruptibleIterator, but we inline it here instead of wrap
[jira] [Resolved] (SPARK-47285) AdaptiveSparkPlanExec should always use the context.session
[ https://issues.apache.org/jira/browse/SPARK-47285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-47285. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45388 [https://github.com/apache/spark/pull/45388] > AdaptiveSparkPlanExec should always use the context.session > --- > > Key: SPARK-47285 > URL: https://issues.apache.org/jira/browse/SPARK-47285 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47285) AdaptiveSparkPlanExec should always use the context.session
[ https://issues.apache.org/jira/browse/SPARK-47285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-47285: Assignee: XiDuo You > AdaptiveSparkPlanExec should always use the context.session > --- > > Key: SPARK-47285 > URL: https://issues.apache.org/jira/browse/SPARK-47285 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47292) safeMapToJValue should consider when map is null
Wei Liu created SPARK-47292: --- Summary: safeMapToJValue should consider when map is null Key: SPARK-47292 URL: https://issues.apache.org/jira/browse/SPARK-47292 Project: Spark Issue Type: New Feature Components: Connect, SS Affects Versions: 3.5.1, 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44746) Improve the documentation for TABLE input arguments for UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44746. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45375 [https://github.com/apache/spark/pull/45375] > Improve the documentation for TABLE input arguments for UDTFs > - > > Key: SPARK-44746 > URL: https://issues.apache.org/jira/browse/SPARK-44746 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Daniel >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We should add more examples for using Python UDTFs with TABLE arguments. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44746) Improve the documentation for TABLE input arguments for UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44746: Assignee: Daniel > Improve the documentation for TABLE input arguments for UDTFs > - > > Key: SPARK-44746 > URL: https://issues.apache.org/jira/browse/SPARK-44746 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Daniel >Priority: Major > Labels: pull-request-available > > We should add more examples for using Python UDTFs with TABLE arguments. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
[ https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47277. -- Fix Version/s: 4.0.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/45380 > PySpark util function assertDataFrameEqual should not support streaming DF > -- > > Key: SPARK-47277 > URL: https://issues.apache.org/jira/browse/SPARK-47277 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL, Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: Wei Liu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47291) Add Parquet file reader metrics to scan
Parth Chandra created SPARK-47291: - Summary: Add Parquet file reader metrics to scan Key: SPARK-47291 URL: https://issues.apache.org/jira/browse/SPARK-47291 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Parth Chandra With the addition of external metrics support in Parquet ([PARQUET-2374|https://issues.apache.org/jira/browse/PARQUET-2374]), it is now possible to gather file-level scan metrics and have them displayed in a Parquet Scan's metrics. This can be done for both DSV1 and DSV2 implementations by providing a ParquetMetricsCallback implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47290) Extend CustomTaskMetric to allow metric values from multiple sources
Parth Chandra created SPARK-47290: - Summary: Extend CustomTaskMetric to allow metric values from multiple sources Key: SPARK-47290 URL: https://issues.apache.org/jira/browse/SPARK-47290 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Parth Chandra Custom task metrics allow a DSV2 source to add metrics that can be displayed in the UI. However, for DSV2 file sources, the FilePartitionReader may have multiple file readers, and each of these may report its own metrics, which need to be aggregated and bubbled up to the Scan. There is currently no way to update a metric value. A new interface that extends CustomTaskMetric and defines a way to allow updates would let a DSV2 file scan implementation overcome this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
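[Editor's note] A hypothetical sketch of such an interface; the names are illustrative, not a proposed Spark API.

{code:scala}
import org.apache.spark.sql.connector.metric.CustomTaskMetric

// A task metric whose value can be accumulated as FilePartitionReader moves
// from one file reader to the next within the same partition.
trait UpdatableTaskMetric extends CustomTaskMetric {
  def update(delta: Long): Unit
}

class AggregatedFileMetric(metricName: String) extends UpdatableTaskMetric {
  private var total = 0L
  override def name(): String = metricName
  override def value(): Long = total
  override def update(delta: Long): Unit = total += delta
}
{code}

Each per-file reader would then call update(...) as it finishes, and the scan reports the aggregated value() when the task completes.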
[jira] [Comment Edited] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823510#comment-17823510 ] Asif edited comment on SPARK-33152 at 3/5/24 6:43 PM: -- [~tedjenks] .. Unfortunately I am not a committer. As part of Workday, I opened this Jira and a PR to fix this issue completely, which required different logic. The changes are extensive and they were never reviewed or discussed by the OS community. This PR has been in production for the past 3 years at Workday. As to why a check is not added, etc.: that would be unclean and is not easy to implement in the current codebase, because it would result in various other issues, like new redundant filters being inferred and other messy bugs, as the constraint code is sensitive to the constraints coming from each node below and the constraints available at the current node when deciding whether to create new filters or not. Constraints are created per operator node (project, filter, etc.), and arbitrarily putting a limit on constraints at a given operator will impact the new filters being created. was (Author: ashahid7): [~tedjenks] .. Unfortunately I am not a committer. As part of Workday, I opened this Jira and a PR to fix this issue completely, which required different logic. The changes are extensive and they were never reviewed or discussed by the OS community. This PR has been in production for the past 3 years at Workday. As to why a check is not added, etc.: that would be unclean and is not easy to implement in the current codebase, because it would result in various other issues, like new/wrong filters being inferred and other messy bugs, as the constraint code is sensitive to the constraints coming from each node below and the constraints available at the current node when deciding whether to create new filters or not. Constraints are created per operator node (project, filter, etc.), and arbitrarily putting a limit on constraints at a given operator will impact the new filters being created. > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing a new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases (with certain use cases the compilation time can go into > hours), has the potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, and does not push compound predicates into Join. > # This issue, if not fixed, can cause OutOfMemory issues or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning, as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, > where the current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account whether the same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove them on the basis of constraint data. In some > cases the rule works just by virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule & its function of > removing redundant filters & adding new inferred filters depends > on the workings of other unrelated previous optimizer rules is indicative of issues. > # It does away with all the EqualNullSafe constraints, as this logic does not > need those constraints to be created. > # There is at least one test in the existing ConstraintPropagationSuite which is > missing an IsNotNull constraint beca
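[Editor's note] To make the alias blow-up concrete, a small made-up example (not from the SPIP's PR):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]").appName("constraints").getOrCreate()

// Each alias of "id" takes part in the inferred constraint set
// (a > 1, b > 1, c > 1, plus alias-equality constraints); with complex
// CASE WHEN expressions over many aliases the tracked set grows
// combinatorially, which is the compile-time blow-up this SPIP targets.
val df = spark.range(10)
  .select(col("id").as("a"), col("id").as("b"), col("id").as("c"))
  .filter(col("a") > 1)
df.queryExecution.optimizedPlan // constraints are computed per operator node
{code}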
[jira] [Created] (SPARK-47289) Allow extensions to log extended information in explain plan
Parth Chandra created SPARK-47289: - Summary: Allow extensions to log extended information in explain plan Key: SPARK-47289 URL: https://issues.apache.org/jira/browse/SPARK-47289 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Parth Chandra With session extensions, Spark planning can be extended to apply additional rules and modify the execution plan. If an extension replaces a node in the plan, the new node will be displayed in the plan. However, it is sometimes useful for extensions to provide extended information to the end user explaining the impact of the extension. For instance, an extension may automatically enable/disable some feature that it provides and can report this extended information in the plan. The proposal is to optionally turn on extended plan information from extensions. Extensions can add additional planning information via a new interface that internally uses a new TreeNodeTag, say 'explainPlan'. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
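[Editor's note] A hypothetical sketch of how such a tag could work; only the tag name "explainPlan" comes from the proposal, the rest of the names are illustrative.

{code:scala}
import org.apache.spark.sql.catalyst.trees.TreeNodeTag
import org.apache.spark.sql.execution.SparkPlan

object ExtensionExplainInfo {
  // The new tag the proposal mentions, keyed per plan node.
  val ExplainPlanTag: TreeNodeTag[String] = TreeNodeTag[String]("explainPlan")

  // An extension rule attaches a note to a node it replaced or configured.
  def annotate(plan: SparkPlan, note: String): Unit =
    plan.setTagValue(ExplainPlanTag, note)

  // EXPLAIN rendering reads the note back when extended info is enabled.
  def lookup(plan: SparkPlan): Option[String] =
    plan.getTagValue(ExplainPlanTag)
}
{code}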
[jira] [Resolved] (SPARK-47033) EXECUTE IMMEDIATE USING does not recognize session variable names
[ https://issues.apache.org/jira/browse/SPARK-47033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47033. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45293 [https://github.com/apache/spark/pull/45293] > EXECUTE IMMEDIATE USING does not recognize session variable names > - > > Key: SPARK-47033 > URL: https://issues.apache.org/jira/browse/SPARK-47033 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {noformat} > DECLARE parm = 'Hello'; > EXECUTE IMMEDIATE 'SELECT :parm' USING parm; > [ALL_PARAMETERS_MUST_BE_NAMED] Using name parameterized queries requires all > parameters to be named. Parameters missing names: "parm". SQLSTATE: 07001 > EXECUTE IMMEDIATE 'SELECT :parm' USING parm AS parm; > Hello > {noformat} > Variables are like column references: they act as their own aliases and thus > should not be required to be named to associate with a named parameter of the > same name. > Note that unlike pySpark, this should be case-insensitive (haven't > verified). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43883) CTAS Command Nodes Prevent Some Optimizer Rules From Running
[ https://issues.apache.org/jira/browse/SPARK-43883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Chester Jenks resolved SPARK-43883. --- Resolution: Won't Fix > CTAS Command Nodes Prevent Some Optimizer Rules From Running > > > Key: SPARK-43883 > URL: https://issues.apache.org/jira/browse/SPARK-43883 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: Ted Chester Jenks >Priority: Major > Attachments: Not Working - Create Table.png, Working - 3.2.0.png, > Working - No Create Table.png > > > The changes introduced to resolve SPARK-41713 in > [https://github.com/apache/spark/pull/39220] modified the CTAS commands from > having a `DataWritingCommand` trait to a `LeafRunnableCommand` trait. The > `DataWritingCommand` trait extends `UnaryCommand`, and has children set to > the value of query in the CTAS command. This means that when `transform` is > called to traverse the tree with the CTAS command at the root, the entire > query is traversed. `LeafRunnableCommand` has a `LeafLike` trait which > explicitly sets the value of children to `Nil`. This means that when > `transform` is called on the command, no children are found and the query is > unaffected by the rule. > In practice, this means that optimizer rules that rely on `transform` (such > as `BooleanSimplification`) to traverse the tree do not work with a CTAS. > This can be demonstrated with a simple query in spark-shell. Without the CTAS > we can run a command with an easily simplified boolean expression (`id == 9 > && id == 9`) and see it gets optimized out: > !Working - No Create Table.png|width=883,height=342! > With a CTAS, the optimisation does not get applied (as we can see from the > `AND` still present in the optimized and physical plans): > !Not Working - Create Table.png|width=885,height=524! > This works in 3.2.0 which had the old CTAS implementation: > !Working - 3.2.0.png|width=885,height=345! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47288) DataType __repr__ change breaks datatype checking (anti-)pattern
[ https://issues.apache.org/jira/browse/SPARK-47288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823717#comment-17823717 ] Ted Chester Jenks commented on SPARK-47288: --- [~gurwls223] I saw you on the original PR, curious for your thoughts. > DataType __repr__ change breaks datatype checking (anti-)pattern > > > Key: SPARK-47288 > URL: https://issues.apache.org/jira/browse/SPARK-47288 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Ted Chester Jenks >Priority: Major > > This PR: [https://github.com/apache/spark/pull/34320] > made reprs for datatypes eval-able. This is kind of nice, but we have a ton of > users doing stuff like: > > {code:java} > if str(data_type) == "StringType": > ... > {code} > > Which breaks. > > What would people think of adding a __str__ to the base class that returns > the old behaviour, so we can have the best of both worlds? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47288) DataType __repr__ change breaks datatype checking (anti-)pattern
Ted Chester Jenks created SPARK-47288: - Summary: DataType __repr__ change breaks datatype checking (anti-)pattern Key: SPARK-47288 URL: https://issues.apache.org/jira/browse/SPARK-47288 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1 Reporter: Ted Chester Jenks This PR: [https://github.com/apache/spark/pull/34320] made reprs for datatypes eval-able. This is kind of nice, but we have a ton of users doing stuff like: {code:java} if str(data_type) == "StringType": ... {code} Which breaks. What would people think of adding a __str__ to the base class that returns the old behaviour, so we can have the best of both worlds? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823713#comment-17823713 ] Nicholas Chammas commented on SPARK-46810: -- [~cloud_fan], [~LuciferYang], [~beliefer], and [~dongjoon] - Friendly ping. Any thoughts on how to resolve the inconsistent error terminology? > Clarify error class terminology > --- > > Key: SPARK-46810 > URL: https://issues.apache.org/jira/browse/SPARK-46810 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > > We use inconsistent terminology when talking about error classes. I'd like to > get some clarity on that before contributing any potential improvements to > this part of the documentation. > Consider > [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. > It has several key pieces of hierarchical information that have inconsistent > names throughout our documentation and codebase: > * 42 > ** K01 > *** INCOMPLETE_TYPE_DEFINITION > ARRAY > MAP > STRUCT > What are the names of these different levels of information? > Some examples of inconsistent terminology: > * [Over > here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] > we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION > we call that an "error class". So what exactly is a class, the 42 or the > INCOMPLETE_TYPE_DEFINITION? > * [Over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] > we call K01 the "subclass". But [over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] > we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for > INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". > So what exactly is a subclass? > * [On this > page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] > we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other > places we refer to it as an "error class". > I don't think we should leave this status quo as-is. I see a couple of ways > to fix this. > h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition" > One solution is to use the following terms: > * Error class: 42 > * Error sub-class: K01 > * Error state: 42K01 > * Error condition: INCOMPLETE_TYPE_DEFINITION > * Error sub-condition: ARRAY, MAP, STRUCT > Pros: > * This terminology seems (to me at least) the most natural and intuitive. > * It aligns most closely to the SQL standard. > Cons: > * We use {{errorClass}} [all over our > codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30] > – literally in thousands of places – to refer to strings like > INCOMPLETE_TYPE_DEFINITION. > ** It's probably not practical to update all these usages to say > {{errorCondition}} instead, so if we go with this approach there will be a > divide between the terminology we use in user-facing documentation vs. what > the code base uses. 
> ** We can perhaps rename the existing {{error-classes.json}} to > {{error-conditions.json}} but clarify the reason for this divide between code > and user docs in the documentation for {{ErrorClassesJsonReader}} . > h1. Option 2: 42 becomes an "Error Category" > Another approach is to use the following terminology: > * Error category: 42 > * Error sub-category: K01 > * Error state: 42K01 > * Error class: INCOMPLETE_TYPE_DEFINITION > * Error sub-classes: ARRAY, MAP, STRUCT > Pros: > * We continue to use "error class" as we do today in our code base. > * The change from calling "42" a "class" to a "category" is low impact and > may not show up in user-facing documentation at all. (See my side note below.) > Cons: > * These terms do not align with the SQL standard. > * We will have to retire the term "error condition", which we have [already > used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md] > in user-facing documentation. > h1. Option 3: "Error Class" and "State Class" > * SQL state class: 42 > * SQL state sub-class: K01 > * SQL state: 42K01 > * Error class: INCOMPLETE_TYPE_DEFINITION > * Error sub-classes: ARRAY, MAP, STRUCT > Pros: > * We continue to use "error class" as we do today in our code base. > * The chang
[jira] [Updated] (SPARK-47281) Update the `versions.json` file for the already released spark version
[ https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-47281: Summary: Update the `versions.json` file for the already released spark version (was: Update the `versions.json` file for the already released saprk version) > Update the `versions.json` file for the already released spark version > --- > > Key: SPARK-47281 > URL: https://issues.apache.org/jira/browse/SPARK-47281 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47287) Aggregate in not causes
[ https://issues.apache.org/jira/browse/SPARK-47287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Chester Jenks updated SPARK-47287: -- Description: The below snippet is confirmed working with Spark 3.2.1 and broken with Spark 3.4.1. I believe this is a bug. {code:java} Dataset<Row> ds = dummyDataset .withColumn("flag", functions.not(functions.coalesce(functions.col("bool1"), functions.lit(false)).equalTo(true))) .groupBy("code") .agg(functions.max(functions.col("flag")).alias("flag")); ds.show(); {code} It fails with: {code:java} Caused by: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:208) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33) at org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateAggregateFunc(V2ExpressionBuilder.scala:293) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:98) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33) at org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803) at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.translate$1(DataSourceStrategy.scala:700){code} was: The below snippet is confirmed working with Spark 3.2.1 and broken with Spark 3.4.1. I believe this is a bug. {code:java} Dataset<Row> ds = dummyDataset .withColumn("flag", functions.not(functions.coalesce(functions.col("bool1"), functions.lit(false)).equalTo(true))) .groupBy("code") .agg(functions.max(functions.col("flag")).alias("flag")); ds.show(); {code} It fails with: {code:java} Caused by: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:208) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33) at org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateAggregateFunc(V2ExpressionBuilder.scala:293) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:98) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33) at org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803) at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.translate$1(DataSourceStrategy.scala:700) {code} > Aggregate in not causes > > > Key: SPARK-47287 > URL: https://issues.apache.org/jira/browse/SPARK-47287 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Ted Chester Jenks >Priority: Major > > > The below snippet is confirmed working with Spark 3.2.1 and broken with Spark > 3.4.1. I believe this is a bug. 
> {code:java} >Dataset ds = dummyDataset > .withColumn("flag", > functions.not(functions.coalesce(functions.col("bool1"), > functions.lit(false)).equalTo(true))) > .groupBy("code") > .agg(functions.max(functions.col("flag")).alias("flag")); > ds.show(); {code} > It fails with: > {code:java} > Caused by: java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:208) > at > org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184) > at > org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33) > at > org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803) > at > org.apache.spark.sql.
[jira] [Created] (SPARK-47287) Aggregate in not causes
Ted Chester Jenks created SPARK-47287: - Summary: Aggregate in not causes Key: SPARK-47287 URL: https://issues.apache.org/jira/browse/SPARK-47287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1 Reporter: Ted Chester Jenks The below snippet is confirmed working with Spark 3.2.1 and broken on Spark 3.4.1. I believe this is a bug. {code:java} Dataset ds = dummyDataset .withColumn("flag", functions.not(functions.coalesce(functions.col("bool1"), functions.lit(false)).equalTo(true))) .groupBy("code") .agg(functions.max(functions.col("flag")).alias("flag")); ds.show(); {code} It fails with: {code:java} Caused by: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:208) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33) at org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateAggregateFunc(V2ExpressionBuilder.scala:293) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:98) at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33) at org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803) at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.translate$1(DataSourceStrategy.scala:700) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
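For readers hitting the same assertion, an untested workaround sketch (not from the ticket; `dummyDataset` is the reporter's input): since `coalesce(bool1, false)` is never null, `not(x === true)` is equivalent to `x === false`, which may avoid routing a `Not` expression through the V2 aggregate push-down path. Shown in Scala rather than the report's Java:

{code:scala}
import org.apache.spark.sql.functions

// Untested workaround sketch: express the flag without functions.not(...).
// coalesce(bool1, false) is non-null, so not(x === true) == (x === false).
val ds = dummyDataset // hypothetical input Dataset from the report
  .withColumn("flag",
    functions.coalesce(functions.col("bool1"), functions.lit(false)).equalTo(false))
  .groupBy("code")
  .agg(functions.max(functions.col("flag")).alias("flag"))
ds.show()
{code}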
[jira] [Commented] (SPARK-47198) Is it possible to dynamically add backend service to ingress with Kubernetes?
[ https://issues.apache.org/jira/browse/SPARK-47198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823616#comment-17823616 ] Kent Yao commented on SPARK-47198: -- Hi [~melin], it would be better to ask your questions via the mailing list so that more people can see them and give them attention. > Is it possible to dynamically add backend service to ingress with Kubernetes? > - > > Key: SPARK-47198 > URL: https://issues.apache.org/jira/browse/SPARK-47198 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: melin >Priority: Major > > spark on k8s runs multiple spark apps at the same time. The proxy/[sparkappid] > path forwards to a different spark app UI console based on the sparkappid. Spark > apps are dynamically added and removed, so the ingress should dynamically add the spark svc. > [sparkappid]_svc == spark svc name > [https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html] > [~Qin Yao] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47198) Is it possible to dynamically add backend service to ingress with Kubernetes?
[ https://issues.apache.org/jira/browse/SPARK-47198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-47198. -- Resolution: Information Provided > Is it possible to dynamically add backend service to ingress with Kubernetes? > - > > Key: SPARK-47198 > URL: https://issues.apache.org/jira/browse/SPARK-47198 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: melin >Priority: Major > > spark on k8s runs multiple spark apps at the same time. The proxy/[sparkappid] > path forwards to a different spark app UI console based on the sparkappid. Spark > apps are dynamically added and removed, so the ingress should dynamically add the spark svc. > [sparkappid]_svc == spark svc name > [https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html] > [~Qin Yao] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47168) Disable parquet filter pushdown for non default collated strings
[ https://issues.apache.org/jira/browse/SPARK-47168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47168. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45262 [https://github.com/apache/spark/pull/45262] > Disable parquet filter pushdown for non default collated strings > > > Key: SPARK-47168 > URL: https://issues.apache.org/jira/browse/SPARK-47168 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Assignee: Stefan Kandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
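For context on why the pushdown must be disabled rather than adjusted, here is a minimal illustration in plain Scala (not Spark internals): Parquet filters compare raw UTF-8 bytes, while a case-insensitive collation treats byte-different strings as equal, so a byte-level filter could wrongly prune rows.

{code:scala}
// Illustration only: binary comparison vs. what a case-insensitive
// collation expects. A Parquet-level filter `col = 'abc'` would drop a
// row holding "ABC", which must survive under a *_LCASE collation.
val binaryEqual = "abc" == "ABC"                   // false: byte-level view
val collationEqual = "abc".equalsIgnoreCase("ABC") // true: query semantics
assert(binaryEqual != collationEqual) // hence pushdown is unsafe here
{code}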
[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"
[ https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TianyiMa updated SPARK-47279: - Description: we encountered that the spark driver hung for about 11 hours and was finally killed by the user. In the driver log there is an error log: {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:719) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} After detailed analysis, we found that the driver submitted task 0.0 at "16:40:50" to executor 4, and executor 4 finished task 0.0 at "16:42:39", then executor 4 sent the results to the driver. But at the same time, there was not sufficient memory in the server running the driver, so the driver was "unable to create new native thread" to handle the successful result of task 0.0; the driver then thought task 0.0 had not finished and waited for the "missed result" forever. driver submit task 0.0 !driver_submit_task.png! executor 4 task 0.0 !executor_4.png! oom-killer: !oom-killer.png! was: we encountered that the spark driver hung for about 11 hours and was finally killed by the user. 
In the driver log there is an error log: {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:719) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} After detailed analysis, we found that the driver submitted task 0.0 at "16:40:50" to executor 4, and executor 4 finished task 0.0 at "16:42:39", then executor 4 sent the results to the driver. But at the same time, there was not sufficient memory in the server running the driver, so the driver was "unable to create new native
[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"
[ https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TianyiMa updated SPARK-47279: - Attachment: oom-killer.png > spark driver process hangs due to "unable to create new native thread" > -- > > Key: SPARK-47279 > URL: https://issues.apache.org/jira/browse/SPARK-47279 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 3.1.1, 3.5.0 >Reporter: TianyiMa >Priority: Major > Labels: pull-request-available > Attachments: driver_submit_task.png, executor_4.png, oom-killer.png > > > we encountered that the spark driver hung for about 11 hours and was finally killed > by the user. In the driver log there is an error log: > {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error > happened while processing message in the inbox for CoarseGrainedScheduler > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:719) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) > at > org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) > at > org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) > at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > {quote} > > After detailed analysis, we found that the driver submitted task 0.0 at > "16:40:50" to executor 4, and executor 4 finished task 0.0 at "16:42:39", > then executor 4 sent the results to the driver. But at the same time, there was > not sufficient memory in the server running the driver, so the driver was > "unable to create new native thread" to handle the successful result of task > 0.0; the driver then thought task 0.0 had not finished and waited for the > "missed result" forever. > > driver submit task 0.0 > !driver_submit_task.png! > > executor 4 task 0.0 > !executor_4.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
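A minimal sketch (not the actual Spark code) of the failure mode described in this report: `ThreadPoolExecutor.execute` throws `java.lang.OutOfMemoryError` when the JVM cannot spawn a native thread, and because an `Error` is not an `Exception`, a handler that only catches exceptions lets it escape, so the "task finished" update is dropped and the scheduler waits forever.

{code:scala}
import java.util.concurrent.{Executors, RejectedExecutionException}

// Sketch of the hazard: enqueueing work on a pool that must create a new
// native thread can throw OutOfMemoryError (an Error, not an Exception).
val pool = Executors.newCachedThreadPool()

def enqueueResult(handleResult: () => Unit): Unit =
  try pool.execute(() => handleResult())
  catch {
    case _: RejectedExecutionException =>
      // anticipated failure: the pool was shut down
    // java.lang.OutOfMemoryError("unable to create new native thread")
    // is NOT matched here, so it propagates out of the message loop and
    // the successful task result is silently lost.
  }
{code}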
[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag
[ https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-47102: Assignee: Mihailo Milosevic > Add COLLATION_ENABLED config flag > - > > Key: SPARK-47102 > URL: https://issues.apache.org/jira/browse/SPARK-47102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Assignee: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error > class `COLLATION_SUPPORT_NOT_ENABLED` to appropriately report error on usage > of feature under development. > *Why are the changes needed?* > We want to make collations configurable on this flag. These changes disable > usage of `collate` and `collation` functions, along with any `COLLATE` syntax > when the flag is set to false. By default, the flag is set to false. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47102) Add COLLATION_ENABLED config flag
[ https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47102. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45285 [https://github.com/apache/spark/pull/45285] > Add COLLATION_ENABLED config flag > - > > Key: SPARK-47102 > URL: https://issues.apache.org/jira/browse/SPARK-47102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Assignee: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > *What changes were proposed in this pull request?* > This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error > class `COLLATION_SUPPORT_NOT_ENABLED` to appropriately report error on usage > of feature under development. > *Why are the changes needed?* > We want to make collations configurable on this flag. These changes disable > usage of `collate` and `collation` functions, along with any `COLLATE` syntax > when the flag is set to false. By default, the flag is set to false. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
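A hedged usage sketch of the flag described above. The exact configuration key lives in `SQLConf`; `spark.sql.collation.enabled` is assumed here from the flag's name:

{code:scala}
// Assumed key (derived from the COLLATION_ENABLED flag name); verify
// against SQLConf before relying on it.
spark.conf.set("spark.sql.collation.enabled", "true")
spark.sql(
  "SELECT 'aaa' COLLATE 'ucs_basic_lcase' = 'AAA' COLLATE 'ucs_basic_lcase'"
).show()
// With the flag at its default (false), the same query is expected to fail
// with the new error class COLLATION_SUPPORT_NOT_ENABLED.
{code}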
[jira] [Updated] (SPARK-46835) Join support for strings with collation
[ https://issues.apache.org/jira/browse/SPARK-46835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46835: --- Labels: pull-request-available (was: ) > Join support for strings with collation > --- > > Key: SPARK-46835 > URL: https://issues.apache.org/jira/browse/SPARK-46835 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47239) Support distinct window function
[ https://issues.apache.org/jira/browse/SPARK-47239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingliang Zhu updated SPARK-47239: -- Issue Type: New Feature (was: Bug) > Support distinct window function > > > Key: SPARK-47239 > URL: https://issues.apache.org/jira/browse/SPARK-47239 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.5.0 >Reporter: Mingliang Zhu >Priority: Major > Labels: pull-request-available > > Support distinct window function when window frame is entire partition frame > or > growing frame. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
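An illustrative query of the kind this feature targets, using a hypothetical table `sales`: a distinct aggregate over an entire-partition frame (the default frame when no ORDER BY is given), which is currently rejected:

{code:scala}
// Hypothetical example of a distinct window aggregate whose frame is the
// entire partition (no ORDER BY => unbounded preceding to unbounded following).
spark.sql("""
  SELECT key,
         COUNT(DISTINCT category) OVER (PARTITION BY key) AS distinct_categories
  FROM sales
""").show()
{code}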
[jira] [Created] (SPARK-47286) IN operator support
Aleksandar Tomic created SPARK-47286: Summary: IN operator support Key: SPARK-47286 URL: https://issues.apache.org/jira/browse/SPARK-47286 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 4.0.0 Reporter: Aleksandar Tomic At this point the following query works fine: ``` sql("select * from t1 where ucs_basic_lcase in ('aaa' collate 'ucs_basic_lcase', 'bbb' collate 'ucs_basic_lcase')").show() ``` But if we omit the explicit collate, or even mix collations: ``` sql("select * from t1 where ucs_basic_lcase in ('aaa' collate 'ucs_basic_lcase', 'bbb')").show() ``` the query would still run and return invalid results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47285) AdaptiveSparkPlanExec should always use the context.session
[ https://issues.apache.org/jira/browse/SPARK-47285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47285: --- Labels: pull-request-available (was: ) > AdaptiveSparkPlanExec should always use the context.session > --- > > Key: SPARK-47285 > URL: https://issues.apache.org/jira/browse/SPARK-47285 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47285) AdaptiveSparkPlanExec should always use the context.session
XiDuo You created SPARK-47285: - Summary: AdaptiveSparkPlanExec should always use the context.session Key: SPARK-47285 URL: https://issues.apache.org/jira/browse/SPARK-47285 Project: Spark Issue Type: Task Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45387) Partition key filter cannot be pushed down when using cast
[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823524#comment-17823524 ] TianyiMa commented on SPARK-45387: -- [~doki] the output execution plan is the final result, but the problem lies in the optimization process. In your example, the partition key is StringType but was cast to int to filter partitions. The driver will fetch all the partitions to do this filter. If you have a hive table with thousands of partitions, this process will be very slow and costly. > Partition key filter cannot be pushed down when using cast > -- > > Key: SPARK-45387 > URL: https://issues.apache.org/jira/browse/SPARK-45387 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0 >Reporter: TianyiMa >Priority: Critical > Attachments: PruneFileSourcePartitions.diff > > > Suppose we have a partitioned table `table_pt` with partition column `dt` > which is StringType and the table metadata is managed by Hive Metastore. If > we filter partitions by dt = '123', this filter can be pushed down to the data > source, but if the filter condition is a number, e.g. dt = 123, it cannot be > pushed down to the data source, causing spark to pull all of that table's > partition metadata to the client, which performs poorly if the table has > thousands of partitions and increases the risk of a hive metastore oom. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
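The practical takeaway from this thread, as a sketch (table and column names come from the issue's example): compare the string partition column against a string literal so no cast is inserted and the predicate stays eligible for metastore-level pruning.

{code:scala}
// Pruned at the metastore: the literal's type matches the partition
// column's type (string), so the filter pushes down.
spark.sql("SELECT * FROM table_pt WHERE dt = '123'").show()

// Not pruned: dt is cast to compare with the numeric literal 123, so the
// driver fetches metadata for every partition before filtering.
spark.sql("SELECT * FROM table_pt WHERE dt = 123").show()
{code}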
[jira] [Updated] (SPARK-44259) Make `connect-jvm-client` module pass except arrow-related ones in Java 21
[ https://issues.apache.org/jira/browse/SPARK-44259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44259: --- Labels: pull-request-available (was: ) > Make `connect-jvm-client` module pass except arrow-related ones in Java 21 > -- > > Key: SPARK-44259 > URL: https://issues.apache.org/jira/browse/SPARK-44259 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823512#comment-17823512 ] Asif commented on SPARK-33152: -- Other than using my PR, the safe option would be to disable the constraint propagation rule via a SQL conf, though that would mean losing optimizations related to push-down of new filters on the other side of join legs, etc. > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases (with certain use cases the compilation time can go into > hours), potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, does not push compound predicates in Join. > # This issue, if not fixed, can cause OutOfMemory issues or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. > *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account if the same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove on the basis of constraint data. In some > cases the rule works, just by virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule & its function of > removal of redundant filters & addition of new inferred filters is dependent > on how some of the other, unrelated, previous optimizer rules behave > is indicative of issues. > # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. > # There is at least one test in the existing ConstraintPropagationSuite which is > missing an IsNotNull constraint because the code incorrectly generated a > EqualsNullSafeConstraint instead of an EqualTo constraint, when using the > existing Constraints code. With these changes, the test correctly creates an > EqualTo constraint, resulting in an inferred IsNotNull constraint > # It does away with the current combinatorial logic of evaluating all the > constraints, which can cause compilation to run into hours or cause OOM. The number > of constraints stored is exactly the same as the number of filters encountered > h2. Q2. 
What problem is this proposal NOT designed to solve? > It mainly focuses on compile time performance, but in some cases can benefit > run time characteristics too, like inferring an IsNotNull filter or pushing down > compound predicates on the join, which currently may get missed / do not > happen, respectively, with the present code. > h2. Q3. How is it done today, and what are the limits of current practice? > Current ConstraintsPropagation code pessimistically tries to generate all > the possible combinations of constraints, based on the aliases (even then > it may miss a lot of combinations if the expression is a complex expression > involving the same attribute repeated multiple times within the expression and > there are many aliases to that column). There are query plans in our > production env which can result in the intermediate number of constraints going > into the hundreds of thousands, causing OOM or taking time running into hours. > Also there are cases where it incorrectly generates an EqualNullSafe > constraint instead of an EqualTo constraint, thus missing a possible IsNull > constraint on the column. > Also it only pushes single column predic
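For reference, the "disable the rule via SQL conf" workaround mentioned in this comment looks like the following; `spark.sql.constraintPropagation.enabled` is the SQLConf key for the rule, and disabling it trades the compile-time/OOM risk for the lost filter inference described above.

{code:scala}
// Workaround from the comment above: turn off constraint propagation.
// This avoids the combinatorial constraint blow-up at the cost of losing
// inferred IsNotNull filters and filter push-down across join legs.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
{code}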
[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823510#comment-17823510 ] Asif commented on SPARK-33152: -- [~tedjenks] Unfortunately I am not a committer. As part of Workday, I opened this Jira and a PR to fix this issue completely, which required different logic. The changes are extensive and were never reviewed or discussed by the OS community. This PR has been in production at Workday for the past 3 years. As to why a check was not added: that would be unclean and is not easy to implement in the current codebase, because it would result in various other issues like new/wrong filters being inferred and other messy bugs, as the constraint code is sensitive to the constraints coming from each node below and the constraints available at the current node when deciding whether to create new filters. Constraints are created per operator node (project, filter, etc.), and arbitrarily putting a limit on constraints at a given operator will impact the new filters being created. > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases (with certain use cases the compilation time can go into > hours), potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, does not push compound predicates in Join. > # This issue, if not fixed, can cause OutOfMemory issues or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. > *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account if the same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove on the basis of constraint data. In some > cases the rule works, just by virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule & its function of > removal of redundant filters & addition of new inferred filters is dependent > on how some of the other, unrelated, previous optimizer rules behave > is indicative of issues. 
> # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. > # There is at least one test in the existing ConstraintPropagationSuite which is > missing an IsNotNull constraint because the code incorrectly generated a > EqualsNullSafeConstraint instead of an EqualTo constraint, when using the > existing Constraints code. With these changes, the test correctly creates an > EqualTo constraint, resulting in an inferred IsNotNull constraint > # It does away with the current combinatorial logic of evaluating all the > constraints, which can cause compilation to run into hours or cause OOM. The number > of constraints stored is exactly the same as the number of filters encountered > h2. Q2. What problem is this proposal NOT designed to solve? > It mainly focuses on compile time performance, but in some cases can benefit > run time characteristics too, like inferring an IsNotNull filter or pushing down > compound predicates on the join, which currently may get missed / do not > happen, respectively, with the present code. > h2. Q3. How is it done today, and what are the limits of current practice? > Current ConstraintsPropagation code pessimistically tries to gene
[jira] [Updated] (SPARK-47284) We should ensure enough parallelism when ShuffleExchangeLike join with specs without shuffle
[ https://issues.apache.org/jira/browse/SPARK-47284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated SPARK-47284: --- Description: The following case is introduced by https://issues.apache.org/jira/browse/SPARK-35703 // When choosing specs, we should consider those children with no `ShuffleExchangeLike` node // first. For instance, if we have: // A: (No_Exchange, 100) <---> B: (Exchange, 120) // it's better to pick A and change B to (Exchange, 100) instead of picking B and insert a // new shuffle for A. *But we'd better improve it in some cases, for example:* A: (No_Exchange, 2) <---> B: (Exchange, 100) The current logic will change to: A: (No_Exchange, 2) <---> B: (Exchange,2) It actually does not ensure enough parallelism, and it will reduce performance, I think. was: The following case is introduced by https://issues.apache.org/jira/browse/SPARK-35703 // When choosing specs, we should consider those children with no `ShuffleExchangeLike` node // first. For instance, if we have: // A: (No_Exchange, 100) <---> B: (Exchange, 120) // it's better to pick A and change B to (Exchange, 100) instead of picking B and insert a // new shuffle for A. But we'd better improve it in some cases, for example: A: (No_Exchange, 2) <---> B: (Exchange, 100) The current logic will change to: A: (No_Exchange, 2) <---> B: (Exchange,2) It actually does not ensure enough parallelism, and it will reduce performance, I think. > We should ensure enough parallelism when ShuffleExchangeLike join with specs > without shuffle > > > Key: SPARK-47284 > URL: https://issues.apache.org/jira/browse/SPARK-47284 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Qi Zhu >Priority: Major > > The following case is introduced by > https://issues.apache.org/jira/browse/SPARK-35703 > // When choosing specs, we should consider those children with no > `ShuffleExchangeLike` node > // first. For instance, if we have: > // A: (No_Exchange, 100) <---> B: (Exchange, 120) > // it's better to pick A and change B to (Exchange, 100) instead of picking B > and insert a > // new shuffle for A. > *But we'd better improve it in some cases, for example:* > A: (No_Exchange, 2) <---> B: (Exchange, 100) > The current logic will change to: > A: (No_Exchange, 2) <---> B: (Exchange,2) > It actually does not ensure enough parallelism, and it will reduce performance, I > think. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47284) We should ensure enough parallelism when ShuffleExchangeLike join with specs without shuffle
Qi Zhu created SPARK-47284: -- Summary: We should ensure enough parallelism when ShuffleExchangeLike join with specs without shuffle Key: SPARK-47284 URL: https://issues.apache.org/jira/browse/SPARK-47284 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Qi Zhu The following case is introduced by https://issues.apache.org/jira/browse/SPARK-35703 // When choosing specs, we should consider those children with no `ShuffleExchangeLike` node // first. For instance, if we have: // A: (No_Exchange, 100) <---> B: (Exchange, 120) // it's better to pick A and change B to (Exchange, 100) instead of picking B and insert a // new shuffle for A. But we'd better improve it in some cases, for example: A: (No_Exchange, 2) <---> B: (Exchange, 100) The current logic will change to: A: (No_Exchange, 2) <---> B: (Exchange,2) It actually does not ensure enough parallelism, and it will reduce performance, I think. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
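A sketch of the improvement the ticket argues for (illustrative only, not the actual AQE spec-selection code): reuse the shuffle-free child's partitioning only when it provides enough parallelism, for some assumed minimum threshold.

{code:scala}
// Illustrative heuristic: prefer the no-shuffle side only if it is parallel
// enough; otherwise keep the larger partition count even though it means
// shuffling the other side. `minParallelism` is an assumed threshold.
def choosePartitionNum(noShuffleSide: Int, shuffleSide: Int, minParallelism: Int): Int =
  if (noShuffleSide >= minParallelism) noShuffleSide // cheap: no new shuffle
  else math.max(noShuffleSide, shuffleSide)          // e.g. (2, 100) -> 100, not 2

assert(choosePartitionNum(2, 100, minParallelism = 8) == 100)
assert(choosePartitionNum(100, 120, minParallelism = 8) == 100)
{code}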
[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
[ https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47277: -- Assignee: (was: Apache Spark) > PySpark util function assertDataFrameEqual should not support streaming DF > -- > > Key: SPARK-47277 > URL: https://issues.apache.org/jira/browse/SPARK-47277 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL, Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: Wei Liu >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47283) Remove Spark version drop down to the PySpark doc site
[ https://issues.apache.org/jira/browse/SPARK-47283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47283: -- Assignee: (was: Apache Spark) > Remove Spark version drop down to the PySpark doc site > -- > > Key: SPARK-47283 > URL: https://issues.apache.org/jira/browse/SPARK-47283 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47283) Remove Spark version drop down to the PySpark doc site
[ https://issues.apache.org/jira/browse/SPARK-47283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47283: -- Assignee: Apache Spark > Remove Spark version drop down to the PySpark doc site > -- > > Key: SPARK-47283 > URL: https://issues.apache.org/jira/browse/SPARK-47283 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
[ https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47277: -- Assignee: Apache Spark > PySpark util function assertDataFrameEqual should not support streaming DF > -- > > Key: SPARK-47277 > URL: https://issues.apache.org/jira/browse/SPARK-47277 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL, Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: Wei Liu >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47283) Remove Spark version drop down to the PySpark doc site
[ https://issues.apache.org/jira/browse/SPARK-47283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47283: --- Labels: pull-request-available (was: ) > Remove Spark version drop down to the PySpark doc site > -- > > Key: SPARK-47283 > URL: https://issues.apache.org/jira/browse/SPARK-47283 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
[ https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47277: -- Assignee: (was: Apache Spark) > PySpark util function assertDataFrameEqual should not support streaming DF > -- > > Key: SPARK-47277 > URL: https://issues.apache.org/jira/browse/SPARK-47277 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL, Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: Wei Liu >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
[ https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47277: -- Assignee: Apache Spark > PySpark util function assertDataFrameEqual should not support streaming DF > -- > > Key: SPARK-47277 > URL: https://issues.apache.org/jira/browse/SPARK-47277 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL, Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: Wei Liu >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823506#comment-17823506 ] Ted Chester Jenks commented on SPARK-33152: --- [~ashahid7] I see. This is very painful for us because a bunch of builds that used to work are now hanging indefinitely. Is there any reason there was never a check added to see if the set of constraints was getting too large, or perhaps an optional config you could use to set a max number of constraints? > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases (with certain use cases the compilation time can go into > hours), potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, does not push compound predicates in Join. > # This issue, if not fixed, can cause OutOfMemory issues or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. > *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account if the same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove on the basis of constraint data. In some > cases the rule works, just by virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule & its function of > removal of redundant filters & addition of new inferred filters is dependent > on how some of the other, unrelated, previous optimizer rules behave > is indicative of issues. > # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. > # There is at least one test in the existing ConstraintPropagationSuite which is > missing an IsNotNull constraint because the code incorrectly generated a > EqualsNullSafeConstraint instead of an EqualTo constraint, when using the > existing Constraints code. With these changes, the test correctly creates an > EqualTo constraint, resulting in an inferred IsNotNull constraint > # It does away with the current combinatorial logic of evaluating all the > constraints, which can cause compilation to run into hours or cause OOM. 
The number > of constraints stored is exactly the same as the number of filters encountered > h2. Q2. What problem is this proposal NOT designed to solve? > It mainly focuses on compile time performance, but in some cases can benefit > run time characteristics too, like inferring an IsNotNull filter or pushing down > compound predicates on the join, which currently may get missed / do not > happen, respectively, with the present code. > h2. Q3. How is it done today, and what are the limits of current practice? > Current ConstraintsPropagation code pessimistically tries to generate all > the possible combinations of constraints, based on the aliases (even then > it may miss a lot of combinations if the expression is a complex expression > involving the same attribute repeated multiple times within the expression and > there are many aliases to that column). There are query plans in our > production env which can result in the intermediate number of constraints going > into the hundreds of thousands, causing OOM or taking time running into hours. > Also there are cases where it incorrectly generates an EqualNullSafe > constraint instead of Equal
[jira] [Created] (SPARK-47283) Remove Spark version drop down to the PySpark doc site
BingKun Pan created SPARK-47283: --- Summary: Remove Spark version drop down to the PySpark doc site Key: SPARK-47283 URL: https://issues.apache.org/jira/browse/SPARK-47283 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.5.1, 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
[ https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47277: -- Assignee: (was: Apache Spark) > PySpark util function assertDataFrameEqual should not support streaming DF > -- > > Key: SPARK-47277 > URL: https://issues.apache.org/jira/browse/SPARK-47277 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL, Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: Wei Liu >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
[ https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47277: -- Assignee: Apache Spark > PySpark util function assertDataFrameEqual should not support streaming DF > -- > > Key: SPARK-47277 > URL: https://issues.apache.org/jira/browse/SPARK-47277 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL, Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: Wei Liu >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
[ https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47277: -- Assignee: (was: Apache Spark) > PySpark util function assertDataFrameEqual should not support streaming DF > -- > > Key: SPARK-47277 > URL: https://issues.apache.org/jira/browse/SPARK-47277 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL, Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: Wei Liu >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47281) Update the `versions.json` file for the already released saprk version
[ https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47281: -- Assignee: (was: Apache Spark) > Update the `versions.json` file for the already released saprk version > --- > > Key: SPARK-47281 > URL: https://issues.apache.org/jira/browse/SPARK-47281 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
[ https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47277: -- Assignee: Apache Spark > PySpark util function assertDataFrameEqual should not support streaming DF > -- > > Key: SPARK-47277 > URL: https://issues.apache.org/jira/browse/SPARK-47277 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL, Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: Wei Liu >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47281) Update the `versions.json` file for the already released saprk version
[ https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47281: -- Assignee: Apache Spark > Update the `versions.json` file for the already released saprk version > --- > > Key: SPARK-47281 > URL: https://issues.apache.org/jira/browse/SPARK-47281 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47281) Update the `versions.json` file for the already released saprk version
[ https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47281: -- Assignee: Apache Spark > Update the `versions.json` file for the already released saprk version > --- > > Key: SPARK-47281 > URL: https://issues.apache.org/jira/browse/SPARK-47281 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47281) Update the `versions.json` file for the already released saprk version
[ https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47281: -- Assignee: (was: Apache Spark) > Update the `versions.json` file for the already released saprk version > --- > > Key: SPARK-47281 > URL: https://issues.apache.org/jira/browse/SPARK-47281 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47148) Avoid to materialize AQE ExchangeQueryStageExec on the cancellation
[ https://issues.apache.org/jira/browse/SPARK-47148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47148: -- Assignee: (was: Apache Spark) > Avoid to materialize AQE ExchangeQueryStageExec on the cancellation > --- > > Key: SPARK-47148 > URL: https://issues.apache.org/jira/browse/SPARK-47148 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 4.0.0 >Reporter: Eren Avsarogullari >Priority: Major > Labels: pull-request-available > > AQE can materialize both *ShuffleQueryStage* and *BroadcastQueryStage* on > cancellation. This causes unnecessary stage materialization by submitting a > Shuffle Job and a Broadcast Job. Under normal circumstances, if the stage is > not yet materialized (a.k.a. *ShuffleQueryStage.shuffleFuture* or > *{{BroadcastQueryStage.broadcastFuture}}* is not initialized yet), it should > just be skipped without materializing it. > Please find sample use-case: > *1- Stage Materialization Steps:* > When stage materialization is failed: > {code:java} > 1.1- ShuffleQueryStage1 - is materialized successfully, > 1.2- ShuffleQueryStage2 - materialization is failed, > 1.3- ShuffleQueryStage3 - Not materialized yet so > ShuffleQueryStage3.shuffleFuture is not initialized yet{code} > *2- Stage Cancellation Steps:* > {code:java} > 2.1- ShuffleQueryStage1 - is canceled due to already materialized, > 2.2- ShuffleQueryStage2 - is earlyFailedStage so currently, it is skipped as > default by AQE because it could not be materialized, > 2.3- ShuffleQueryStage3 - Problem is here: This stage is not materialized yet, > but currently cancellation is also attempted on it, and that requires it to be > materialized first.{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47282) 'parseTableIdentifier' fails when a catalog name is provided
Denis Tarima created SPARK-47282: Summary: 'parseTableIdentifier' fails when a catalog name is provided Key: SPARK-47282 URL: https://issues.apache.org/jira/browse/SPARK-47282 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0, 4.0.0 Reporter: Denis Tarima {code:scala} spark.sessionState.sqlParser.parseTableIdentifier( "`my catalog`.`my database`.`my table`" ) {code} fails with {code:scala} org.apache.spark.sql.catalyst.parser.ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near '.'.(line 1, pos 26) == SQL == `my catalog`.`my database`.`my table` --^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:257) at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:98) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:54) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableIdentifier(AbstractSqlParser.scala:41) {code} Note: It works as expected on Databricks clusters (verified with Spark 3.3.2 and 3.5.0).
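A hedged side note for anyone hitting the same error: `parseTableIdentifier` returns a two-part `TableIdentifier` (database and table), so a three-part catalog-qualified name does not fit its grammar. Assuming stock Apache Spark with an active session named `spark`, `parseMultipartIdentifier` on the same parser accepts any number of name parts:
{code:scala}
// Sketch assuming a live SparkSession named `spark`: parseMultipartIdentifier handles
// catalog-qualified names that parseTableIdentifier rejects.
val parts: Seq[String] = spark.sessionState.sqlParser
  .parseMultipartIdentifier("`my catalog`.`my database`.`my table`")
// parts == Seq("my catalog", "my database", "my table")
{code}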
[jira] [Updated] (SPARK-47177) Cached SQL plan does not display final AQE plan in explain string
[ https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-47177: -- Fix Version/s: 3.4.3 > Cached SQL plan does not display final AQE plan in explain string > --- > > Key: SPARK-47177 > URL: https://issues.apache.org/jira/browse/SPARK-47177 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2 >Reporter: Ziqi Liu >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > > An AQE plan is expected to display the final plan after execution. This is not true for cached SQL plans: they show the initial plan instead. This behavior change was introduced in [https://github.com/apache/spark/pull/40812], which tried to fix a concurrency issue with cached plans. > *In short, the plan that is executed and the plan that is explained are not the same instance, which causes the inconsistency.* > > I don't have a clear idea of a fix yet: > * maybe we just add a coarse-granularity lock in explain? > * make innerChildren a function: clone the initial plan, and on every call check whether the original AQE plan is finalized (making the final flag atomic first, of course); if not, return the cloned initial plan, and if it is finalized, clone the final plan and return that one (see the sketch after this message). This still won't reflect the AQE plan in real time in a concurrent situation, but at least we have an initial version and a final version. > > A simple repro: > {code:java} > d1 = spark.range(1000).withColumn("key", expr("id % 100")).groupBy("key").agg({"key": "count"}) > cached_d2 = d1.cache() > df = cached_d2.filter("key > 10") > df.collect() {code} > {code:java} > >>> df.explain() > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=true > +- == Final Plan == > *(1) Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- TableCacheQueryStage 0 > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10) > +- == Initial Plan == > Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10){code}
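To make the second idea in the list above concrete, here is a minimal self-contained Scala sketch of clone-on-read guarded by an atomically set final flag. `Plan` and `AdaptivePlanHolder` are simplified stand-ins invented for illustration, not Spark's actual classes:
{code:scala}
// Illustrative sketch (simplified types, not Spark internals): explain() clones whichever
// snapshot (initial or final) is current, chosen via an atomic reference set once by the
// execution thread when AQE finalizes the plan.
import java.util.concurrent.atomic.AtomicReference

final case class Plan(description: String) {
  def cloned: Plan = copy()
}

class AdaptivePlanHolder(initial: Plan) {
  // None until AQE finalizes; written exactly once by the execution thread.
  private val finalPlan = new AtomicReference[Option[Plan]](None)

  def markFinalized(p: Plan): Unit = finalPlan.set(Some(p))

  // Readers never observe a half-replanned tree: they get a clone of a stable snapshot.
  def planForExplain: Plan = finalPlan.get() match {
    case Some(p) => p.cloned       // finalized: show the final plan
    case None    => initial.cloned // still re-optimizing: show the initial plan
  }
}
{code}
As the description itself notes, this cannot show an in-progress replan, but it guarantees the explain output is always a consistent initial or final snapshot.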
[jira] [Updated] (SPARK-47281) Update the `versions.json` file for the already released Spark version
[ https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-47281: Affects Version/s: 3.5.1 > Update the `versions.json` file for the already released Spark version > --- > > Key: SPARK-47281 > URL: https://issues.apache.org/jira/browse/SPARK-47281 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available >
[jira] [Updated] (SPARK-47281) Update the `versions.json` file for the already released Spark version
[ https://issues.apache.org/jira/browse/SPARK-47281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47281: --- Labels: pull-request-available (was: ) > Update the `versions.json` file for the already released Spark version > --- > > Key: SPARK-47281 > URL: https://issues.apache.org/jira/browse/SPARK-47281 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available >
[jira] [Updated] (SPARK-47279) Spark driver process hangs due to "unable to create new native thread"
[ https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47279: --- Labels: pull-request-available (was: ) > Spark driver process hangs due to "unable to create new native thread" > -- > > Key: SPARK-47279 > URL: https://issues.apache.org/jira/browse/SPARK-47279 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 3.1.1, 3.5.0 >Reporter: TianyiMa >Priority: Major > Labels: pull-request-available > Attachments: driver_submit_task.png, executor_4.png > > > We encountered a case where the Spark driver hung for about 11 hours and was finally killed by the user. The driver log contains this error: > {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:719) > at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) > at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) > at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) > at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > {quote} > > After detailed analysis, we found that the driver submitted task 0.0 to executor 4 at "16:40:50", and executor 4 finished task 0.0 at "16:42:39" and sent the result back to the driver. But at the same time the server running the driver did not have sufficient memory, so the driver was "unable to create new native thread" to handle the successful result of task 0.0; the driver therefore considered task 0.0 unfinished and waited for the "missed" result forever. > > driver submits task 0.0 > !driver_submit_task.png! > > executor 4 runs task 0.0 > !executor_4.png! >
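As a general pattern (a defensive sketch under stated assumptions, not the patch attached to this issue): the hang happens because the OutOfMemoryError thrown while scheduling result handling is effectively swallowed, so the scheduler never learns the result was lost. Catching the error at the submission site and reporting a failure avoids the silent wait. `onLostResult` below is a hypothetical callback invented for illustration, not a Spark API:
{code:scala}
// Defensive sketch: surface thread-creation failures instead of dropping the task result.
import java.util.concurrent.{Executors, RejectedExecutionException}

object ResultHandlingSketch {
  private val pool = Executors.newFixedThreadPool(4)

  // onLostResult stands in for whatever would re-enqueue the result or abort the task set.
  def enqueueResult(handle: () => Unit, onLostResult: Throwable => Unit): Unit = {
    try pool.execute(() => handle())
    catch {
      // Thread creation can fail under memory pressure; report it rather than hang.
      case e @ (_: OutOfMemoryError | _: RejectedExecutionException) => onLostResult(e)
    }
  }

  def main(args: Array[String]): Unit = {
    enqueueResult(() => println("result handled"), e => println(s"result lost: $e"))
    pool.shutdown()
  }
}
{code}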
[jira] [Created] (SPARK-47281) Update the `versions.json` file for the already released Spark version
BingKun Pan created SPARK-47281: --- Summary: Update the `versions.json` file for the already released Spark version Key: SPARK-47281 URL: https://issues.apache.org/jira/browse/SPARK-47281 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: BingKun Pan
[jira] [Updated] (SPARK-47280) Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE
[ https://issues.apache.org/jira/browse/SPARK-47280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47280: --- Labels: pull-request-available (was: ) > Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE > - > > Key: SPARK-47280 > URL: https://issues.apache.org/jira/browse/SPARK-47280 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-47210) Implicit casting on collated expressions
[ https://issues.apache.org/jira/browse/SPARK-47210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47210: --- Labels: pull-request-available (was: ) > Implicit casting on collated expressions > > > Key: SPARK-47210 > URL: https://issues.apache.org/jira/browse/SPARK-47210 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-47280) Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE
Kent Yao created SPARK-47280: Summary: Remove timezone limitation for ORACLE TIMESTAMP WITH TIMEZONE Key: SPARK-47280 URL: https://issues.apache.org/jira/browse/SPARK-47280 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao