[jira] [Updated] (SPARK-44541) Remove useless function `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker`
[ https://issues.apache.org/jira/browse/SPARK-44541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44541: - Summary: Remove useless function `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker` (was: Remove useless funciton `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker`) > Remove useless function `hasRangeExprAgainstEventTimeCol` from > `UnsupportedOperationChecker` > > > Key: SPARK-44541 > URL: https://issues.apache.org/jira/browse/SPARK-44541 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > > Function `hasRangeExprAgainstEventTimeCol` was introduced by SPARK-40940 and > is no longer used after SPARK-42376 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44541) Remove useless funciton `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker`
Yang Jie created SPARK-44541: Summary: Remove useless funciton `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker` Key: SPARK-44541 URL: https://issues.apache.org/jira/browse/SPARK-44541 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yang Jie Function `hasRangeExprAgainstEventTimeCol` was introduced by SPARK-40940 and is no longer used after SPARK-42376
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 5:41 AM: After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems to produce a sorted output as expected, even without the workaround. (And I found that this option cannot be set in code directly. It must be set in spark-submit. This config option is also undocumented.) was (Author: JIRAUSER301473): After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems to produce a sorted output. (And I found that this option cannot be set in code directly. It must be set in spark-submit. This config option is also undocumented) > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. 
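The comment above notes that spark.sql.optimizer.plannedWrite.enabled cannot be set in code and must be passed at submit time. A hypothetical spark-submit invocation might look like the following config fragment; only the --conf line comes from the comment, while the master, class name, and jar are placeholder values:

```shell
# Placeholder invocation: master/class/jar are illustrative, not from the issue.
spark-submit \
  --master local[4] \
  --class com.example.PartitionBySortRepro \
  --conf spark.sql.optimizer.plannedWrite.enabled=false \
  repro.jar
```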
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Description: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem. was: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found then when AQE is enabled, the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem. 
> dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44540) Remove unused stylesheet and javascript files of jsonFormatter
[ https://issues.apache.org/jira/browse/SPARK-44540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746781#comment-17746781 ] ci-cassandra.apache.org commented on SPARK-44540: - User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/42145 > Remove unused stylesheet and javascript files of jsonFormatter > -- > > Key: SPARK-44540 > URL: https://issues.apache.org/jira/browse/SPARK-44540 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.5.0 >Reporter: Kent Yao >Priority: Major > > jsonFormatter.min.css and jsonFormatter.min.js are unreachable
[jira] [Commented] (SPARK-44454) HiveShim getTablesByType support fallback
[ https://issues.apache.org/jira/browse/SPARK-44454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746751#comment-17746751 ] ci-cassandra.apache.org commented on SPARK-44454: - User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/42033 > HiveShim getTablesByType support fallback > - > > Key: SPARK-44454 > URL: https://issues.apache.org/jira/browse/SPARK-44454 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: dzcxzl >Priority: Minor > > When we use a high version of Hive Client to communicate with a low version > of Hive meta store, we may encounter Invalid method name: > 'get_tables_by_type'. > > {code:java} > 23/07/17 12:45:24,391 [main] DEBUG SparkSqlParser: Parsing command: show views > 23/07/17 12:45:24,489 [main] ERROR log: Got exception: > org.apache.thrift.TApplicationException Invalid method name: > 'get_tables_by_type' > org.apache.thrift.TApplicationException: Invalid method name: > 'get_tables_by_type' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_tables_by_type(ThriftHiveMetastore.java:1433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_tables_by_type(ThriftHiveMetastore.java:1418) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTables(HiveMetaStoreClient.java:1411) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2344) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByType(Hive.java:1427) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.sql.hive.client.Shim_v2_3.getTablesByType(HiveShim.scala:1408) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274) > at > org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:785) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:895) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:108) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listViews(HiveExternalCatalog.scala:893) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listViews(ExternalCatalogWithListener.scala:158) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listViews(SessionCatalog.scala:1040) > at > 
org.apache.spark.sql.execution.command.ShowViewsCommand.$anonfun$run$5(views.scala:407) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.command.ShowViewsCommand.run(views.scala:407) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
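The fallback requested above can be illustrated with a small sketch. This is not Spark's actual HiveShim code; the client class, method names, and error string are modeled on the Thrift error in the stack trace (`Invalid method name: 'get_tables_by_type'`), and the fallback simply lists all tables and filters them by type client-side:

```python
# Illustrative sketch of the SPARK-44454 idea (not Spark's actual code):
# if the old metastore does not implement get_tables_by_type, catch the
# "Invalid method name" error and emulate the call on the client side.

class OldMetastoreClient:
    """Mimics a low-version Hive metastore with no get_tables_by_type RPC."""

    def __init__(self, tables):
        self._tables = tables  # list of (name, table_type) pairs

    def get_tables_by_type(self, db, pattern, table_type):
        # Old servers reject the newer RPC with a Thrift-style error.
        raise RuntimeError("Invalid method name: 'get_tables_by_type'")

    def get_tables(self, db, pattern):
        return [name for name, _ in self._tables]

    def get_table_type(self, db, name):
        return dict(self._tables)[name]


def get_tables_by_type_with_fallback(client, db, pattern, table_type):
    try:
        return client.get_tables_by_type(db, pattern, table_type)
    except RuntimeError as e:
        if "Invalid method name" not in str(e):
            raise  # unrelated failure: propagate
        # Fallback path: list every table, then filter by type.
        return [n for n in client.get_tables(db, pattern)
                if client.get_table_type(db, n) == table_type]


client = OldMetastoreClient([("v1", "VIRTUAL_VIEW"), ("t1", "MANAGED_TABLE")])
print(get_tables_by_type_with_fallback(client, "default", "*", "VIRTUAL_VIEW"))  # ['v1']
```

The fallback trades one round trip for several, so real code would only take it after the first failure and could cache that the server lacks the RPC.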
[jira] [Resolved] (SPARK-44523) Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44523. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42126 [https://github.com/apache/spark/pull/42126] > Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral > -- > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
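The improvement above can be sketched with a toy model. This is not Spark's Catalyst code; the node classes are illustrative, but they show the bound being tightened: a Filter whose condition is the literal false can never emit rows, so it can report maxRows = 0 instead of inheriting the child's bound:

```python
# Toy model of the SPARK-44523 idea (not Spark's actual Catalyst classes):
# a Filter over a literal-false condition produces no rows, so its
# maxRows/maxRowsPerPartition bound can be 0 rather than the child's bound.

from dataclasses import dataclass
from typing import Optional

FALSE_LITERAL = ("lit", False)  # stand-in for Catalyst's FalseLiteral


@dataclass
class Relation:
    row_count: int

    def max_rows(self) -> Optional[int]:
        return self.row_count


@dataclass
class Filter:
    condition: tuple
    child: Relation

    def max_rows(self) -> Optional[int]:
        if self.condition == FALSE_LITERAL:
            return 0  # WHERE false: no row can pass
        return self.child.max_rows()  # a filter never adds rows


print(Filter(FALSE_LITERAL, Relation(1000)).max_rows())   # 0
print(Filter(("gt", "x", 1), Relation(1000)).max_rows())  # 1000
```

A tighter bound lets downstream rules (e.g. eliminating limits or sorts over provably empty plans) fire more often.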
[jira] [Assigned] (SPARK-44523) Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44523: Assignee: Yuming Wang > Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral > -- > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44540) Remove unused stylesheet and javascript files of jsonFormatter
Kent Yao created SPARK-44540: Summary: Remove unused stylesheet and javascript files of jsonFormatter Key: SPARK-44540 URL: https://issues.apache.org/jira/browse/SPARK-44540 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.5.0 Reporter: Kent Yao jsonFormatter.min.css and jsonFormatter.min.js are unreachable
[jira] [Resolved] (SPARK-44466) Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44466. - Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42049 [https://github.com/apache/spark/pull/42049] > Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX > from modifiedConfigs > > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.5.0, 4.0.0 > > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
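The exclusion described above amounts to a prefix filter over the config map. The sketch below is illustrative, not Spark's code, and it assumes the two prefixes resolve to "spark.driver." and "spark.executor." (the actual constants live in Spark's internal config package):

```python
# Illustrative sketch of the SPARK-44466 fix (not Spark's actual code):
# drop entries whose keys start with the driver/executor prefixes before
# reporting modified configs. Prefix values below are assumptions.

SPARK_DRIVER_PREFIX = "spark.driver."      # assumed value
SPARK_EXECUTOR_PREFIX = "spark.executor."  # assumed value


def modified_configs(configs: dict) -> dict:
    """Return configs minus driver/executor-prefixed entries."""
    return {k: v for k, v in configs.items()
            if not k.startswith((SPARK_DRIVER_PREFIX, SPARK_EXECUTOR_PREFIX))}


confs = {
    "spark.sql.shuffle.partitions": "100",
    "spark.driver.extraJavaOptions": "-Xmx1g",
    "spark.executor.memory": "4g",
}
print(modified_configs(confs))  # {'spark.sql.shuffle.partitions': '100'}
```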
[jira] [Assigned] (SPARK-44466) Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44466: --- Assignee: Yuming Wang > Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX > from modifiedConfigs > > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44509) Fine grained interrupt in Python Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-44509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44509. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42120 [https://github.com/apache/spark/pull/42120] > Fine grained interrupt in Python Spark Connect > -- > > Key: SPARK-44509 > URL: https://issues.apache.org/jira/browse/SPARK-44509 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Same as https://issues.apache.org/jira/browse/SPARK-44422 but need it for > Python > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44509) Fine grained interrupt in Python Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-44509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44509: Assignee: Hyukjin Kwon > Fine grained interrupt in Python Spark Connect > -- > > Key: SPARK-44509 > URL: https://issues.apache.org/jira/browse/SPARK-44509 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Same as https://issues.apache.org/jira/browse/SPARK-44422 but need it for > Python > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44450) Make direct Arrow encoding work with SQL/API
[ https://issues.apache.org/jira/browse/SPARK-44450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-44450: - Assignee: Herman van Hövell > Make direct Arrow encoding work with SQL/API > > > Key: SPARK-44450 > URL: https://issues.apache.org/jira/browse/SPARK-44450 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Component/s: Optimizer > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 1:33 AM: After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. (And I found that this option cannot be set in code directly. It must be set in spark-submit. This config option is also undocumented) was (Author: JIRAUSER301473): After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. (And I found that this option cannot be set in code directly. It must be set in spark-submit) > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44539) Upgrade RoaringBitmap to 0.9.46
BingKun Pan created SPARK-44539: --- Summary: Upgrade RoaringBitmap to 0.9.46 Key: SPARK-44539 URL: https://issues.apache.org/jira/browse/SPARK-44539 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44538) Remove ToJsonUtil
Herman van Hövell created SPARK-44538: - Summary: Remove ToJsonUtil Key: SPARK-44538 URL: https://issues.apache.org/jira/browse/SPARK-44538 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.4.1 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 1:06 AM: After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. (And I found that this option cannot be set in code directly. It must be set in spark-submit) was (Author: JIRAUSER301473): After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 1:05 AM: After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. was (Author: JIRAUSER301473): -After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output.- > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512 ] Yiu-Chung Lee deleted comment on SPARK-44512: --- was (Author: JIRAUSER301473): No. After testing another production data, spark.sql.optimizer.plannedWrite.enabled=false does not solve the problem either. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 1:03 AM: -After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output.- was (Author: JIRAUSER301473): After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746715#comment-17746715 ] Yiu-Chung Lee commented on SPARK-44512: --- No. After testing another production data, spark.sql.optimizer.plannedWrite.enabled=false does not solve the problem either. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44449) Add upcasting to Arrow deserializers
[ https://issues.apache.org/jira/browse/SPARK-44449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44449. --- Fix Version/s: 3.5.0 Assignee: Herman van Hövell Resolution: Fixed > Add upcasting to Arrow deserializers > > > Key: SPARK-44449 > URL: https://issues.apache.org/jira/browse/SPARK-44449 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44537) Upgrade kubernetes-client to 6.8.0
BingKun Pan created SPARK-44537: --- Summary: Upgrade kubernetes-client to 6.8.0 Key: SPARK-44537 URL: https://issues.apache.org/jira/browse/SPARK-44537 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44486) Implement PyArrow `self_destruct` feature for `toPandas`
[ https://issues.apache.org/jira/browse/SPARK-44486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44486: Assignee: Xinrong Meng > Implement PyArrow `self_destruct` feature for `toPandas` > > > Key: SPARK-44486 > URL: https://issues.apache.org/jira/browse/SPARK-44486 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Implement PyArrow `self_destruct` feature for `toPandas` > To make the Spark configuration > `spark.sql.execution.arrow.pyspark.selfDestruct.enabled` be used to enable > PyArrow’s `self_destruct` feature in Spark Connect, which can save memory > when creating a Pandas DataFrame via `toPandas` by freeing Arrow-allocated > memory while building the Pandas DataFrame. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44486) Implement PyArrow `self_destruct` feature for `toPandas`
[ https://issues.apache.org/jira/browse/SPARK-44486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44486. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42079 [https://github.com/apache/spark/pull/42079] > Implement PyArrow `self_destruct` feature for `toPandas` > > > Key: SPARK-44486 > URL: https://issues.apache.org/jira/browse/SPARK-44486 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Implement PyArrow `self_destruct` feature for `toPandas` > To make the Spark configuration > `spark.sql.execution.arrow.pyspark.selfDestruct.enabled` be used to enable > PyArrow’s `self_destruct` feature in Spark Connect, which can save memory > when creating a Pandas DataFrame via `toPandas` by freeing Arrow-allocated > memory while building the Pandas DataFrame. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44536) Upgrade sbt to 1.9.3
BingKun Pan created SPARK-44536: --- Summary: Upgrade sbt to 1.9.3 Key: SPARK-44536 URL: https://issues.apache.org/jira/browse/SPARK-44536 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44535) Move Streaming API to sql/api
Herman van Hövell created SPARK-44535: - Summary: Move Streaming API to sql/api Key: SPARK-44535 URL: https://issues.apache.org/jira/browse/SPARK-44535 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.4.1 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44500) parse_url treats key as regular expression
[ https://issues.apache.org/jira/browse/SPARK-44500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746705#comment-17746705 ] Pablo Langa Blanco commented on SPARK-44500: [~jan.chou...@gmail.com] What do you think? > parse_url treats key as regular expression > -- > > Key: SPARK-44500 > URL: https://issues.apache.org/jira/browse/SPARK-44500 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.4.1 >Reporter: Robert Joseph Evans >Priority: Major > > To be clear I am not 100% sure that this is a bug. It might be a feature, but > I don't see anywhere that it is used as a feature. If it is a feature it > really should be documented, because there are pitfalls. If it is a bug it > should be fixed because it is really confusing and it is simple to shoot > yourself in the foot. > ```scala > > val urls = Seq("http://foo/bar?abc=BAD&a.c=GOOD";, > > "http://foo/bar?a.c=GOOD&abc=BAD";).toDF > > urls.selectExpr("parse_url(value, 'QUERY', 'a.c')").show(false) > ++ > |parse_url(value, QUERY, a.c)| > ++ > |BAD | > |GOOD| > ++ > > urls.selectExpr("parse_url(value, 'QUERY', 'a[c')").show(false) > java.util.regex.PatternSyntaxException: Unclosed character class near index 15 > (&|^)a[c=([^&]*) >^ > at java.util.regex.Pattern.error(Pattern.java:1969) > at java.util.regex.Pattern.clazz(Pattern.java:2562) > at java.util.regex.Pattern.sequence(Pattern.java:2077) > at java.util.regex.Pattern.expr(Pattern.java:2010) > at java.util.regex.Pattern.compile(Pattern.java:1702) > at java.util.regex.Pattern.(Pattern.java:1352) > at java.util.regex.Pattern.compile(Pattern.java:1028) > ``` > The simple fix is to quote the key when making the pattern. 
> ```scala > private def getPattern(key: UTF8String): Pattern = { > Pattern.compile(REGEXPREFIX + Pattern.quote(key.toString) + REGEXSUBFIX) > } > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
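The pitfall and the proposed fix are easy to demonstrate outside Spark. The sketch below is a minimal pure-Python model of the pattern construction described above; the `REGEX_PREFIX`/`REGEX_SUFFIX` constants and the `query_value` helper are hypothetical stand-ins for Spark's `REGEXPREFIX`/`REGEXSUBFIX` and `getPattern`, and `re.escape` plays the role of Java's `Pattern.quote`:

```python
import re

REGEX_PREFIX = "(&|^)"     # stand-in for Spark's REGEXPREFIX
REGEX_SUFFIX = "=([^&]*)"  # stand-in for Spark's REGEXSUBFIX

def query_value(query, key, quote_key):
    # Build the pattern the way the report describes: the key is spliced
    # directly into the regex, so regex metacharacters in it stay live.
    k = re.escape(key) if quote_key else key
    m = re.search(REGEX_PREFIX + k + REGEX_SUFFIX, query)
    return m.group(2) if m else None

# Unquoted, '.' in the key matches any character, so 'a.c' first matches 'abc'.
assert query_value("abc=BAD&a.c=GOOD", "a.c", quote_key=False) == "BAD"
# Quoting the key (the Pattern.quote analogue) matches only the literal key.
assert query_value("abc=BAD&a.c=GOOD", "a.c", quote_key=True) == "GOOD"
```

An unbalanced key such as `a[c` raises a pattern-compilation error in the unquoted variant, mirroring the `PatternSyntaxException` shown in the report, while the quoted variant accepts it as a literal.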
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 11:32 PM: - After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output. was (Author: JIRAUSER301473): Setting spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 11:32 PM: - After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems to produce a sorted output. was (Author: JIRAUSER301473): After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 11:29 PM: - Setting spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output. was (Author: JIRAUSER301473): Setting {{spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output.}} > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee commented on SPARK-44512: --- Setting {{spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output.}} > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44534) Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents
Dongjoon Hyun created SPARK-44534: - Summary: Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents Key: SPARK-44534 URL: https://issues.apache.org/jira/browse/SPARK-44534 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.5.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44503) Support PARTITION BY and ORDER BY clause for table arguments
[ https://issues.apache.org/jira/browse/SPARK-44503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44503. --- Fix Version/s: 4.0.0 Assignee: Daniel Resolution: Fixed Issue resolved by pull request 42100 https://github.com/apache/spark/pull/42100 > Support PARTITION BY and ORDER BY clause for table arguments > > > Key: SPARK-44503 > URL: https://issues.apache.org/jira/browse/SPARK-44503 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44455) SHOW CREATE TABLE does not quote identifiers with special characters
[ https://issues.apache.org/jira/browse/SPARK-44455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-44455: -- Assignee: Runyao.Chen > SHOW CREATE TABLE does not quote identifiers with special characters > > > Key: SPARK-44455 > URL: https://issues.apache.org/jira/browse/SPARK-44455 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: Runyao.Chen >Assignee: Runyao.Chen >Priority: Major > > Create a table with special characters: > ``` > CREATE CATALOG `a_catalog-with+special^chars`; CREATE SCHEMA > `a_catalog-with+special^chars`.`a_schema-with+special^chars`; CREATE TABLE > `a_catalog-with+special^chars`.`a_schema-with+special^chars`.`table1` ( id > int, feat1 varchar(255), CONSTRAINT pk PRIMARY KEY (id,feat1) ); > ``` > Then run SHOW CREATE TABLE: > ``` > SHOW CREATE TABLE > `a_catalog-with+special^chars`.`a_schema-with+special^chars`.`table1`; > ``` > The response is: > ``` > createtab_stmt "CREATE TABLE > a_catalog-with+special^chars.a_schema-with+special^chars.table1 ( id INT NOT > NULL, feat1 VARCHAR(255) NOT NULL, CONSTRAINT pk PRIMARY KEY (id, feat1)) > USING delta TBLPROPERTIES ( 'delta.minReaderVersion' = '1', > 'delta.minWriterVersion' = '2') " > ``` > As you can see, the table name in the response is not properly escaped with > backticks. As a result, if a user copies and pastes this create table command > to recreate the table, it will fail: > {{[INVALID_IDENTIFIER] The identifier a_catalog-with is invalid. Please, > consider quoting it with back-quotes as `a_catalog-with`.(line 1, pos 22)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44516) Spark Connect Python StreamingQueryListener removeListener method actually shut down the listener process
[ https://issues.apache.org/jira/browse/SPARK-44516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-44516: Summary: Spark Connect Python StreamingQueryListener removeListener method actually shut down the listener process (was: Spark Connect Python StreamingQueryListener removeListener method) > Spark Connect Python StreamingQueryListener removeListener method actually > shut down the listener process > - > > Key: SPARK-44516 > URL: https://issues.apache.org/jira/browse/SPARK-44516 > Project: Spark > Issue Type: New Feature > Components: Connect, Structured Streaming >Affects Versions: 3.5.0, 4.0.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44455) SHOW CREATE TABLE does not quote identifiers with special characters
[ https://issues.apache.org/jira/browse/SPARK-44455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-44455. Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 42034 [https://github.com/apache/spark/pull/42034] > SHOW CREATE TABLE does not quote identifiers with special characters > > > Key: SPARK-44455 > URL: https://issues.apache.org/jira/browse/SPARK-44455 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: Runyao.Chen >Assignee: Runyao.Chen >Priority: Major > Fix For: 3.5.0 > > > Create a table with special characters: > ``` > CREATE CATALOG `a_catalog-with+special^chars`; CREATE SCHEMA > `a_catalog-with+special^chars`.`a_schema-with+special^chars`; CREATE TABLE > `a_catalog-with+special^chars`.`a_schema-with+special^chars`.`table1` ( id > int, feat1 varchar(255), CONSTRAINT pk PRIMARY KEY (id,feat1) ); > ``` > Then run SHOW CREATE TABLE: > ``` > SHOW CREATE TABLE > `a_catalog-with+special^chars`.`a_schema-with+special^chars`.`table1`; > ``` > The response is: > ``` > createtab_stmt "CREATE TABLE > a_catalog-with+special^chars.a_schema-with+special^chars.table1 ( id INT NOT > NULL, feat1 VARCHAR(255) NOT NULL, CONSTRAINT pk PRIMARY KEY (id, feat1)) > USING delta TBLPROPERTIES ( 'delta.minReaderVersion' = '1', > 'delta.minWriterVersion' = '2') " > ``` > As you can see, the table name in the response is not properly escaped with > backticks. As a result, if a user copies and pastes this create table command > to recreate the table, it will fail: > {{[INVALID_IDENTIFIER] The identifier a_catalog-with is invalid. Please, > consider quoting it with back-quotes as `a_catalog-with`.(line 1, pos 22)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44533) Add support for accumulator, broadcast, and Spark files in Python UDTF's analyze.
Takuya Ueshin created SPARK-44533: - Summary: Add support for accumulator, broadcast, and Spark files in Python UDTF's analyze. Key: SPARK-44533 URL: https://issues.apache.org/jira/browse/SPARK-44533 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44532) Move ArrowUtil to sql/api
Herman van Hövell created SPARK-44532: - Summary: Move ArrowUtil to sql/api Key: SPARK-44532 URL: https://issues.apache.org/jira/browse/SPARK-44532 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.4.1 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44531) Move encoder inference to sql/api
Herman van Hövell created SPARK-44531: - Summary: Move encoder inference to sql/api Key: SPARK-44531 URL: https://issues.apache.org/jira/browse/SPARK-44531 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.4.1 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44530) Move SparkBuildInfo to common/util
Herman van Hövell created SPARK-44530: - Summary: Move SparkBuildInfo to common/util Key: SPARK-44530 URL: https://issues.apache.org/jira/browse/SPARK-44530 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.4.1 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44529) Add a flag to resolve docker tags to hashes at launch time
Holden Karau created SPARK-44529: Summary: Add a flag to resolve docker tags to hashes at launch time Key: SPARK-44529 URL: https://issues.apache.org/jira/browse/SPARK-44529 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.5.0, 4.0.0 Reporter: Holden Karau If you have a Spark docker tag (like say 3.3) you might want to update the container but only for newly launched jobs, not existing jobs. To allow this we can resolve the tag to a hash at launch time. In some environments this may also offer a small performance improvement, as it saves K8s from having to re-resolve the tag with additional executor launches. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes and check for it
[ https://issues.apache.org/jira/browse/SPARK-44528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Grund updated SPARK-44528: - Summary: Spark Connect DataFrame does not allow to add custom instance attributes and check for it (was: Spark Connect DataFrame does not allow to add custom instance attributes) > Spark Connect DataFrame does not allow to add custom instance attributes and > check for it > - > > Key: SPARK-44528 > URL: https://issues.apache.org/jira/browse/SPARK-44528 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.1 >Reporter: Martin Grund >Priority: Major > > ``` > df = spark.range(10) > df._test = 10 > ``` > Treats `df._test` like a column -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes and check for it
[ https://issues.apache.org/jira/browse/SPARK-44528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Grund updated SPARK-44528: - Description: ``` df = spark.range(10) df._test = 10 assert(hasattr(df, "_test")) assert(not hasattr(df, "_test_no")) ``` Treats `df._test` like a column was: ``` df = spark.range(10) df._test = 10 ``` Treats `df._test` like a column > Spark Connect DataFrame does not allow to add custom instance attributes and > check for it > - > > Key: SPARK-44528 > URL: https://issues.apache.org/jira/browse/SPARK-44528 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.1 >Reporter: Martin Grund >Priority: Major > > ``` > df = spark.range(10) > df._test = 10 > assert(hasattr(df, "_test")) > assert(not hasattr(df, "_test_no")) > ``` > Treats `df._test` like a column -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
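The failing `hasattr` check described above is characteristic of a class whose `__getattr__` resolves every unknown attribute name to a column object. The sketch below is a minimal, hypothetical pure-Python illustration of that mechanism only; the `ConnectLikeFrame` and `FakeColumn` names are invented for this sketch and are not Spark Connect's actual implementation:

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark column reference."""
    def __init__(self, name):
        self.name = name

class ConnectLikeFrame:
    """Illustration: __getattr__ turns every unknown attribute into a
    'column', so hasattr() can never return False for this object."""
    def __getattr__(self, name):
        # Real column names and typos alike end up here.
        return FakeColumn(name)

df = ConnectLikeFrame()
# hasattr is True even for attributes that were never set:
assert hasattr(df, "_test_no")          # a plain object would give False
assert isinstance(df._anything, FakeColumn)
```

With such a `__getattr__`, code that probes a DataFrame with `hasattr` to detect its own custom attributes cannot distinguish a set attribute from a never-set one, which matches the behavior the ticket reports.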
[jira] [Created] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes
Martin Grund created SPARK-44528: Summary: Spark Connect DataFrame does not allow to add custom instance attributes Key: SPARK-44528 URL: https://issues.apache.org/jira/browse/SPARK-44528 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.4.1 Reporter: Martin Grund ``` ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes
[ https://issues.apache.org/jira/browse/SPARK-44528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Grund updated SPARK-44528: - Description: ``` df = spark.range(10) df._test = 10 ``` Treats `df._test` like a column was: ``` ``` > Spark Connect DataFrame does not allow to add custom instance attributes > > > Key: SPARK-44528 > URL: https://issues.apache.org/jira/browse/SPARK-44528 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.1 >Reporter: Martin Grund >Priority: Major > > ``` > df = spark.range(10) > df._test = 10 > ``` > Treats `df._test` like a column -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746516#comment-17746516 ] Ramakrishna edited comment on SPARK-44152 at 7/24/23 4:06 PM: -- Hello [~sdehaes] It should work if you copy the jar to the /usr/local/bin folder of your docker container. It worked for us. was (Author: hande): Hello [~sdehaes] It should work if you copy the jar to the /usr/local/bin folder. It worked for us. > Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0. > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746516#comment-17746516 ] Ramakrishna commented on SPARK-44152: - Hello [~sdehaes] It should work if you copy the jar to the /usr/local/bin folder. It worked for us. > Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0. > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44527) Simplify BinaryComparison if its children contain ScalarSubquery with empty output
[ https://issues.apache.org/jira/browse/SPARK-44527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746488#comment-17746488 ] Yuming Wang commented on SPARK-44527: - https://github.com/apache/spark/pull/42129 > Simplify BinaryComparison if its children contain ScalarSubquery with empty > output > -- > > Key: SPARK-44527 > URL: https://issues.apache.org/jira/browse/SPARK-44527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746025#comment-17746025 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 2:38 PM: Here is the [gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that reproduces the issue. To compile: javac Test.java && jar cvf Test.jar Test.class To reproduce the bug: spark-submit --class Test Test.jar No bug if the workaround is enabled: spark-submit --class Test Test.jar workaround -No bug either if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key)- was (Author: JIRAUSER301473): Here is the [gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that reproduces the issue. To compile: javac Test.java && jar cvf Test.jar Test.class To reproduce the bug: spark-submit --class Test Test.jar No bug if the workaround is enabled: spark-submit --class Test Test.jar workaround No bug either if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key) > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected.
> {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
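The invariant the reporter expects can be restated in plain Python: a writer that appends globally sorted rows to their partition buckets in encounter order leaves every bucket sorted. The toy model below is illustrative only (the function and names are invented, and this is not Spark's actual writer); the report is that Spark's writer violates this invariant unless an identity map() barrier is inserted before write().

```python
from collections import defaultdict

def sort_then_partitioned_write(rows, sort_key, partition_key):
    """Toy model of dataset.sort(...).write().partitionBy(...): rows are
    globally sorted, then appended to their partition's bucket in encounter
    order, so each bucket stays sorted by the sort key."""
    ordered = sorted(rows, key=sort_key)          # dataset.sort("_1")
    buckets = defaultdict(list)
    for row in ordered:                           # an order-preserving writer
        buckets[partition_key(row)].append(row)   # partitionBy("_2")
    return dict(buckets)

rows = [(3, "a", "x"), (1, "b", "y"), (2, "a", "z"), (0, "b", "w")]
out = sort_then_partitioned_write(rows, sort_key=lambda r: r[0],
                                  partition_key=lambda r: r[1])
# every partition's rows remain sorted by _1
assert all(b == sorted(b, key=lambda r: r[0]) for b in out.values())
```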
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746483#comment-17746483 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 2:38 PM: After further testing by running on my production data, I found that disabling AQE actually still does not produce a sorted result. was (Author: JIRAUSER301473): After further testing, disabling AQE actually still does not produce a sorted result.
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Summary: dataset.sort.select.write.partitionBy does not return a sorted output (was: dataset.sort.select.write.partitionBy does not return a sorted output if AQE is enabled)
[jira] [Commented] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output if AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746483#comment-17746483 ] Yiu-Chung Lee commented on SPARK-44512: --- After further testing, disabling AQE actually still does not produce a sorted result.
[jira] [Created] (SPARK-44527) Simplify BinaryComparison if its children contain ScalarSubquery with empty output
Yuming Wang created SPARK-44527: --- Summary: Simplify BinaryComparison if its children contain ScalarSubquery with empty output Key: SPARK-44527 URL: https://issues.apache.org/jira/browse/SPARK-44527 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
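The ticket above has no description, so the exact rule is not specified; one plausible reading (stated here as an assumption, with invented class names rather than Spark's Catalyst API) is that a scalar subquery known to return no rows evaluates to SQL NULL, and any binary comparison against NULL is itself NULL, so the comparison can be constant-folded. A toy sketch of that rewrite:

```python
# Toy expression tree; class and field names are illustrative, not Spark's.
class ScalarSubquery:
    def __init__(self, rows):
        self.rows = rows            # rows the subquery is known to produce

    def is_provably_empty(self):
        return len(self.rows) == 0

class GreaterThan:                  # one example of a BinaryComparison
    def __init__(self, left, right):
        self.left, self.right = left, right

NULL = None                         # stand-in for SQL NULL

def simplify_comparison(expr):
    """Fold a comparison to NULL when either side is a scalar subquery that
    provably produces no rows (a WHERE clause then treats NULL as false)."""
    if isinstance(expr, GreaterThan):
        for side in (expr.left, expr.right):
            if isinstance(side, ScalarSubquery) and side.is_provably_empty():
                return NULL
    return expr

assert simplify_comparison(GreaterThan(ScalarSubquery([]), 5)) is NULL
nonempty = GreaterThan(ScalarSubquery([42]), 5)
assert simplify_comparison(nonempty) is nonempty   # left untouched
```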
[jira] [Updated] (SPARK-44513) Upgrade snappy-java to 1.1.10.3
[ https://issues.apache.org/jira/browse/SPARK-44513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-44513: - Affects Version/s: 3.4.1 (was: 4.0.0) > Upgrade snappy-java to 1.1.10.3 > --- > > Key: SPARK-44513 > URL: https://issues.apache.org/jira/browse/SPARK-44513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44513) Upgrade snappy-java to 1.1.10.3
[ https://issues.apache.org/jira/browse/SPARK-44513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-44513: - Fix Version/s: (was: 4.0.0) > Upgrade snappy-java to 1.1.10.3 > --- > > Key: SPARK-44513 > URL: https://issues.apache.org/jira/browse/SPARK-44513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone
[ https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-44526: --- Description: Hi, This ticket is meant to understand the work that would be involved in porting the k8s PVC reuse feature onto the spark standalone cluster manager which reuses the shuffle files present locally in the disk We are a heavy user of spot instances and we suffer from spot terminations impacting our long running jobs The logic in `KubernetesLocalDiskShuffleDataIO` itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. We'll be more than happy to contribute was: Hi, This ticket is meant to understand the work that would be involved in porting the k8s PVC reuse feature onto the spark standalone cluster manager which reuses the shuffle files present locally in the disk We are a heavy user of spot instances and we suffer from spot terminations impacting our long running jobs The logic in KubernetesLocalDiskShuffleDataIO itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. 
We'll be more than happy to contribute
[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone
[ https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-44526: --- Description: Hi, This ticket is meant to understand the work that would be involved in porting the k8s PVC reuse feature onto the spark standalone cluster manager which reuses the shuffle files present locally in the disk We are a heavy user of spot instances and we suffer from spot terminations impacting our long running jobs The logic in `KubernetesLocalDiskShuffleExecutorComponents` itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to recovering shuffle files I'd like to understand what will be the work involved for this. We'll be more than happy to contribute was: Hi, This ticket is meant to understand the work that would be involved in porting the k8s PVC reuse feature onto the spark standalone cluster manager which reuses the shuffle files present locally in the disk We are a heavy user of spot instances and we suffer from spot terminations impacting our long running jobs The logic in `KubernetesLocalDiskShuffleDataIO` itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. 
We'll be more than happy to contribute
[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone
[ https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-44526: --- Description: Hi, This ticket is meant to understand the work that would be involved in porting the k8s PVC reuse feature onto the spark standalone cluster manager which reuses the shuffle files present locally in the disk We are a heavy user of spot instances and we suffer from spot terminations impacting our long running jobs The logic in KubernetesLocalDiskShuffleDataIO itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. We'll be more than happy to contribute was: Hi, This ticket is meant to understand the work that would be involved in porting the PVC reuse feature onto the spark standalone cluster manager The logic in KubernetesLocalDiskShuffleDataIO itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. We'll be more than happy to contribute
[jira] [Created] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone
Faiz Halde created SPARK-44526: -- Summary: Porting k8s PVC reuse logic to spark standalone Key: SPARK-44526 URL: https://issues.apache.org/jira/browse/SPARK-44526 Project: Spark Issue Type: New Feature Components: Shuffle, Spark Core Affects Versions: 3.4.1 Reporter: Faiz Halde Hi, This ticket is meant to understand the work that would be involved in porting the PVC reuse feature onto the spark standalone cluster manager The logic in KubernetesLocalDiskShuffleDataIO itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. We'll be more than happy to contribute -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44525) Improve error message when Invoke method is not found
Cheng Pan created SPARK-44525: - Summary: Improve error message when Invoke method is not found Key: SPARK-44525 URL: https://issues.apache.org/jira/browse/SPARK-44525 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.2 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746436#comment-17746436 ] Stijn De Haes commented on SPARK-44152: --- We are seeing this issue too; the problem seems to be this PR: [https://github.com/apache/spark/pull/37417] When building an image we copy the jars into the workdir location; however, now when the job is running, Spark removes everything in that workdir location, resulting in this error. I am not sure how to continue; what would be the best location to copy the assembly jar? > Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2 Recently there were some vulnerabilities in spark 3.3.2 > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0 > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way?
[jira] [Resolved] (SPARK-44519) SparkConnectServerUtils generated incorrect parameters for jars
[ https://issues.apache.org/jira/browse/SPARK-44519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44519. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42121 [https://github.com/apache/spark/pull/42121] > SparkConnectServerUtils generated incorrect parameters for jars > --- > > Key: SPARK-44519 > URL: https://issues.apache.org/jira/browse/SPARK-44519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > SparkConnectServerUtils generates multiple --jars parameters, which causes a > bug where the class cannot be found.
[jira] [Assigned] (SPARK-44519) SparkConnectServerUtils generated incorrect parameters for jars
[ https://issues.apache.org/jira/browse/SPARK-44519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44519: Assignee: jiaan.geng
[jira] [Assigned] (SPARK-44521) `SparkConnectServiceSuite` has directory residue after testing
[ https://issues.apache.org/jira/browse/SPARK-44521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44521: Assignee: Yang Jie > `SparkConnectServiceSuite` has directory residue after testing > -- > > Key: SPARK-44521 > URL: https://issues.apache.org/jira/browse/SPARK-44521 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > run > > > {code:java} > build/sbt "connect/testOnly > org.apache.spark.sql.connect.planner.SparkConnectServiceSuite" > git status {code} > > There are residual directories as follows > > {code:java} > connector/connect/server/282ce745-440f-44ac-9f43-4fad70d89a44/ > connector/connect/server/my/ {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44521) `SparkConnectServiceSuite` has directory residue after testing
[ https://issues.apache.org/jira/browse/SPARK-44521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44521. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42122 [https://github.com/apache/spark/pull/42122] > `SparkConnectServiceSuite` has directory residue after testing > -- > > Key: SPARK-44521 > URL: https://issues.apache.org/jira/browse/SPARK-44521 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0, 4.0.0 > > > run > > > {code:java} > build/sbt "connect/testOnly > org.apache.spark.sql.connect.planner.SparkConnectServiceSuite" > git status {code} > > There are residual directories as follows > > {code:java} > connector/connect/server/282ce745-440f-44ac-9f43-4fad70d89a44/ > connector/connect/server/my/ {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43831) Build and Run Spark on Java 21
[ https://issues.apache.org/jira/browse/SPARK-43831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746304#comment-17746304 ] Dongjoon Hyun commented on SPARK-43831: --- According to the assessment result (up to now), I switched the Target Version from 3.5.0 to 4.0.0 because we need the next version of Apache Arrow dependency. > Build and Run Spark on Java 21 > -- > > Key: SPARK-43831 > URL: https://issues.apache.org/jira/browse/SPARK-43831 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Major > > - [https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html] > ||JDK version||Minimum Scala versions|| > |21 (ea)|3.3.1 (soon), 2.13.11, 2.12.18| > |20|3.3.0, 2.13.11, 2.12.18| > |19|3.2.0, 2.13.9, 2.12.16| > |18|3.1.3, 2.13.7, 2.12.15| > |17 (LTS)|3.0.0, 2.13.6, 2.12.15| > |11 (LTS)|3.0.0, 2.13.0, 2.12.4, 2.11.12| > |8 (LTS)|3.0.0, 2.13.0, 2.12.0, 2.11.0| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43831) Build and Run Spark on Java 21
[ https://issues.apache.org/jira/browse/SPARK-43831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43831: -- Target Version/s: 4.0.0 (was: 3.5.0) > Build and Run Spark on Java 21 > -- > > Key: SPARK-43831 > URL: https://issues.apache.org/jira/browse/SPARK-43831 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Major > > - [https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html] > ||JDK version||Minimum Scala versions|| > |21 (ea)|3.3.1 (soon), 2.13.11, 2.12.18| > |20|3.3.0, 2.13.11, 2.12.18| > |19|3.2.0, 2.13.9, 2.12.16| > |18|3.1.3, 2.13.7, 2.12.15| > |17 (LTS)|3.0.0, 2.13.6, 2.12.15| > |11 (LTS)|3.0.0, 2.13.0, 2.12.4, 2.11.12| > |8 (LTS)|3.0.0, 2.13.0, 2.12.0, 2.11.0| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44524) Add a new test group for pyspark-pandas-slow-connect module
[ https://issues.apache.org/jira/browse/SPARK-44524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746296#comment-17746296 ] ASF GitHub Bot commented on SPARK-44524: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/42115 > Add a new test group for pyspark-pandas-slow-connect module > > > Key: SPARK-44524 > URL: https://issues.apache.org/jira/browse/SPARK-44524 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44524) Add a new test group for pyspark-pandas-slow-connect module
[ https://issues.apache.org/jira/browse/SPARK-44524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746294#comment-17746294 ] ASF GitHub Bot commented on SPARK-44524: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/42115 > Add a new test group for pyspark-pandas-slow-connect module > > > Key: SPARK-44524 > URL: https://issues.apache.org/jira/browse/SPARK-44524 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44523) Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44523: Summary: Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral (was: Filter's maxRows should be 0 if condition is FalseLiteral) > Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral > -- > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746290#comment-17746290 ] Yuming Wang commented on SPARK-44523: - https://github.com/apache/spark/pull/42126 > Filter's maxRows should be 0 if condition is FalseLiteral > - > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746288#comment-17746288 ] ASF GitHub Bot commented on SPARK-44523: User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/42126 > Filter's maxRows should be 0 if condition is FalseLiteral > - > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44524) Add a new test group for pyspark-pandas-slow-connect module
BingKun Pan created SPARK-44524: --- Summary: Add a new test group for pyspark-pandas-slow-connect module Key: SPARK-44524 URL: https://issues.apache.org/jira/browse/SPARK-44524 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746286#comment-17746286 ] ASF GitHub Bot commented on SPARK-44523: User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/42126 > Filter's maxRows should be 0 if condition is FalseLiteral > - > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
Yuming Wang created SPARK-44523: --- Summary: Filter's maxRows should be 0 if condition is FalseLiteral Key: SPARK-44523 URL: https://issues.apache.org/jira/browse/SPARK-44523 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
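To illustrate what this ticket's title (later refined to cover maxRowsPerPartition as well) describes, here is a minimal sketch with invented class names, not Spark's actual Catalyst API: a Filter whose condition is a literal false can never emit a row, so its row-count upper bound is 0 rather than its child's bound, which in turn lets an optimizer replace the subtree with an empty relation.

```python
# Toy logical-plan nodes; illustrative only, not Spark's actual classes.
class Relation:
    def __init__(self, max_rows):
        self.max_rows = max_rows

class Filter:
    def __init__(self, condition, child):
        self.condition, self.child = condition, child

    @property
    def max_rows(self):
        # A filter whose condition is literally false emits no rows at all,
        # so its upper bound is 0 instead of the child's bound.
        if self.condition is False:
            return 0
        return self.child.max_rows

def optimize(plan):
    """Replace any subtree that provably produces no rows with an empty relation."""
    if plan.max_rows == 0:
        return Relation(max_rows=0)
    return plan

assert Filter(False, Relation(1000)).max_rows == 0
assert Filter(True, Relation(1000)).max_rows == 1000
assert optimize(Filter(False, Relation(1000))).max_rows == 0
```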
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output if AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Description: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found that when AQE is enabled, the following code does not produce sorted output (.drop() also has the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem. was: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found that when AQE is enabled, the following code does not produce sorted output {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem.
> dataset.sort.select.write.partitionBy does not return a sorted output if AQE > is enabled > --- > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output")}} > Below is the complete code that reproduces the problem.
[jira] [Commented] (SPARK-44509) Fine grained interrupt in Python Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-44509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746270#comment-17746270 ] GridGain Integration commented on SPARK-44509: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/42120 > Fine grained interrupt in Python Spark Connect > -- > > Key: SPARK-44509 > URL: https://issues.apache.org/jira/browse/SPARK-44509 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Priority: Major > > Same as https://issues.apache.org/jira/browse/SPARK-44422 but need it for > Python >
[jira] [Resolved] (SPARK-44371) Define the computing logic through PartitionEvaluator API and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec
[ https://issues.apache.org/jira/browse/SPARK-44371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng resolved SPARK-44371. Resolution: Won't Fix > Define the computing logic through PartitionEvaluator API and use it in > CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec > - > > Key: SPARK-44371 > URL: https://issues.apache.org/jira/browse/SPARK-44371 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major >
[jira] (SPARK-44371) Define the computing logic through PartitionEvaluator API and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec
[ https://issues.apache.org/jira/browse/SPARK-44371 ] jiaan.geng deleted comment on SPARK-44371: was (Author: beliefer): I'm working on. > Define the computing logic through PartitionEvaluator API and use it in > CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec > - > > Key: SPARK-44371 > URL: https://issues.apache.org/jira/browse/SPARK-44371 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major >
[jira] [Commented] (SPARK-44371) Define the computing logic through PartitionEvaluator API and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec
[ https://issues.apache.org/jira/browse/SPARK-44371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746248#comment-17746248 ] jiaan.geng commented on SPARK-44371: [~cloud_fan] and I discussed offline; we don't need to make this change. > Define the computing logic through PartitionEvaluator API and use it in > CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec > - > > Key: SPARK-44371 > URL: https://issues.apache.org/jira/browse/SPARK-44371 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major >
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output if AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Attachment: (was: Test.java) > dataset.sort.select.write.partitionBy does not return a sorted output if AQE > is enabled > --- > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output")}} > Below is the complete code that reproduces the problem.
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output if AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746025#comment-17746025 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 7:12 AM: Here is the [gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that reproduces the issue. To compile: javac Test.java && jar cvf Test.jar Test.class To reproduce the bug: spark-submit --class Test Test.jar No bug if the workaround is enabled: spark-submit --class Test Test.jar workaround Also no bug if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key) was (Author: JIRAUSER301473): [^Test.java] (Attached the code) To compile: javac Test.java && jar cvf Test.jar Test.class bug reproduce: spark-submit --class Test Test.jar no bug if workaround is enabled: spark-submit --class Test Test.jar workaround no bug too if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key) > dataset.sort.select.write.partitionBy does not return a sorted output if AQE > is enabled > --- > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. 
> {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output")}} > Below is the complete code that reproduces the problem.