[jira] [Updated] (SPARK-44541) Remove useless function `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker`
[ https://issues.apache.org/jira/browse/SPARK-44541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44541: - Summary: Remove useless function `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker` (was: Remove useless funciton `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker`) > Remove useless function `hasRangeExprAgainstEventTimeCol` from > `UnsupportedOperationChecker` > > > Key: SPARK-44541 > URL: https://issues.apache.org/jira/browse/SPARK-44541 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > > Function `hasRangeExprAgainstEventTimeCol` was introduced by SPARK-40940 and > is no longer used after SPARK-42376 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44541) Remove useless funciton `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker`
Yang Jie created SPARK-44541: Summary: Remove useless funciton `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker` Key: SPARK-44541 URL: https://issues.apache.org/jira/browse/SPARK-44541 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yang Jie Function `hasRangeExprAgainstEventTimeCol` was introduced by SPARK-40940 and is no longer used after SPARK-42376
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 5:41 AM: After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems to produce a sorted output as expected, even without the workaround. (And I found that this option cannot be set in code directly. It must be set in spark-submit. This config option is also undocumented.) was (Author: JIRAUSER301473): After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems to produce a sorted output. (And I found that this option cannot be set in code directly. It must be set in spark-submit. This config option is also undocumented) > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. 
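The comment above notes that spark.sql.optimizer.plannedWrite.enabled cannot be set in code and must be passed at submit time. A hypothetical spark-submit invocation might look like the following config fragment; only the --conf line comes from the comment, while the master, class name, and jar are placeholder values:

```shell
# Placeholder invocation: master/class/jar are illustrative, not from the issue.
spark-submit \
  --master local[4] \
  --class com.example.PartitionBySortRepro \
  --conf spark.sql.optimizer.plannedWrite.enabled=false \
  repro.jar
```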
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Description: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem. was: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found then when AQE is enabled, the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem. 
> dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44540) Remove unused stylesheet and javascript files of jsonFormatter
[ https://issues.apache.org/jira/browse/SPARK-44540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746781#comment-17746781 ] ci-cassandra.apache.org commented on SPARK-44540: - User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/42145 > Remove unused stylesheet and javascript files of jsonFormatter > -- > > Key: SPARK-44540 > URL: https://issues.apache.org/jira/browse/SPARK-44540 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.5.0 >Reporter: Kent Yao >Priority: Major > > jsonFormatter.min.css and jsonFormatter.min.js are unreachable
[jira] [Commented] (SPARK-44454) HiveShim getTablesByType support fallback
[ https://issues.apache.org/jira/browse/SPARK-44454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746751#comment-17746751 ] ci-cassandra.apache.org commented on SPARK-44454: - User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/42033 > HiveShim getTablesByType support fallback > - > > Key: SPARK-44454 > URL: https://issues.apache.org/jira/browse/SPARK-44454 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: dzcxzl >Priority: Minor > > When we use a high version of Hive Client to communicate with a low version > of Hive meta store, we may encounter Invalid method name: > 'get_tables_by_type'. > > {code:java} > 23/07/17 12:45:24,391 [main] DEBUG SparkSqlParser: Parsing command: show views > 23/07/17 12:45:24,489 [main] ERROR log: Got exception: > org.apache.thrift.TApplicationException Invalid method name: > 'get_tables_by_type' > org.apache.thrift.TApplicationException: Invalid method name: > 'get_tables_by_type' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_tables_by_type(ThriftHiveMetastore.java:1433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_tables_by_type(ThriftHiveMetastore.java:1418) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTables(HiveMetaStoreClient.java:1411) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2344) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByType(Hive.java:1427) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.sql.hive.client.Shim_v2_3.getTablesByType(HiveShim.scala:1408) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274) > at > org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:785) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:895) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:108) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listViews(HiveExternalCatalog.scala:893) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listViews(ExternalCatalogWithListener.scala:158) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listViews(SessionCatalog.scala:1040) > at > 
org.apache.spark.sql.execution.command.ShowViewsCommand.$anonfun$run$5(views.scala:407) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.command.ShowViewsCommand.run(views.scala:407) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
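The fallback requested above can be illustrated with a small sketch. This is not Spark's actual HiveShim code; the client class, method names, and error string are modeled on the Thrift error in the stack trace (`Invalid method name: 'get_tables_by_type'`), and the fallback simply lists all tables and filters them by type client-side:

```python
# Illustrative sketch of the SPARK-44454 idea (not Spark's actual code):
# if the old metastore does not implement get_tables_by_type, catch the
# "Invalid method name" error and emulate the call on the client side.

class OldMetastoreClient:
    """Mimics a low-version Hive metastore with no get_tables_by_type RPC."""

    def __init__(self, tables):
        self._tables = tables  # list of (name, table_type) pairs

    def get_tables_by_type(self, db, pattern, table_type):
        # Old servers reject the newer RPC with a Thrift-style error.
        raise RuntimeError("Invalid method name: 'get_tables_by_type'")

    def get_tables(self, db, pattern):
        return [name for name, _ in self._tables]

    def get_table_type(self, db, name):
        return dict(self._tables)[name]


def get_tables_by_type_with_fallback(client, db, pattern, table_type):
    try:
        return client.get_tables_by_type(db, pattern, table_type)
    except RuntimeError as e:
        if "Invalid method name" not in str(e):
            raise  # unrelated failure: propagate
        # Fallback path: list every table, then filter by type.
        return [n for n in client.get_tables(db, pattern)
                if client.get_table_type(db, n) == table_type]


client = OldMetastoreClient([("v1", "VIRTUAL_VIEW"), ("t1", "MANAGED_TABLE")])
print(get_tables_by_type_with_fallback(client, "default", "*", "VIRTUAL_VIEW"))  # ['v1']
```

The fallback trades one round trip for several, so real code would only take it after the first failure and could cache that the server lacks the RPC.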
[jira] [Resolved] (SPARK-44523) Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44523. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42126 [https://github.com/apache/spark/pull/42126] > Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral > -- > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
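The improvement above can be sketched with a toy model. This is not Spark's Catalyst code; the node classes are illustrative, but they show the bound being tightened: a Filter whose condition is the literal false can never emit rows, so it can report maxRows = 0 instead of inheriting the child's bound:

```python
# Toy model of the SPARK-44523 idea (not Spark's actual Catalyst classes):
# a Filter over a literal-false condition produces no rows, so its
# maxRows/maxRowsPerPartition bound can be 0 rather than the child's bound.

from dataclasses import dataclass
from typing import Optional

FALSE_LITERAL = ("lit", False)  # stand-in for Catalyst's FalseLiteral


@dataclass
class Relation:
    row_count: int

    def max_rows(self) -> Optional[int]:
        return self.row_count


@dataclass
class Filter:
    condition: tuple
    child: Relation

    def max_rows(self) -> Optional[int]:
        if self.condition == FALSE_LITERAL:
            return 0  # WHERE false: no row can pass
        return self.child.max_rows()  # a filter never adds rows


print(Filter(FALSE_LITERAL, Relation(1000)).max_rows())   # 0
print(Filter(("gt", "x", 1), Relation(1000)).max_rows())  # 1000
```

A tighter bound lets downstream rules (e.g. eliminating limits or sorts over provably empty plans) fire more often.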
[jira] [Assigned] (SPARK-44523) Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44523: Assignee: Yuming Wang > Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral > -- > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44540) Remove unused stylesheet and javascript files of jsonFormatter
Kent Yao created SPARK-44540: Summary: Remove unused stylesheet and javascript files of jsonFormatter Key: SPARK-44540 URL: https://issues.apache.org/jira/browse/SPARK-44540 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.5.0 Reporter: Kent Yao jsonFormatter.min.css and jsonFormatter.min.js are unreachable
[jira] [Resolved] (SPARK-44466) Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44466. - Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42049 [https://github.com/apache/spark/pull/42049] > Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX > from modifiedConfigs > > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.5.0, 4.0.0 > > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
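The exclusion described above amounts to a prefix filter over the config map. The sketch below is illustrative, not Spark's code, and it assumes the two prefixes resolve to "spark.driver." and "spark.executor." (the actual constants live in Spark's internal config package):

```python
# Illustrative sketch of the SPARK-44466 fix (not Spark's actual code):
# drop entries whose keys start with the driver/executor prefixes before
# reporting modified configs. Prefix values below are assumptions.

SPARK_DRIVER_PREFIX = "spark.driver."      # assumed value
SPARK_EXECUTOR_PREFIX = "spark.executor."  # assumed value


def modified_configs(configs: dict) -> dict:
    """Return configs minus driver/executor-prefixed entries."""
    return {k: v for k, v in configs.items()
            if not k.startswith((SPARK_DRIVER_PREFIX, SPARK_EXECUTOR_PREFIX))}


confs = {
    "spark.sql.shuffle.partitions": "100",
    "spark.driver.extraJavaOptions": "-Xmx1g",
    "spark.executor.memory": "4g",
}
print(modified_configs(confs))  # {'spark.sql.shuffle.partitions': '100'}
```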
[jira] [Assigned] (SPARK-44466) Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44466: --- Assignee: Yuming Wang > Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX > from modifiedConfigs > > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44509) Fine grained interrupt in Python Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-44509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44509. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42120 [https://github.com/apache/spark/pull/42120] > Fine grained interrupt in Python Spark Connect > -- > > Key: SPARK-44509 > URL: https://issues.apache.org/jira/browse/SPARK-44509 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Same as https://issues.apache.org/jira/browse/SPARK-44422 but need it for > Python > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44509) Fine grained interrupt in Python Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-44509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44509: Assignee: Hyukjin Kwon > Fine grained interrupt in Python Spark Connect > -- > > Key: SPARK-44509 > URL: https://issues.apache.org/jira/browse/SPARK-44509 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Same as https://issues.apache.org/jira/browse/SPARK-44422 but need it for > Python > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44450) Make direct Arrow encoding work with SQL/API
[ https://issues.apache.org/jira/browse/SPARK-44450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-44450: - Assignee: Herman van Hövell > Make direct Arrow encoding work with SQL/API > > > Key: SPARK-44450 > URL: https://issues.apache.org/jira/browse/SPARK-44450 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Component/s: Optimizer > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 1:33 AM: After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. (And I found that this option cannot be set in code directly. It must be set in spark-submit. This config option is also undocumented) was (Author: JIRAUSER301473): After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. (And I found that this option cannot be set in code directly. It must be set in spark-submit) > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44539) Upgrade RoaringBitmap to 0.9.46
BingKun Pan created SPARK-44539: --- Summary: Upgrade RoaringBitmap to 0.9.46 Key: SPARK-44539 URL: https://issues.apache.org/jira/browse/SPARK-44539 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44538) Remove ToJsonUtil
Herman van Hövell created SPARK-44538: - Summary: Remove ToJsonUtil Key: SPARK-44538 URL: https://issues.apache.org/jira/browse/SPARK-44538 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.4.1 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 1:06 AM: After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. (And I found that this option cannot be set in code directly. It must be set in spark-submit) was (Author: JIRAUSER301473): After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 1:05 AM: After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. was (Author: JIRAUSER301473): -After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output.- > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512 ] Yiu-Chung Lee deleted comment on SPARK-44512: --- was (Author: JIRAUSER301473): No. After testing another production data, spark.sql.optimizer.plannedWrite.enabled=false does not solve the problem either. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 1:03 AM: -After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output.- was (Author: JIRAUSER301473): After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems produces a sorted output. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746715#comment-17746715 ] Yiu-Chung Lee commented on SPARK-44512: --- No. After testing another production data, spark.sql.optimizer.plannedWrite.enabled=false does not solve the problem either. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found then when AQE is enabled, the following code does not produce sorted > output (.drop() also have the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44449) Add upcasting to Arrow deserializers
[ https://issues.apache.org/jira/browse/SPARK-44449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44449. --- Fix Version/s: 3.5.0 Assignee: Herman van Hövell Resolution: Fixed > Add upcasting to Arrow deserializers > > > Key: SPARK-44449 > URL: https://issues.apache.org/jira/browse/SPARK-44449 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44537) Upgrade kubernetes-client to 6.8.0
BingKun Pan created SPARK-44537: --- Summary: Upgrade kubernetes-client to 6.8.0 Key: SPARK-44537 URL: https://issues.apache.org/jira/browse/SPARK-44537 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44486) Implement PyArrow `self_destruct` feature for `toPandas`
[ https://issues.apache.org/jira/browse/SPARK-44486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44486: Assignee: Xinrong Meng > Implement PyArrow `self_destruct` feature for `toPandas` > > > Key: SPARK-44486 > URL: https://issues.apache.org/jira/browse/SPARK-44486 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Implement PyArrow `self_destruct` feature for `toPandas` > To make the Spark configuration > `spark.sql.execution.arrow.pyspark.selfDestruct.enabled` be used to enable > PyArrow’s `self_destruct` feature in Spark Connect, which can save memory > when creating a Pandas DataFrame via `toPandas` by freeing Arrow-allocated > memory while building the Pandas DataFrame. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44486) Implement PyArrow `self_destruct` feature for `toPandas`
[ https://issues.apache.org/jira/browse/SPARK-44486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44486. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42079 [https://github.com/apache/spark/pull/42079] > Implement PyArrow `self_destruct` feature for `toPandas` > > > Key: SPARK-44486 > URL: https://issues.apache.org/jira/browse/SPARK-44486 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Implement PyArrow `self_destruct` feature for `toPandas` > To make the Spark configuration > `spark.sql.execution.arrow.pyspark.selfDestruct.enabled` be used to enable > PyArrow’s `self_destruct` feature in Spark Connect, which can save memory > when creating a Pandas DataFrame via `toPandas` by freeing Arrow-allocated > memory while building the Pandas DataFrame. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44536) Upgrade sbt to 1.9.3
BingKun Pan created SPARK-44536: --- Summary: Upgrade sbt to 1.9.3 Key: SPARK-44536 URL: https://issues.apache.org/jira/browse/SPARK-44536 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44535) Move Streaming API to sql/api
Herman van Hövell created SPARK-44535: - Summary: Move Streaming API to sql/api Key: SPARK-44535 URL: https://issues.apache.org/jira/browse/SPARK-44535 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.4.1 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44500) parse_url treats key as regular expression
[ https://issues.apache.org/jira/browse/SPARK-44500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746705#comment-17746705 ] Pablo Langa Blanco commented on SPARK-44500: [~jan.chou...@gmail.com] What do you think? > parse_url treats key as regular expression > -- > > Key: SPARK-44500 > URL: https://issues.apache.org/jira/browse/SPARK-44500 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.4.1 >Reporter: Robert Joseph Evans >Priority: Major > > To be clear I am not 100% sure that this is a bug. It might be a feature, but > I don't see anywhere that it is used as a feature. If it is a feature it > really should be documented, because there are pitfalls. If it is a bug it > should be fixed because it is really confusing and it is simple to shoot > yourself in the foot. > ```scala > > val urls = Seq("http://foo/bar?abc=BAD&a.c=GOOD";, > > "http://foo/bar?a.c=GOOD&abc=BAD";).toDF > > urls.selectExpr("parse_url(value, 'QUERY', 'a.c')").show(false) > ++ > |parse_url(value, QUERY, a.c)| > ++ > |BAD | > |GOOD| > ++ > > urls.selectExpr("parse_url(value, 'QUERY', 'a[c')").show(false) > java.util.regex.PatternSyntaxException: Unclosed character class near index 15 > (&|^)a[c=([^&]*) >^ > at java.util.regex.Pattern.error(Pattern.java:1969) > at java.util.regex.Pattern.clazz(Pattern.java:2562) > at java.util.regex.Pattern.sequence(Pattern.java:2077) > at java.util.regex.Pattern.expr(Pattern.java:2010) > at java.util.regex.Pattern.compile(Pattern.java:1702) > at java.util.regex.Pattern.(Pattern.java:1352) > at java.util.regex.Pattern.compile(Pattern.java:1028) > ``` > The simple fix is to quote the key when making the pattern. 
> ```scala > private def getPattern(key: UTF8String): Pattern = { > Pattern.compile(REGEXPREFIX + Pattern.quote(key.toString) + REGEXSUBFIX) > } > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
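The pitfall and the proposed fix are easy to demonstrate outside Spark. The sketch below is a minimal pure-Python model of the pattern construction described above; the `REGEX_PREFIX`/`REGEX_SUFFIX` constants and the `query_value` helper are hypothetical stand-ins for Spark's `REGEXPREFIX`/`REGEXSUBFIX` and `getPattern`, and `re.escape` plays the role of Java's `Pattern.quote`:

```python
import re

REGEX_PREFIX = "(&|^)"     # stand-in for Spark's REGEXPREFIX
REGEX_SUFFIX = "=([^&]*)"  # stand-in for Spark's REGEXSUBFIX

def query_value(query, key, quote_key):
    # Build the pattern the way the report describes: the key is spliced
    # directly into the regex, so regex metacharacters in it stay live.
    k = re.escape(key) if quote_key else key
    m = re.search(REGEX_PREFIX + k + REGEX_SUFFIX, query)
    return m.group(2) if m else None

# Unquoted, '.' in the key matches any character, so 'a.c' first matches 'abc'.
assert query_value("abc=BAD&a.c=GOOD", "a.c", quote_key=False) == "BAD"
# Quoting the key (the Pattern.quote analogue) matches only the literal key.
assert query_value("abc=BAD&a.c=GOOD", "a.c", quote_key=True) == "GOOD"
```

An unbalanced key such as `a[c` raises a pattern-compilation error in the unquoted variant, mirroring the `PatternSyntaxException` shown in the report, while the quoted variant accepts it as a literal.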
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 11:32 PM: - After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output. was (Author: JIRAUSER301473): Setting spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 11:32 PM: - After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false (while leaving AQE enabled) seems to produce a sorted output. was (Author: JIRAUSER301473): After reading SPARK-41914, I found that setting spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output. > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 11:29 PM: - Setting spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output. was (Author: JIRAUSER301473): Setting {{spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output.}} > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746697#comment-17746697 ] Yiu-Chung Lee commented on SPARK-44512: --- Setting {{spark.sql.optimizer.plannedWrite.enabled=false seems to produce a sorted output.}} > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44534) Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents
Dongjoon Hyun created SPARK-44534: - Summary: Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents Key: SPARK-44534 URL: https://issues.apache.org/jira/browse/SPARK-44534 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.5.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44503) Support PARTITION BY and ORDER BY clause for table arguments
[ https://issues.apache.org/jira/browse/SPARK-44503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44503. --- Fix Version/s: 4.0.0 Assignee: Daniel Resolution: Fixed Issue resolved by pull request 42100 https://github.com/apache/spark/pull/42100 > Support PARTITION BY and ORDER BY clause for table arguments > > > Key: SPARK-44503 > URL: https://issues.apache.org/jira/browse/SPARK-44503 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44455) SHOW CREATE TABLE does not quote identifiers with special characters
[ https://issues.apache.org/jira/browse/SPARK-44455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-44455: -- Assignee: Runyao.Chen > SHOW CREATE TABLE does not quote identifiers with special characters > > > Key: SPARK-44455 > URL: https://issues.apache.org/jira/browse/SPARK-44455 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: Runyao.Chen >Assignee: Runyao.Chen >Priority: Major > > Create a table with special characters: > ``` > CREATE CATALOG `a_catalog-with+special^chars`; CREATE SCHEMA > `a_catalog-with+special^chars`.`a_schema-with+special^chars`; CREATE TABLE > `a_catalog-with+special^chars`.`a_schema-with+special^chars`.`table1` ( id > int, feat1 varchar(255), CONSTRAINT pk PRIMARY KEY (id,feat1) ); > ``` > Then run SHOW CREATE TABLE: > ``` > SHOW CREATE TABLE > `a_catalog-with+special^chars`.`a_schema-with+special^chars`.`table1`; > ``` > The response is: > ``` > createtab_stmt "CREATE TABLE > a_catalog-with+special^chars.a_schema-with+special^chars.table1 ( id INT NOT > NULL, feat1 VARCHAR(255) NOT NULL, CONSTRAINT pk PRIMARY KEY (id, feat1)) > USING delta TBLPROPERTIES ( 'delta.minReaderVersion' = '1', > 'delta.minWriterVersion' = '2') " > ``` > As you can see, the table name in the response is not properly escaped with > backticks. As a result, if a user copies and pastes this create table command > to recreate the table, it will fail: > {{[INVALID_IDENTIFIER] The identifier a_catalog-with is invalid. Please, > consider quoting it with back-quotes as `a_catalog-with`.(line 1, pos 22)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44516) Spark Connect Python StreamingQueryListener removeListener method actually shut down the listener process
[ https://issues.apache.org/jira/browse/SPARK-44516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-44516: Summary: Spark Connect Python StreamingQueryListener removeListener method actually shut down the listener process (was: Spark Connect Python StreamingQueryListener removeListener method) > Spark Connect Python StreamingQueryListener removeListener method actually > shut down the listener process > - > > Key: SPARK-44516 > URL: https://issues.apache.org/jira/browse/SPARK-44516 > Project: Spark > Issue Type: New Feature > Components: Connect, Structured Streaming >Affects Versions: 3.5.0, 4.0.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44455) SHOW CREATE TABLE does not quote identifiers with special characters
[ https://issues.apache.org/jira/browse/SPARK-44455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-44455. Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 42034 [https://github.com/apache/spark/pull/42034] > SHOW CREATE TABLE does not quote identifiers with special characters > > > Key: SPARK-44455 > URL: https://issues.apache.org/jira/browse/SPARK-44455 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: Runyao.Chen >Assignee: Runyao.Chen >Priority: Major > Fix For: 3.5.0 > > > Create a table with special characters: > ``` > CREATE CATALOG `a_catalog-with+special^chars`; CREATE SCHEMA > `a_catalog-with+special^chars`.`a_schema-with+special^chars`; CREATE TABLE > `a_catalog-with+special^chars`.`a_schema-with+special^chars`.`table1` ( id > int, feat1 varchar(255), CONSTRAINT pk PRIMARY KEY (id,feat1) ); > ``` > Then run SHOW CREATE TABLE: > ``` > SHOW CREATE TABLE > `a_catalog-with+special^chars`.`a_schema-with+special^chars`.`table1`; > ``` > The response is: > ``` > createtab_stmt "CREATE TABLE > a_catalog-with+special^chars.a_schema-with+special^chars.table1 ( id INT NOT > NULL, feat1 VARCHAR(255) NOT NULL, CONSTRAINT pk PRIMARY KEY (id, feat1)) > USING delta TBLPROPERTIES ( 'delta.minReaderVersion' = '1', > 'delta.minWriterVersion' = '2') " > ``` > As you can see, the table name in the response is not properly escaped with > backticks. As a result, if a user copies and pastes this create table command > to recreate the table, it will fail: > {{[INVALID_IDENTIFIER] The identifier a_catalog-with is invalid. Please, > consider quoting it with back-quotes as `a_catalog-with`.(line 1, pos 22)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44533) Add support for accumulator, broadcast, and Spark files in Python UDTF's analyze.
Takuya Ueshin created SPARK-44533: - Summary: Add support for accumulator, broadcast, and Spark files in Python UDTF's analyze. Key: SPARK-44533 URL: https://issues.apache.org/jira/browse/SPARK-44533 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44532) Move ArrowUtil to sql/api
Herman van Hövell created SPARK-44532: - Summary: Move ArrowUtil to sql/api Key: SPARK-44532 URL: https://issues.apache.org/jira/browse/SPARK-44532 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.4.1 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44531) Move encoder inference to sql/api
Herman van Hövell created SPARK-44531: - Summary: Move encoder inference to sql/api Key: SPARK-44531 URL: https://issues.apache.org/jira/browse/SPARK-44531 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.4.1 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44530) Move SparkBuildInfo to common/util
Herman van Hövell created SPARK-44530: - Summary: Move SparkBuildInfo to common/util Key: SPARK-44530 URL: https://issues.apache.org/jira/browse/SPARK-44530 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.4.1 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44529) Add a flag to resolve docker tags to hashes at launch time
Holden Karau created SPARK-44529: Summary: Add a flag to resolve docker tags to hashes at launch time Key: SPARK-44529 URL: https://issues.apache.org/jira/browse/SPARK-44529 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.5.0, 4.0.0 Reporter: Holden Karau If you have a Spark docker tag (like say 3.3) you might want to update the container but only for newly launched jobs, not existing jobs. To allow this we can resolve the tag to a hash at launch time. In some environments this may also offer a small performance improvement, as it saves K8s from having to re-resolve the tag with additional executor launches. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes and check for it
[ https://issues.apache.org/jira/browse/SPARK-44528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Grund updated SPARK-44528: - Summary: Spark Connect DataFrame does not allow to add custom instance attributes and check for it (was: Spark Connect DataFrame does not allow to add custom instance attributes) > Spark Connect DataFrame does not allow to add custom instance attributes and > check for it > - > > Key: SPARK-44528 > URL: https://issues.apache.org/jira/browse/SPARK-44528 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.1 >Reporter: Martin Grund >Priority: Major > > ``` > df = spark.range(10) > df._test = 10 > ``` > Treats `df._test` like a column -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes and check for it
[ https://issues.apache.org/jira/browse/SPARK-44528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Grund updated SPARK-44528: - Description: ``` df = spark.range(10) df._test = 10 assert(hasattr(df, "_test")) assert(not hasattr(df, "_test_no")) ``` Treats `df._test` like a column was: ``` df = spark.range(10) df._test = 10 ``` Treats `df._test` like a column > Spark Connect DataFrame does not allow to add custom instance attributes and > check for it > - > > Key: SPARK-44528 > URL: https://issues.apache.org/jira/browse/SPARK-44528 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.1 >Reporter: Martin Grund >Priority: Major > > ``` > df = spark.range(10) > df._test = 10 > assert(hasattr(df, "_test")) > assert(not hasattr(df, "_test_no")) > ``` > Treats `df._test` like a column -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
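The failing `hasattr` check described above is characteristic of a class whose `__getattr__` resolves every unknown attribute name to a column object. The sketch below is a minimal, hypothetical pure-Python illustration of that mechanism only; the `ConnectLikeFrame` and `FakeColumn` names are invented for this sketch and are not Spark Connect's actual implementation:

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark column reference."""
    def __init__(self, name):
        self.name = name

class ConnectLikeFrame:
    """Illustration: __getattr__ turns every unknown attribute into a
    'column', so hasattr() can never return False for this object."""
    def __getattr__(self, name):
        # Real column names and typos alike end up here.
        return FakeColumn(name)

df = ConnectLikeFrame()
# hasattr is True even for attributes that were never set:
assert hasattr(df, "_test_no")          # a plain object would give False
assert isinstance(df._anything, FakeColumn)
```

With such a `__getattr__`, code that probes a DataFrame with `hasattr` to detect its own custom attributes cannot distinguish a set attribute from a never-set one, which matches the behavior the ticket reports.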
[jira] [Created] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes
Martin Grund created SPARK-44528: Summary: Spark Connect DataFrame does not allow to add custom instance attributes Key: SPARK-44528 URL: https://issues.apache.org/jira/browse/SPARK-44528 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.4.1 Reporter: Martin Grund ``` ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44528) Spark Connect DataFrame does not allow to add custom instance attributes
[ https://issues.apache.org/jira/browse/SPARK-44528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Grund updated SPARK-44528: - Description: ``` df = spark.range(10) df._test = 10 ``` Treats `df._test` like a column was: ``` ``` > Spark Connect DataFrame does not allow to add custom instance attributes > > > Key: SPARK-44528 > URL: https://issues.apache.org/jira/browse/SPARK-44528 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.1 >Reporter: Martin Grund >Priority: Major > > ``` > df = spark.range(10) > df._test = 10 > ``` > Treats `df._test` like a column -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746516#comment-17746516 ] Ramakrishna edited comment on SPARK-44152 at 7/24/23 4:06 PM: -- Hello [~sdehaes] It should work if you copy the jar to the /usr/local/bin folder of your docker container. It worked for us. was (Author: hande): Hello [~sdehaes] It should work if you copy the jar to the /usr/local/bin folder. It worked for us. > Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0. > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746516#comment-17746516 ] Ramakrishna commented on SPARK-44152: - Hello [~sdehaes] It should work if you copy the jar to the /usr/local/bin folder. It worked for us. > Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0. > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44527) Simplify BinaryComparison if its children contain ScalarSubquery with empty output
[ https://issues.apache.org/jira/browse/SPARK-44527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746488#comment-17746488 ] Yuming Wang commented on SPARK-44527: - https://github.com/apache/spark/pull/42129 > Simplify BinaryComparison if its children contain ScalarSubquery with empty > output > -- > > Key: SPARK-44527 > URL: https://issues.apache.org/jira/browse/SPARK-44527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746025#comment-17746025 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 2:38 PM: Here is the [gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that reproduces the issue. To compile: javac Test.java && jar cvf Test.jar Test.class To reproduce the bug: spark-submit --class Test Test.jar No bug if the workaround is enabled: spark-submit --class Test Test.jar workaround -No bug either if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key)- was (Author: JIRAUSER301473): Here is the [gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that reproduces the issue. To compile: javac Test.java && jar cvf Test.jar Test.class To reproduce the bug: spark-submit --class Test Test.jar No bug if the workaround is enabled: spark-submit --class Test Test.jar workaround No bug either if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key) > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected.
> {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{{}.{}}}{{{}partitionBy("_2"){}}} > {{.text("output")}} > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
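The invariant the reporter expects can be restated in plain Python: a writer that appends globally sorted rows to their partition buckets in encounter order leaves every bucket sorted. The toy model below is illustrative only (the function and names are invented, and this is not Spark's actual writer); the report is that Spark's writer violates this invariant unless an identity map() barrier is inserted before write().

```python
from collections import defaultdict

def sort_then_partitioned_write(rows, sort_key, partition_key):
    """Toy model of dataset.sort(...).write().partitionBy(...): rows are
    globally sorted, then appended to their partition's bucket in encounter
    order, so each bucket stays sorted by the sort key."""
    ordered = sorted(rows, key=sort_key)          # dataset.sort("_1")
    buckets = defaultdict(list)
    for row in ordered:                           # an order-preserving writer
        buckets[partition_key(row)].append(row)   # partitionBy("_2")
    return dict(buckets)

rows = [(3, "a", "x"), (1, "b", "y"), (2, "a", "z"), (0, "b", "w")]
out = sort_then_partitioned_write(rows, sort_key=lambda r: r[0],
                                  partition_key=lambda r: r[1])
# every partition's rows remain sorted by _1
assert all(b == sorted(b, key=lambda r: r[0]) for b in out.values())
```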
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746483#comment-17746483 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 2:38 PM: After further testing by running on my production data, I found that disabling AQE actually still does not produce a sorted result. was (Author: JIRAUSER301473): After further testing, disabling AQE actually still does not produce a sorted result.
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Summary: dataset.sort.select.write.partitionBy does not return a sorted output (was: dataset.sort.select.write.partitionBy does not return a sorted output if AQE is enabled)
[jira] [Commented] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output if AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746483#comment-17746483 ] Yiu-Chung Lee commented on SPARK-44512: --- After further testing, disabling AQE actually still does not produce a sorted result.
[jira] [Created] (SPARK-44527) Simplify BinaryComparison if its children contain ScalarSubquery with empty output
Yuming Wang created SPARK-44527: --- Summary: Simplify BinaryComparison if its children contain ScalarSubquery with empty output Key: SPARK-44527 URL: https://issues.apache.org/jira/browse/SPARK-44527 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
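The ticket above has no description, so the exact rule is not specified; one plausible reading (stated here as an assumption, with invented class names rather than Spark's Catalyst API) is that a scalar subquery known to return no rows evaluates to SQL NULL, and any binary comparison against NULL is itself NULL, so the comparison can be constant-folded. A toy sketch of that rewrite:

```python
# Toy expression tree; class and field names are illustrative, not Spark's.
class ScalarSubquery:
    def __init__(self, rows):
        self.rows = rows            # rows the subquery is known to produce

    def is_provably_empty(self):
        return len(self.rows) == 0

class GreaterThan:                  # one example of a BinaryComparison
    def __init__(self, left, right):
        self.left, self.right = left, right

NULL = None                         # stand-in for SQL NULL

def simplify_comparison(expr):
    """Fold a comparison to NULL when either side is a scalar subquery that
    provably produces no rows (a WHERE clause then treats NULL as false)."""
    if isinstance(expr, GreaterThan):
        for side in (expr.left, expr.right):
            if isinstance(side, ScalarSubquery) and side.is_provably_empty():
                return NULL
    return expr

assert simplify_comparison(GreaterThan(ScalarSubquery([]), 5)) is NULL
nonempty = GreaterThan(ScalarSubquery([42]), 5)
assert simplify_comparison(nonempty) is nonempty   # left untouched
```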
[jira] [Updated] (SPARK-44513) Upgrade snappy-java to 1.1.10.3
[ https://issues.apache.org/jira/browse/SPARK-44513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-44513: - Affects Version/s: 3.4.1 (was: 4.0.0) > Upgrade snappy-java to 1.1.10.3 > --- > > Key: SPARK-44513 > URL: https://issues.apache.org/jira/browse/SPARK-44513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44513) Upgrade snappy-java to 1.1.10.3
[ https://issues.apache.org/jira/browse/SPARK-44513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-44513: - Fix Version/s: (was: 4.0.0) > Upgrade snappy-java to 1.1.10.3 > --- > > Key: SPARK-44513 > URL: https://issues.apache.org/jira/browse/SPARK-44513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone
[ https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-44526: --- Description: Hi, This ticket is meant to understand the work that would be involved in porting the k8s PVC reuse feature onto the spark standalone cluster manager which reuses the shuffle files present locally in the disk We are a heavy user of spot instances and we suffer from spot terminations impacting our long running jobs The logic in `KubernetesLocalDiskShuffleDataIO` itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. We'll be more than happy to contribute was: Hi, This ticket is meant to understand the work that would be involved in porting the k8s PVC reuse feature onto the spark standalone cluster manager which reuses the shuffle files present locally in the disk We are a heavy user of spot instances and we suffer from spot terminations impacting our long running jobs The logic in KubernetesLocalDiskShuffleDataIO itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. 
We'll be more than happy to contribute
[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone
[ https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-44526: --- Description: Hi, This ticket is meant to understand the work that would be involved in porting the k8s PVC reuse feature onto the spark standalone cluster manager which reuses the shuffle files present locally in the disk We are a heavy user of spot instances and we suffer from spot terminations impacting our long running jobs The logic in `KubernetesLocalDiskShuffleExecutorComponents` itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to recovering shuffle files I'd like to understand what will be the work involved for this. We'll be more than happy to contribute was: Hi, This ticket is meant to understand the work that would be involved in porting the k8s PVC reuse feature onto the spark standalone cluster manager which reuses the shuffle files present locally in the disk We are a heavy user of spot instances and we suffer from spot terminations impacting our long running jobs The logic in `KubernetesLocalDiskShuffleDataIO` itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. 
We'll be more than happy to contribute
[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone
[ https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-44526: --- Description: Hi, This ticket is meant to understand the work that would be involved in porting the k8s PVC reuse feature onto the spark standalone cluster manager which reuses the shuffle files present locally in the disk We are a heavy user of spot instances and we suffer from spot terminations impacting our long running jobs The logic in KubernetesLocalDiskShuffleDataIO itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. We'll be more than happy to contribute was: Hi, This ticket is meant to understand the work that would be involved in porting the PVC reuse feature onto the spark standalone cluster manager The logic in KubernetesLocalDiskShuffleDataIO itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. We'll be more than happy to contribute
[jira] [Created] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone
Faiz Halde created SPARK-44526: -- Summary: Porting k8s PVC reuse logic to spark standalone Key: SPARK-44526 URL: https://issues.apache.org/jira/browse/SPARK-44526 Project: Spark Issue Type: New Feature Components: Shuffle, Spark Core Affects Versions: 3.4.1 Reporter: Faiz Halde Hi, This ticket is meant to understand the work that would be involved in porting the PVC reuse feature onto the spark standalone cluster manager The logic in KubernetesLocalDiskShuffleDataIO itself is not that much. However when I tried this on the `LocalDiskShuffleExecutorComponents` it was not a successful experiment which suggests there is more to it I'd like to understand what will be the work involved for this. We'll be more than happy to contribute -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44525) Improve error message when Invoke method is not found
Cheng Pan created SPARK-44525: - Summary: Improve error message when Invoke method is not found Key: SPARK-44525 URL: https://issues.apache.org/jira/browse/SPARK-44525 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.2 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746436#comment-17746436 ] Stijn De Haes commented on SPARK-44152: --- We are seeing this issue too; the problem seems to be this PR: [https://github.com/apache/spark/pull/37417] When building an image we copy the jars into the workdir location; however, now when the job is running, Spark removes everything in that workdir location, resulting in this error. I am not sure how to continue; what would be the best location to copy the assembly jar? > Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2 Recently there were some vulnerabilities in spark 3.3.2 > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0 > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way?
[jira] [Resolved] (SPARK-44519) SparkConnectServerUtils generated incorrect parameters for jars
[ https://issues.apache.org/jira/browse/SPARK-44519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44519. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42121 [https://github.com/apache/spark/pull/42121] > SparkConnectServerUtils generated incorrect parameters for jars > --- > > Key: SPARK-44519 > URL: https://issues.apache.org/jira/browse/SPARK-44519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > SparkConnectServerUtils generates multiple --jars parameters, which causes a > bug where the class cannot be found.
[jira] [Assigned] (SPARK-44519) SparkConnectServerUtils generated incorrect parameters for jars
[ https://issues.apache.org/jira/browse/SPARK-44519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44519: Assignee: jiaan.geng
[jira] [Assigned] (SPARK-44521) `SparkConnectServiceSuite` has directory residue after testing
[ https://issues.apache.org/jira/browse/SPARK-44521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44521: Assignee: Yang Jie > `SparkConnectServiceSuite` has directory residue after testing > -- > > Key: SPARK-44521 > URL: https://issues.apache.org/jira/browse/SPARK-44521 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > run > > > {code:java} > build/sbt "connect/testOnly > org.apache.spark.sql.connect.planner.SparkConnectServiceSuite" > git status {code} > > There are residual directories as follows > > {code:java} > connector/connect/server/282ce745-440f-44ac-9f43-4fad70d89a44/ > connector/connect/server/my/ {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44521) `SparkConnectServiceSuite` has directory residue after testing
[ https://issues.apache.org/jira/browse/SPARK-44521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44521. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42122 [https://github.com/apache/spark/pull/42122] > `SparkConnectServiceSuite` has directory residue after testing > -- > > Key: SPARK-44521 > URL: https://issues.apache.org/jira/browse/SPARK-44521 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0, 4.0.0 > > > run > > > {code:java} > build/sbt "connect/testOnly > org.apache.spark.sql.connect.planner.SparkConnectServiceSuite" > git status {code} > > There are residual directories as follows > > {code:java} > connector/connect/server/282ce745-440f-44ac-9f43-4fad70d89a44/ > connector/connect/server/my/ {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43831) Build and Run Spark on Java 21
[ https://issues.apache.org/jira/browse/SPARK-43831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746304#comment-17746304 ] Dongjoon Hyun commented on SPARK-43831: --- According to the assessment result (up to now), I switched the Target Version from 3.5.0 to 4.0.0 because we need the next version of Apache Arrow dependency. > Build and Run Spark on Java 21 > -- > > Key: SPARK-43831 > URL: https://issues.apache.org/jira/browse/SPARK-43831 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Major > > - [https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html] > ||JDK version||Minimum Scala versions|| > |21 (ea)|3.3.1 (soon), 2.13.11, 2.12.18| > |20|3.3.0, 2.13.11, 2.12.18| > |19|3.2.0, 2.13.9, 2.12.16| > |18|3.1.3, 2.13.7, 2.12.15| > |17 (LTS)|3.0.0, 2.13.6, 2.12.15| > |11 (LTS)|3.0.0, 2.13.0, 2.12.4, 2.11.12| > |8 (LTS)|3.0.0, 2.13.0, 2.12.0, 2.11.0| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43831) Build and Run Spark on Java 21
[ https://issues.apache.org/jira/browse/SPARK-43831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43831: -- Target Version/s: 4.0.0 (was: 3.5.0) > Build and Run Spark on Java 21 > -- > > Key: SPARK-43831 > URL: https://issues.apache.org/jira/browse/SPARK-43831 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Major > > - [https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html] > ||JDK version||Minimum Scala versions|| > |21 (ea)|3.3.1 (soon), 2.13.11, 2.12.18| > |20|3.3.0, 2.13.11, 2.12.18| > |19|3.2.0, 2.13.9, 2.12.16| > |18|3.1.3, 2.13.7, 2.12.15| > |17 (LTS)|3.0.0, 2.13.6, 2.12.15| > |11 (LTS)|3.0.0, 2.13.0, 2.12.4, 2.11.12| > |8 (LTS)|3.0.0, 2.13.0, 2.12.0, 2.11.0| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44524) Add a new test group for pyspark-pandas-slow-connect module
[ https://issues.apache.org/jira/browse/SPARK-44524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746296#comment-17746296 ] ASF GitHub Bot commented on SPARK-44524: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/42115 > Add a new test group for pyspark-pandas-slow-connect module > > > Key: SPARK-44524 > URL: https://issues.apache.org/jira/browse/SPARK-44524 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44524) Add a new test group for pyspark-pandas-slow-connect module
[ https://issues.apache.org/jira/browse/SPARK-44524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746294#comment-17746294 ] ASF GitHub Bot commented on SPARK-44524: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/42115 > Add a new test group for pyspark-pandas-slow-connect module > > > Key: SPARK-44524 > URL: https://issues.apache.org/jira/browse/SPARK-44524 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44523) Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44523: Summary: Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral (was: Filter's maxRows should be 0 if condition is FalseLiteral) > Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral > -- > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746290#comment-17746290 ] Yuming Wang commented on SPARK-44523: - https://github.com/apache/spark/pull/42126 > Filter's maxRows should be 0 if condition is FalseLiteral > - > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746288#comment-17746288 ] ASF GitHub Bot commented on SPARK-44523: User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/42126 > Filter's maxRows should be 0 if condition is FalseLiteral > - > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44524) Add a new test group for pyspark-pandas-slow-connect module
BingKun Pan created SPARK-44524: --- Summary: Add a new test group for pyspark-pandas-slow-connect module Key: SPARK-44524 URL: https://issues.apache.org/jira/browse/SPARK-44524 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746286#comment-17746286 ] ASF GitHub Bot commented on SPARK-44523: User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/42126 > Filter's maxRows should be 0 if condition is FalseLiteral > - > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
Yuming Wang created SPARK-44523: --- Summary: Filter's maxRows should be 0 if condition is FalseLiteral Key: SPARK-44523 URL: https://issues.apache.org/jira/browse/SPARK-44523 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
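To illustrate what this ticket's title (later refined to cover maxRowsPerPartition as well) describes, here is a minimal sketch with invented class names, not Spark's actual Catalyst API: a Filter whose condition is a literal false can never emit a row, so its row-count upper bound is 0 rather than its child's bound, which in turn lets an optimizer replace the subtree with an empty relation.

```python
# Toy logical-plan nodes; illustrative only, not Spark's actual classes.
class Relation:
    def __init__(self, max_rows):
        self.max_rows = max_rows

class Filter:
    def __init__(self, condition, child):
        self.condition, self.child = condition, child

    @property
    def max_rows(self):
        # A filter whose condition is literally false emits no rows at all,
        # so its upper bound is 0 instead of the child's bound.
        if self.condition is False:
            return 0
        return self.child.max_rows

def optimize(plan):
    """Replace any subtree that provably produces no rows with an empty relation."""
    if plan.max_rows == 0:
        return Relation(max_rows=0)
    return plan

assert Filter(False, Relation(1000)).max_rows == 0
assert Filter(True, Relation(1000)).max_rows == 1000
assert optimize(Filter(False, Relation(1000))).max_rows == 0
```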
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output if AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Description: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found that when AQE is enabled, the following code does not produce sorted output (.drop() also has the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem. was: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found that when AQE is enabled, the following code does not produce sorted output {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem.
> dataset.sort.select.write.partitionBy does not return a sorted output if AQE > is enabled > --- > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output (.drop() also has the same problem) > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output")}} > Below is the complete code that reproduces the problem.
[jira] [Commented] (SPARK-44509) Fine grained interrupt in Python Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-44509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746270#comment-17746270 ] GridGain Integration commented on SPARK-44509: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/42120 > Fine grained interrupt in Python Spark Connect > -- > > Key: SPARK-44509 > URL: https://issues.apache.org/jira/browse/SPARK-44509 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Priority: Major > > Same as https://issues.apache.org/jira/browse/SPARK-44422 but need it for > Python >
[jira] [Resolved] (SPARK-44371) Define the computing logic through PartitionEvaluator API and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec
[ https://issues.apache.org/jira/browse/SPARK-44371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng resolved SPARK-44371. Resolution: Won't Fix > Define the computing logic through PartitionEvaluator API and use it in > CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec > - > > Key: SPARK-44371 > URL: https://issues.apache.org/jira/browse/SPARK-44371 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major >
[jira] (SPARK-44371) Define the computing logic through PartitionEvaluator API and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec
[ https://issues.apache.org/jira/browse/SPARK-44371 ] jiaan.geng deleted comment on SPARK-44371: was (Author: beliefer): I'm working on. > Define the computing logic through PartitionEvaluator API and use it in > CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec > - > > Key: SPARK-44371 > URL: https://issues.apache.org/jira/browse/SPARK-44371 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major >
[jira] [Commented] (SPARK-44371) Define the computing logic through PartitionEvaluator API and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec
[ https://issues.apache.org/jira/browse/SPARK-44371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746248#comment-17746248 ] jiaan.geng commented on SPARK-44371: [~cloud_fan] and I discussed offline; we don't need to make this change. > Define the computing logic through PartitionEvaluator API and use it in > CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec > - > > Key: SPARK-44371 > URL: https://issues.apache.org/jira/browse/SPARK-44371 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major >
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output if AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Attachment: (was: Test.java) > dataset.sort.select.write.partitionBy does not return a sorted output if AQE > is enabled > --- > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. > {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output")}} > Below is the complete code that reproduces the problem.
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output if AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746025#comment-17746025 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/24/23 7:12 AM: Here is the [gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that reproduces the issue. To compile: javac Test.java && jar cvf Test.jar Test.class To reproduce the bug: spark-submit --class Test Test.jar No bug if the workaround is enabled: spark-submit --class Test Test.jar workaround Also no bug if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key) was (Author: JIRAUSER301473): [^Test.java] (Attached the code) To compile: javac Test.java && jar cvf Test.jar Test.class bug reproduce: spark-submit --class Test Test.jar no bug if workaround is enabled: spark-submit --class Test Test.jar workaround no bug too if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key) > dataset.sort.select.write.partitionBy does not return a sorted output if AQE > is enabled > --- > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found that when AQE is enabled, the following code does not produce sorted > output > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > However, if I insert an identity mapper between select and write, the output > would be sorted as expected. 
> {{dataset = dataset.sort("_1")}} > {{.select("_2", "_3");}} > {{dataset.map((MapFunction) row -> row, dataset.encoder())}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output")}} > Below is the complete code that reproduces the problem.