[jira] [Resolved] (SPARK-41935) Skip snapshot check and transfer progress log during publishing snapshots
[ https://issues.apache.org/jira/browse/SPARK-41935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-41935.
-----------------------------------
    Resolution: Invalid

> Skip snapshot check and transfer progress log during publishing snapshots
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41935
>                 URL: https://issues.apache.org/jira/browse/SPARK-41935
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41935) Skip snapshot check and transfer progress log during publishing snapshots
[ https://issues.apache.org/jira/browse/SPARK-41935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41935:
------------------------------------

    Assignee:     (was: Apache Spark)

> Skip snapshot check and transfer progress log during publishing snapshots
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41935
>                 URL: https://issues.apache.org/jira/browse/SPARK-41935
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Assigned] (SPARK-41935) Skip snapshot check and transfer progress log during publishing snapshots
[ https://issues.apache.org/jira/browse/SPARK-41935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41935:
------------------------------------

    Assignee: Apache Spark

> Skip snapshot check and transfer progress log during publishing snapshots
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41935
>                 URL: https://issues.apache.org/jira/browse/SPARK-41935
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Apache Spark
>            Priority: Minor
>
[jira] [Commented] (SPARK-41935) Skip snapshot check and transfer progress log during publishing snapshots
[ https://issues.apache.org/jira/browse/SPARK-41935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655659#comment-17655659 ]

Apache Spark commented on SPARK-41935:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39443

> Skip snapshot check and transfer progress log during publishing snapshots
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41935
>                 URL: https://issues.apache.org/jira/browse/SPARK-41935
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Created] (SPARK-41935) Skip snapshot check and transfer progress log in release-build.sh
Dongjoon Hyun created SPARK-41935:
-------------------------------------

             Summary: Skip snapshot check and transfer progress log in release-build.sh
                 Key: SPARK-41935
                 URL: https://issues.apache.org/jira/browse/SPARK-41935
             Project: Spark
          Issue Type: Task
          Components: Project Infra
    Affects Versions: 3.4.0
            Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-41935) Skip snapshot check and transfer progress log during publishing snapshots
[ https://issues.apache.org/jira/browse/SPARK-41935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-41935:
----------------------------------
    Summary: Skip snapshot check and transfer progress log during publishing snapshots  (was: Skip snapshot check and transfer progress log in release-build.sh)

> Skip snapshot check and transfer progress log during publishing snapshots
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41935
>                 URL: https://issues.apache.org/jira/browse/SPARK-41935
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Assigned] (SPARK-41934) Add the unsupported function list for `session`
[ https://issues.apache.org/jira/browse/SPARK-41934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41934:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add the unsupported function list for `session`
> -----------------------------------------------
>
>                 Key: SPARK-41934
>                 URL: https://issues.apache.org/jira/browse/SPARK-41934
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41934) Add the unsupported function list for `session`
[ https://issues.apache.org/jira/browse/SPARK-41934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41934:
------------------------------------

    Assignee: Apache Spark

> Add the unsupported function list for `session`
> -----------------------------------------------
>
>                 Key: SPARK-41934
>                 URL: https://issues.apache.org/jira/browse/SPARK-41934
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Apache Spark
>            Priority: Major
>
[jira] [Assigned] (SPARK-41934) Add the unsupported function list for `session`
[ https://issues.apache.org/jira/browse/SPARK-41934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41934:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add the unsupported function list for `session`
> -----------------------------------------------
>
>                 Key: SPARK-41934
>                 URL: https://issues.apache.org/jira/browse/SPARK-41934
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Commented] (SPARK-41934) Add the unsupported function list for `session`
[ https://issues.apache.org/jira/browse/SPARK-41934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655658#comment-17655658 ]

Apache Spark commented on SPARK-41934:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39442

> Add the unsupported function list for `session`
> -----------------------------------------------
>
>                 Key: SPARK-41934
>                 URL: https://issues.apache.org/jira/browse/SPARK-41934
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41933) Provide local mode that automatically starts the server
[ https://issues.apache.org/jira/browse/SPARK-41933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41933:
------------------------------------

    Assignee:     (was: Apache Spark)

> Provide local mode that automatically starts the server
> -------------------------------------------------------
>
>                 Key: SPARK-41933
>                 URL: https://issues.apache.org/jira/browse/SPARK-41933
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> Currently the Spark Connect server has to be started manually which is
> troublesome for end users and developers to try Spark Connect out.
[jira] [Assigned] (SPARK-41933) Provide local mode that automatically starts the server
[ https://issues.apache.org/jira/browse/SPARK-41933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41933:
------------------------------------

    Assignee: Apache Spark

> Provide local mode that automatically starts the server
> -------------------------------------------------------
>
>                 Key: SPARK-41933
>                 URL: https://issues.apache.org/jira/browse/SPARK-41933
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently the Spark Connect server has to be started manually which is
> troublesome for end users and developers to try Spark Connect out.
[jira] [Commented] (SPARK-41933) Provide local mode that automatically starts the server
[ https://issues.apache.org/jira/browse/SPARK-41933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655657#comment-17655657 ]

Apache Spark commented on SPARK-41933:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39441

> Provide local mode that automatically starts the server
> -------------------------------------------------------
>
>                 Key: SPARK-41933
>                 URL: https://issues.apache.org/jira/browse/SPARK-41933
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> Currently the Spark Connect server has to be started manually which is
> troublesome for end users and developers to try Spark Connect out.
[jira] [Created] (SPARK-41934) Add the unsupported function list for `session`
Ruifeng Zheng created SPARK-41934:
-------------------------------------

             Summary: Add the unsupported function list for `session`
                 Key: SPARK-41934
                 URL: https://issues.apache.org/jira/browse/SPARK-41934
             Project: Spark
          Issue Type: Sub-task
          Components: Connect, PySpark
    Affects Versions: 3.4.0
            Reporter: Ruifeng Zheng
[jira] [Created] (SPARK-41933) Provide local mode that automatically starts the server
Hyukjin Kwon created SPARK-41933:
------------------------------------

             Summary: Provide local mode that automatically starts the server
                 Key: SPARK-41933
                 URL: https://issues.apache.org/jira/browse/SPARK-41933
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Hyukjin Kwon


Currently the Spark Connect server has to be started manually which is
troublesome for end users and developers to try Spark Connect out.
[jira] [Created] (SPARK-41932) Bootstrapping Spark Connect
Hyukjin Kwon created SPARK-41932:
------------------------------------

             Summary: Bootstrapping Spark Connect
                 Key: SPARK-41932
                 URL: https://issues.apache.org/jira/browse/SPARK-41932
             Project: Spark
          Issue Type: Test
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Hyukjin Kwon
            Assignee: Hyukjin Kwon


We should:
1. Have an easy way to start the server. Like sbin/start-thriftserver
2. Provide an easy way to run the PySpark shell without manual server start. Like spark-sql script.
[jira] [Commented] (SPARK-40451) Type annotations for Spark Connect Python client
[ https://issues.apache.org/jira/browse/SPARK-40451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655647#comment-17655647 ]

Hyukjin Kwon commented on SPARK-40451:
--------------------------------------

I believe this is done.

> Type annotations for Spark Connect Python client
> ------------------------------------------------
>
>                 Key: SPARK-40451
>                 URL: https://issues.apache.org/jira/browse/SPARK-40451
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Priority: Major
>
> The mypy checks for the Spark Connect client have been disabled to make
> quicker progress with the merge of the code. The goal for this task is to
> address the failing checks and re-enable mypy.
[jira] [Resolved] (SPARK-40451) Type annotations for Spark Connect Python client
[ https://issues.apache.org/jira/browse/SPARK-40451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-40451.
----------------------------------
      Assignee: Hyukjin Kwon
    Resolution: Done

> Type annotations for Spark Connect Python client
> ------------------------------------------------
>
>                 Key: SPARK-40451
>                 URL: https://issues.apache.org/jira/browse/SPARK-40451
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> The mypy checks for the Spark Connect client have been disabled to make
> quicker progress with the merge of the code. The goal for this task is to
> address the failing checks and re-enable mypy.
[jira] [Resolved] (SPARK-41927) Add the unsupported list for `GroupedData`
[ https://issues.apache.org/jira/browse/SPARK-41927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41927.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39437
[https://github.com/apache/spark/pull/39437]

> Add the unsupported list for `GroupedData`
> ------------------------------------------
>
>                 Key: SPARK-41927
>                 URL: https://issues.apache.org/jira/browse/SPARK-41927
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>             Fix For: 3.4.0
>
[jira] [Assigned] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-41824:
------------------------------------

    Assignee: jiaan.geng

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
>                 Key: SPARK-41824
>                 URL: https://issues.apache.org/jira/browse/SPARK-41824
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Assignee: jiaan.geng
>            Priority: Major
>
> {code:java}
> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
> df.explain()
> df.explain(True)
> df.explain(mode="formatted")
> df.explain("cost"){code}
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
>
> **********************************************************************
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
> {code}
[jira] [Resolved] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41824.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39436
[https://github.com/apache/spark/pull/39436]

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
>                 Key: SPARK-41824
>                 URL: https://issues.apache.org/jira/browse/SPARK-41824
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Assignee: jiaan.geng
>            Priority: Major
>             Fix For: 3.4.0
>
> {code:java}
> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
> df.explain()
> df.explain(True)
> df.explain(mode="formatted")
> df.explain("cost"){code}
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
>
> **********************************************************************
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
> {code}
[jira] [Assigned] (SPARK-41927) Add the unsupported list for `GroupedData`
[ https://issues.apache.org/jira/browse/SPARK-41927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-41927:
------------------------------------

    Assignee: Ruifeng Zheng

> Add the unsupported list for `GroupedData`
> ------------------------------------------
>
>                 Key: SPARK-41927
>                 URL: https://issues.apache.org/jira/browse/SPARK-41927
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>
[jira] [Resolved] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41928.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39438
[https://github.com/apache/spark/pull/39438]

> Add the unsupported list for functions
> --------------------------------------
>
>                 Key: SPARK-41928
>                 URL: https://issues.apache.org/jira/browse/SPARK-41928
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>             Fix For: 3.4.0
>
[jira] [Assigned] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-41928:
------------------------------------

    Assignee: Ruifeng Zheng

> Add the unsupported list for functions
> --------------------------------------
>
>                 Key: SPARK-41928
>                 URL: https://issues.apache.org/jira/browse/SPARK-41928
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-41929:
------------------------------------

    Assignee: Ruifeng Zheng

> Add function array_compact
> --------------------------
>
>                 Key: SPARK-41929
>                 URL: https://issues.apache.org/jira/browse/SPARK-41929
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
[ https://issues.apache.org/jira/browse/SPARK-41930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-41930:
-------------------------------------

    Assignee: Dongjoon Hyun

> Remove `branch-3.1` from publish_snapshot job
> ---------------------------------------------
>
>                 Key: SPARK-41930
>                 URL: https://issues.apache.org/jira/browse/SPARK-41930
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Resolved] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
[ https://issues.apache.org/jira/browse/SPARK-41930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-41930.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39440
[https://github.com/apache/spark/pull/39440]

> Remove `branch-3.1` from publish_snapshot job
> ---------------------------------------------
>
>                 Key: SPARK-41930
>                 URL: https://issues.apache.org/jira/browse/SPARK-41930
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 3.4.0
>
[jira] [Resolved] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41929.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39439
[https://github.com/apache/spark/pull/39439]

> Add function array_compact
> --------------------------
>
>                 Key: SPARK-41929
>                 URL: https://issues.apache.org/jira/browse/SPARK-41929
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>             Fix For: 3.4.0
>
[jira] [Created] (SPARK-41931) Improve UNSUPPORTED_DATA_TYPE message for complex types
Serge Rielau created SPARK-41931:
------------------------------------

             Summary: Improve UNSUPPORTED_DATA_TYPE message for complex types
                 Key: SPARK-41931
                 URL: https://issues.apache.org/jira/browse/SPARK-41931
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.4.0
            Reporter: Serge Rielau


spark-sql> SELECT CAST(array(1, 2, 3) AS ARRAY);
[UNSUPPORTED_DATATYPE] Unsupported data type "ARRAY" (line 1, pos 30)

== SQL ==
SELECT CAST(array(1, 2, 3) AS ARRAY)
------------------------------^^^

This error message is confusing. We support ARRAY. We just require it to be typed.
We should have an error like:
[INCOMPLETE_TYPE_DEFINITION.ARRAY] The definition of type `ARRAY` is incomplete. You must provide an element type. For example: `ARRAY<INT>`.
Similarly for STRUCT and MAP.
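The proposal above can be sketched in plain Python. This is a hypothetical illustration only, not Spark's actual parser or error framework: the `INCOMPLETE_TYPE_HINTS` table and the `type_error_message` helper are invented here to show the intended shape of the improved message.

```python
# Hypothetical sketch of the error-message improvement proposed in
# SPARK-41931: when a complex type keyword appears without type
# parameters, point at the missing element/key/field types instead of
# raising a generic UNSUPPORTED_DATATYPE error.

INCOMPLETE_TYPE_HINTS = {
    "ARRAY": "You must provide an element type. For example: `ARRAY<INT>`.",
    "MAP": "You must provide a key type and a value type. For example: `MAP<STRING, INT>`.",
    "STRUCT": "You must provide field names and types. For example: `STRUCT<a: INT>`.",
}


def type_error_message(type_text: str) -> str:
    """Return an error message for an unsupported or incomplete type string."""
    base = type_text.strip().upper()
    if base in INCOMPLETE_TYPE_HINTS:
        # Complex type written without parameters: incomplete, not unsupported.
        return (
            f"[INCOMPLETE_TYPE_DEFINITION.{base}] "
            f"The definition of type `{base}` is incomplete. "
            + INCOMPLETE_TYPE_HINTS[base]
        )
    # Genuinely unknown type name: keep the original error class.
    return f'[UNSUPPORTED_DATATYPE] Unsupported data type "{base}"'


print(type_error_message("ARRAY"))
print(type_error_message("FOOBAR"))
```

The point of the split is that `ARRAY`, `MAP`, and `STRUCT` are supported types that were merely written incompletely, so they deserve a different error class and a concrete fix-it hint.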
[jira] [Assigned] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
[ https://issues.apache.org/jira/browse/SPARK-41930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41930:
------------------------------------

    Assignee: Apache Spark

> Remove `branch-3.1` from publish_snapshot job
> ---------------------------------------------
>
>                 Key: SPARK-41930
>                 URL: https://issues.apache.org/jira/browse/SPARK-41930
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Apache Spark
>            Priority: Minor
>
[jira] [Commented] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
[ https://issues.apache.org/jira/browse/SPARK-41930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655636#comment-17655636 ]

Apache Spark commented on SPARK-41930:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39440

> Remove `branch-3.1` from publish_snapshot job
> ---------------------------------------------
>
>                 Key: SPARK-41930
>                 URL: https://issues.apache.org/jira/browse/SPARK-41930
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Assigned] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
[ https://issues.apache.org/jira/browse/SPARK-41930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41930:
------------------------------------

    Assignee:     (was: Apache Spark)

> Remove `branch-3.1` from publish_snapshot job
> ---------------------------------------------
>
>                 Key: SPARK-41930
>                 URL: https://issues.apache.org/jira/browse/SPARK-41930
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Created] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
Dongjoon Hyun created SPARK-41930:
-------------------------------------

             Summary: Remove `branch-3.1` from publish_snapshot job
                 Key: SPARK-41930
                 URL: https://issues.apache.org/jira/browse/SPARK-41930
             Project: Spark
          Issue Type: Task
          Components: Project Infra
    Affects Versions: 3.4.0
            Reporter: Dongjoon Hyun
[jira] [Commented] (SPARK-41904) Fix Function `nth_value` functions output
[ https://issues.apache.org/jira/browse/SPARK-41904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655629#comment-17655629 ]

jiaan.geng commented on SPARK-41904:
------------------------------------

[~techaddict] Could you tell me how to reproduce this issue? I want to take a look!

> Fix Function `nth_value` functions output
> -----------------------------------------
>
>                 Key: SPARK-41904
>                 URL: https://issues.apache.org/jira/browse/SPARK-41904
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>
> {code:java}
> from pyspark.sql import Window
> from pyspark.sql.functions import nth_value
>
> df = self.spark.createDataFrame(
>     [
>         ("a", 0, None),
>         ("a", 1, "x"),
>         ("a", 2, "y"),
>         ("a", 3, "z"),
>         ("a", 4, None),
>         ("b", 1, None),
>         ("b", 2, None),
>     ],
>     schema=("key", "order", "value"),
> )
> w = Window.partitionBy("key").orderBy("order")
> rs = df.select(
>     df.key,
>     df.order,
>     nth_value("value", 2).over(w),
>     nth_value("value", 2, False).over(w),
>     nth_value("value", 2, True).over(w),
> ).collect()
> expected = [
>     ("a", 0, None, None, None),
>     ("a", 1, "x", "x", None),
>     ("a", 2, "x", "x", "y"),
>     ("a", 3, "x", "x", "y"),
>     ("a", 4, "x", "x", "y"),
>     ("b", 1, None, None, None),
>     ("b", 2, None, None, None),
> ]
> for r, ex in zip(sorted(rs), sorted(expected)):
>     self.assertEqual(tuple(r), ex[: len(r)]){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 755, in test_nth_value
>     self.assertEqual(tuple(r), ex[: len(r)])
> AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')
>
> First differing element 3:
> None
> 'x'
>
> - ('a', 1, 'x', None)
> + ('a', 1, 'x', 'x')
> ?               ^^^
> {code}
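For context, the ignoreNulls semantics that the test above exercises can be modeled in plain Python. This is a sketch only: the `nth_value` helper below is a hypothetical stand-in for the SQL window function, applied to a running frame the way an ORDER BY window (unbounded preceding to current row) would be.

```python
def nth_value(frame, n, ignore_nulls=False):
    """nth_value over a window frame (1-based), mirroring SQL semantics.

    With ignore_nulls=True, null (None) values are skipped before
    counting to the n-th element; if fewer than n values qualify,
    the result is null.
    """
    values = [v for v in frame if v is not None] if ignore_nulls else list(frame)
    return values[n - 1] if len(values) >= n else None


# Partition "a" from the ticket, ordered by `order`. The frame for each
# row is every row up to and including the current one.
part_a = [None, "x", "y", "z", None]
rows = [nth_value(part_a[: i + 1], 2, ignore_nulls=True) for i in range(len(part_a))]
print(rows)  # [None, None, 'y', 'y', 'y']
```

This reproduces the fifth column of the ticket's `expected` list for partition "a": the first non-null is "x" at order 1, so the second non-null ("y") only becomes visible from order 2 onward, which is exactly why `nth_value("value", 2, True)` should yield None for orders 0 and 1.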
[jira] [Commented] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655628#comment-17655628 ]

Apache Spark commented on SPARK-41929:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39439

> Add function array_compact
> --------------------------
>
>                 Key: SPARK-41929
>                 URL: https://issues.apache.org/jira/browse/SPARK-41929
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Commented] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655627#comment-17655627 ]

Apache Spark commented on SPARK-41929:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39439

> Add function array_compact
> --------------------------
>
>                 Key: SPARK-41929
>                 URL: https://issues.apache.org/jira/browse/SPARK-41929
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41929:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add function array_compact
> --------------------------
>
>                 Key: SPARK-41929
>                 URL: https://issues.apache.org/jira/browse/SPARK-41929
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41929: Assignee: Apache Spark > Add function array_compact > -- > > Key: SPARK-41929 > URL: https://issues.apache.org/jira/browse/SPARK-41929 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-41929) Add function array_compact
Ruifeng Zheng created SPARK-41929: - Summary: Add function array_compact Key: SPARK-41929 URL: https://issues.apache.org/jira/browse/SPARK-41929 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
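The `array_compact` function tracked by SPARK-41929 removes null entries from an array column. As a rough illustration of the intended semantics, here is a plain-Python sketch; the helper is not the PySpark implementation, and its null-input behavior is an assumption:

```python
def array_compact(arr):
    """Sketch of array_compact semantics: drop null (None) entries,
    keeping order and duplicates. Illustrative only -- the real function
    operates on Spark array columns, not Python lists."""
    if arr is None:
        return None  # assumed: null input yields null output
    return [x for x in arr if x is not None]

print(array_compact([1, None, 2, None, 2]))  # [1, 2, 2]
```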
[jira] [Commented] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655626#comment-17655626 ] Apache Spark commented on SPARK-41928: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39438 > Add the unsupported list for functions > -- > > Key: SPARK-41928 > URL: https://issues.apache.org/jira/browse/SPARK-41928 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Assigned] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41928: Assignee: (was: Apache Spark) > Add the unsupported list for functions > -- > > Key: SPARK-41928 > URL: https://issues.apache.org/jira/browse/SPARK-41928 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Assigned] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41928: Assignee: Apache Spark > Add the unsupported list for functions > -- > > Key: SPARK-41928 > URL: https://issues.apache.org/jira/browse/SPARK-41928 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655625#comment-17655625 ] Apache Spark commented on SPARK-41928: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39438 > Add the unsupported list for functions > -- > > Key: SPARK-41928 > URL: https://issues.apache.org/jira/browse/SPARK-41928 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Created] (SPARK-41928) Add the unsupported list for functions
Ruifeng Zheng created SPARK-41928: - Summary: Add the unsupported list for functions Key: SPARK-41928 URL: https://issues.apache.org/jira/browse/SPARK-41928 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Assigned] (SPARK-41927) Add the unsupported list for `GroupedData`
[ https://issues.apache.org/jira/browse/SPARK-41927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41927: Assignee: (was: Apache Spark) > Add the unsupported list for `GroupedData` > -- > > Key: SPARK-41927 > URL: https://issues.apache.org/jira/browse/SPARK-41927 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Commented] (SPARK-41927) Add the unsupported list for `GroupedData`
[ https://issues.apache.org/jira/browse/SPARK-41927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655624#comment-17655624 ] Apache Spark commented on SPARK-41927: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39437 > Add the unsupported list for `GroupedData` > -- > > Key: SPARK-41927 > URL: https://issues.apache.org/jira/browse/SPARK-41927 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Assigned] (SPARK-41927) Add the unsupported list for `GroupedData`
[ https://issues.apache.org/jira/browse/SPARK-41927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41927: Assignee: Apache Spark > Add the unsupported list for `GroupedData` > -- > > Key: SPARK-41927 > URL: https://issues.apache.org/jira/browse/SPARK-41927 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41824:

    Assignee: (was: Apache Spark)

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
> Key: SPARK-41824
> URL: https://issues.apache.org/jira/browse/SPARK-41824
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:java}
> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
> df.explain()
> df.explain(True)
> df.explain(mode="formatted")
> df.explain("cost"){code}
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
> **
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
> {code}
[jira] [Commented] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655623#comment-17655623 ]

Apache Spark commented on SPARK-41824:
--------------------------------------

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/39436

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
> Key: SPARK-41824
> URL: https://issues.apache.org/jira/browse/SPARK-41824
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:java}
> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
> df.explain()
> df.explain(True)
> df.explain(mode="formatted")
> df.explain("cost"){code}
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
> **
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
> {code}
[jira] [Assigned] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41824:

    Assignee: Apache Spark

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
> Key: SPARK-41824
> URL: https://issues.apache.org/jira/browse/SPARK-41824
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Apache Spark
> Priority: Major
>
> {code:java}
> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
> df.explain()
> df.explain(True)
> df.explain(mode="formatted")
> df.explain("cost"){code}
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
> **
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
> {code}
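The failing doctests in SPARK-41824 compare plan text, but independent of the plan output, `DataFrame.explain` accepts either a boolean `extended` flag or a string `mode`. A hedged sketch of how such arguments might be normalized to a single explain-mode string; the mode names mirror the documented PySpark modes, but the helper itself is illustrative, not PySpark source:

```python
def normalize_explain_mode(extended=None, mode=None):
    """Illustrative sketch: map explain()-style arguments to one
    plan-explain mode string (assumed behavior, not PySpark code)."""
    if extended is not None and mode is not None:
        raise ValueError("extended and mode should not be set together")
    if mode is not None:
        allowed = {"simple", "extended", "codegen", "cost", "formatted"}
        if mode not in allowed:
            raise ValueError(f"unknown explain mode: {mode!r}")
        return mode
    return "extended" if extended else "simple"

print(normalize_explain_mode(mode="cost"))  # cost
```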
[jira] [Created] (SPARK-41927) Add the unsupported list for `GroupedData`
Ruifeng Zheng created SPARK-41927: - Summary: Add the unsupported list for `GroupedData` Key: SPARK-41927 URL: https://issues.apache.org/jira/browse/SPARK-41927 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Assigned] (SPARK-41926) Add Github action test job with RocksDB as UI backend
[ https://issues.apache.org/jira/browse/SPARK-41926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41926: Assignee: Gengliang Wang (was: Apache Spark) > Add Github action test job with RocksDB as UI backend > - > > Key: SPARK-41926 > URL: https://issues.apache.org/jira/browse/SPARK-41926 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major >
[jira] [Assigned] (SPARK-41926) Add Github action test job with RocksDB as UI backend
[ https://issues.apache.org/jira/browse/SPARK-41926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41926: Assignee: Apache Spark (was: Gengliang Wang) > Add Github action test job with RocksDB as UI backend > - > > Key: SPARK-41926 > URL: https://issues.apache.org/jira/browse/SPARK-41926 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-41926) Add Github action test job with RocksDB as UI backend
[ https://issues.apache.org/jira/browse/SPARK-41926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655622#comment-17655622 ] Apache Spark commented on SPARK-41926: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/39435 > Add Github action test job with RocksDB as UI backend > - > > Key: SPARK-41926 > URL: https://issues.apache.org/jira/browse/SPARK-41926 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major >
[jira] [Created] (SPARK-41926) Add Github action test job with RocksDB as UI backend
Gengliang Wang created SPARK-41926: -- Summary: Add Github action test job with RocksDB as UI backend Key: SPARK-41926 URL: https://issues.apache.org/jira/browse/SPARK-41926 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Gengliang Wang
[jira] [Resolved] (SPARK-41875) Throw proper errors in Dataset.to()
[ https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-41875.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39422
[https://github.com/apache/spark/pull/39422]

> Throw proper errors in Dataset.to()
> -----------------------------------
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: jiaan.geng
> Priority: Major
> Fix For: 3.4.0
>
> {code:java}
> schema = StructType(
>     [StructField("i", StringType(), True), StructField("j", IntegerType(), True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
>     AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by {code}
[jira] [Assigned] (SPARK-41875) Throw proper errors in Dataset.to()
[ https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-41875:
-------------------------------------
    Assignee: jiaan.geng

> Throw proper errors in Dataset.to()
> -----------------------------------
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: jiaan.geng
> Priority: Major
>
> {code:java}
> schema = StructType(
>     [StructField("i", StringType(), True), StructField("j", IntegerType(), True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
>     AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by {code}
[jira] [Resolved] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-41898.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39433
[https://github.com/apache/spark/pull/39433]

> Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as
> argument
> -----------------------------------------------------------------------
>
> Key: SPARK-41898
> URL: https://issues.apache.org/jira/browse/SPARK-41898
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Sandeep Singh
> Priority: Major
> Fix For: 3.4.0
>
> {code:java}
> df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"])
> w = Window.partitionBy("value").orderBy("key")
> from pyspark.sql import functions as F
> sel = df.select(
>     df.value,
>     df.key,
>     F.max("key").over(w.rowsBetween(0, 1)),
>     F.min("key").over(w.rowsBetween(0, 1)),
>     F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
>     F.row_number().over(w),
>     F.rank().over(w),
>     F.dense_rank().over(w),
>     F.ntile(2).over(w),
> )
> rs = sorted(sel.collect()){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 821, in test_window_functions
>     F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", line 152, in rowsBetween
>     raise TypeError(f"start must be a int, but got {type(start).__name__}")
> TypeError: start must be a int, but got float {code}
[jira] [Assigned] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-41898:
-------------------------------------
    Assignee: Sandeep Singh

> Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as
> argument
> -----------------------------------------------------------------------
>
> Key: SPARK-41898
> URL: https://issues.apache.org/jira/browse/SPARK-41898
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Sandeep Singh
> Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"])
> w = Window.partitionBy("value").orderBy("key")
> from pyspark.sql import functions as F
> sel = df.select(
>     df.value,
>     df.key,
>     F.max("key").over(w.rowsBetween(0, 1)),
>     F.min("key").over(w.rowsBetween(0, 1)),
>     F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
>     F.row_number().over(w),
>     F.rank().over(w),
>     F.dense_rank().over(w),
>     F.ntile(2).over(w),
> )
> rs = sorted(sel.collect()){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 821, in test_window_functions
>     F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", line 152, in rowsBetween
>     raise TypeError(f"start must be a int, but got {type(start).__name__}")
> TypeError: start must be a int, but got float {code}
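The traceback above shows Spark Connect's `Window.rowsBetween` rejecting `float("-inf")`/`float("inf")`, which classic PySpark accepts by clamping them to the unbounded-frame sentinels. A hedged plain-Python sketch of that kind of normalization; the sentinel values are illustrative (mirroring `Window.unboundedPreceding`/`Window.unboundedFollowing` in spirit, not taken from the actual fix):

```python
import sys

# Illustrative sentinels; PySpark exposes the real ones as
# Window.unboundedPreceding / Window.unboundedFollowing.
UNBOUNDED_PRECEDING = -sys.maxsize - 1
UNBOUNDED_FOLLOWING = sys.maxsize

def normalize_boundary(value):
    """Sketch: accept ints, and map +/-inf floats to the unbounded
    sentinels instead of raising TypeError (assumed behavior of the
    SPARK-41898 fix, not the actual implementation)."""
    if isinstance(value, float):
        if value == float("-inf"):
            return UNBOUNDED_PRECEDING
        if value == float("inf"):
            return UNBOUNDED_FOLLOWING
        raise TypeError(f"start/end must be an int, but got {value!r}")
    if isinstance(value, int):
        return value
    raise TypeError(f"start/end must be an int, but got {type(value).__name__}")
```

With this, a call like `rowsBetween(float("-inf"), float("inf"))` would resolve to a fully unbounded frame instead of failing.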
[jira] [Assigned] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
[ https://issues.apache.org/jira/browse/SPARK-41924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41924: - Assignee: Ruifeng Zheng > Make StructType support metadata and Implement `DataFrame.withMetadata` > --- > > Key: SPARK-41924 > URL: https://issues.apache.org/jira/browse/SPARK-41924 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major >
[jira] [Resolved] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
[ https://issues.apache.org/jira/browse/SPARK-41924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41924. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39432 [https://github.com/apache/spark/pull/39432] > Make StructType support metadata and Implement `DataFrame.withMetadata` > --- > > Key: SPARK-41924 > URL: https://issues.apache.org/jira/browse/SPARK-41924 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > >
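SPARK-41924 makes struct fields carry metadata and adds `DataFrame.withMetadata`, which returns a new DataFrame with one column's metadata replaced. A toy plain-Python model of that shape; the dict-based schema representation here is a deliberate simplification, not the StructType API:

```python
def with_metadata(schema, column, metadata):
    """Toy model of DataFrame.withMetadata: `schema` maps column name to
    a (datatype, metadata-dict) pair; return an updated copy without
    mutating the input, as the real API returns a new DataFrame."""
    if column not in schema:
        raise KeyError(f"no such column: {column}")
    datatype, _old_metadata = schema[column]
    new_schema = dict(schema)
    new_schema[column] = (datatype, dict(metadata))
    return new_schema

schema = {"age": ("long", {}), "name": ("string", {})}
updated = with_metadata(schema, "age", {"unit": "years"})
print(updated["age"])  # ('long', {'unit': 'years'})
```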
[jira] [Commented] (SPARK-35743) Improve Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655604#comment-17655604 ] Dongjoon Hyun commented on SPARK-35743: --- Thank you for updating! > Improve Parquet vectorized reader > - > > Key: SPARK-35743 > URL: https://issues.apache.org/jira/browse/SPARK-35743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: parquet, releasenotes > Fix For: 3.4.0 > > > This umbrella JIRA tracks efforts to improve vectorized Parquet reader.
[jira] [Updated] (SPARK-35743) Improve Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35743: -- Labels: parquet releasenotes (was: parquet) > Improve Parquet vectorized reader > - > > Key: SPARK-35743 > URL: https://issues.apache.org/jira/browse/SPARK-35743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: parquet, releasenotes > Fix For: 3.4.0 > > > This umbrella JIRA tracks efforts to improve vectorized Parquet reader.
[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-36529:
-----------------------------
    Parent: (was: SPARK-35743)
    Issue Type: Bug (was: Sub-task)

> Decouple CPU with IO work in vectorized Parquet reader
> ------------------------------------------------------
>
> Key: SPARK-36529
> URL: https://issues.apache.org/jira/browse/SPARK-36529
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Major
>
> Currently it seems the vectorized Parquet reader does almost everything in a sequential manner:
> 1. read the row group using the file system API (perhaps from remote storage like S3)
> 2. allocate buffers and store those row group bytes into them
> 3. decompress the data pages
> 4. in Spark, decode all the read columns one by one
> 5. read the next row group and repeat from 1.
> A lot of improvements can be done to decouple the IO- and CPU-intensive work. In addition, we could parallelize the row group loading and column decoding, utilizing all the cores available for a Spark task.
[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-36529:
-----------------------------
    Issue Type: Improvement (was: Bug)

> Decouple CPU with IO work in vectorized Parquet reader
> ------------------------------------------------------
>
> Key: SPARK-36529
> URL: https://issues.apache.org/jira/browse/SPARK-36529
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Major
>
> Currently it seems the vectorized Parquet reader does almost everything in a sequential manner:
> 1. read the row group using the file system API (perhaps from remote storage like S3)
> 2. allocate buffers and store those row group bytes into them
> 3. decompress the data pages
> 4. in Spark, decode all the read columns one by one
> 5. read the next row group and repeat from 1.
> A lot of improvements can be done to decouple the IO- and CPU-intensive work. In addition, we could parallelize the row group loading and column decoding, utilizing all the cores available for a Spark task.
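The sequential steps described in SPARK-36529 can be overlapped by prefetching the next row group's bytes while the current one is being decoded. A minimal sketch of that idea; `read_row_group` and `decode_columns` are made-up stand-ins for the IO and CPU steps, not Spark or Parquet APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def read_row_group(i):
    # Stand-in for the IO step (fetching row-group bytes, e.g. from S3).
    return f"bytes-of-row-group-{i}"

def decode_columns(raw):
    # Stand-in for the CPU step (decompress + decode column chunks).
    return raw.upper()

def scan(num_row_groups):
    """Sketch: kick off the read of row group i+1 on a worker thread
    while the current thread decodes row group i, so IO overlaps CPU."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_row_group, 0)
        for i in range(num_row_groups):
            raw = pending.result()  # wait for the prefetched bytes
            if i + 1 < num_row_groups:
                pending = pool.submit(read_row_group, i + 1)  # next IO starts now
            results.append(decode_columns(raw))  # CPU work overlaps that IO
    return results

print(scan(3))  # ['BYTES-OF-ROW-GROUP-0', 'BYTES-OF-ROW-GROUP-1', 'BYTES-OF-ROW-GROUP-2']
```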
[jira] [Resolved] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41925. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39434 [https://github.com/apache/spark/pull/39434] > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default.
[jira] [Updated] (SPARK-36528) Implement lazy decoding for the vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-36528:
-----------------------------
    Parent: (was: SPARK-35743)
    Issue Type: Bug (was: Sub-task)

> Implement lazy decoding for the vectorized Parquet reader
> ---------------------------------------------------------
>
> Key: SPARK-36528
> URL: https://issues.apache.org/jira/browse/SPARK-36528
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Major
>
> Currently Spark first decodes (e.g., RLE/bit-packed, PLAIN) into column vectors and then operates on the decoded data. However, it may be more efficient to directly operate on encoded data, for instance, performing filter or aggregation on RLE-encoded data, or performing comparisons over dictionary-encoded string data. This can also potentially work with encodings in the Parquet v2 format, such as DELTA_BYTE_ARRAY.
[jira] [Updated] (SPARK-36528) Implement lazy decoding for the vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36528: - Issue Type: New Feature (was: Bug) > Implement lazy decoding for the vectorized Parquet reader > - > > Key: SPARK-36528 > URL: https://issues.apache.org/jira/browse/SPARK-36528 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark first decode (e.g., RLE/bit-packed, PLAIN) into column vector > and then operate on the decoded data. However, it may be more efficient to > directly operate on encoded data, for instance, performing filter or > aggregation on RLE-encoded data, or performing comparison over > dictionary-encoded string data. This can also potentially work with encodings > in Parquet v2 format, such as DELTA_BYTE_ARRAY. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
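The "operate directly on encoded data" idea in the SPARK-36528 description can be illustrated outside Spark. The sketch below is a hypothetical pure-Python illustration (none of these names exist in Spark's vectorized reader) of why evaluating a predicate per RLE run can beat decode-then-filter:

```python
# Hypothetical illustration of the "operate on encoded data" idea from the
# issue description; none of these names exist in Spark itself.

def rle_encode(values):
    """Encode a list as (run_length, value) pairs, as RLE does in Parquet."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            runs.append((1, v))
    return runs

def count_matches_decoded(runs, predicate):
    """Baseline: decode every value, then filter (the current approach)."""
    decoded = [v for n, v in runs for _ in range(n)]
    return sum(1 for v in decoded if predicate(v))

def count_matches_encoded(runs, predicate):
    """Evaluate the predicate once per run instead of once per value."""
    return sum(n for n, v in runs if predicate(v))

runs = rle_encode([1, 1, 1, 2, 2, 3, 3, 3, 3])
assert count_matches_decoded(runs, lambda v: v == 3) == \
       count_matches_encoded(runs, lambda v: v == 3) == 4
```

The same run-at-a-time shortcut is what makes dictionary- and delta-encoded columns (e.g. DELTA_BYTE_ARRAY in Parquet v2) attractive targets: comparisons can be done on the small dictionary rather than on every row.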
[jira] [Assigned] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41925: - Assignee: Dongjoon Hyun > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35743) Improve Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-35743. -- Fix Version/s: 3.4.0 Resolution: Fixed > Improve Parquet vectorized reader > - > > Key: SPARK-35743 > URL: https://issues.apache.org/jira/browse/SPARK-35743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: parquet > Fix For: 3.4.0 > > > This umbrella JIRA tracks efforts to improve vectorized Parquet reader.
[jira] [Updated] (SPARK-36527) Implement lazy materialization for the vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36527: - Parent: (was: SPARK-35743) Issue Type: Improvement (was: Sub-task) > Implement lazy materialization for the vectorized Parquet reader > > > Key: SPARK-36527 > URL: https://issues.apache.org/jira/browse/SPARK-36527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > At the moment the Parquet vectorized reader will eagerly decode all the > columns that are in the read schema, before any filter has been applied to > them. This is costly. Instead it's better to only materialize these column > vectors when the data are actually needed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
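The deferral SPARK-36527 describes can be sketched in plain Python (hypothetical names, not Spark's ColumnVector API): a column is decoded only when a row that survived the filter actually reads it, so columns the filter eliminates every row from are never decoded at all.

```python
# Hypothetical sketch of lazy materialization: the "decode" work for a column
# is deferred until the first access instead of being done eagerly for every
# column in the read schema.

class LazyColumn:
    def __init__(self, encoded, decode):
        self._encoded = encoded
        self._decode = decode
        self._values = None
        self.decode_calls = 0  # instrumentation for the example

    def get(self, row):
        if self._values is None:  # materialize on first access only
            self.decode_calls += 1
            self._values = self._decode(self._encoded)
        return self._values[row]

# Two columns; the filter only ever touches `age`.
age = LazyColumn([b"\x21", b"\x2a"], lambda enc: [b[0] for b in enc])
name = LazyColumn([b"ann", b"bob"], lambda enc: [b.decode() for b in enc])

# Filter pass: no row satisfies age > 100, so `name` is never materialized.
survivors = [r for r in range(2) if age.get(r) > 100]
values = [name.get(r) for r in survivors]

assert age.decode_calls == 1 and name.decode_calls == 0
```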
[jira] [Commented] (SPARK-35743) Improve Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655560#comment-17655560 ] Dongjoon Hyun commented on SPARK-35743: --- Hi, [~csun]. Please update the JIRA's target version and label field if you want to have this at 3.4.0. > Improve Parquet vectorized reader > - > > Key: SPARK-35743 > URL: https://issues.apache.org/jira/browse/SPARK-35743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: parquet > > This umbrella JIRA tracks efforts to improve vectorized Parquet reader.
[jira] [Updated] (SPARK-35743) Improve Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35743: -- Target Version/s: (was: 3.3.0) > Improve Parquet vectorized reader > - > > Key: SPARK-35743 > URL: https://issues.apache.org/jira/browse/SPARK-35743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: parquet > > This umbrella JIRA tracks efforts to improve vectorized Parquet reader.
[jira] [Resolved] (SPARK-41895) Add tests for streaming UI with RocksDB backend
[ https://issues.apache.org/jira/browse/SPARK-41895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-41895. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39415 [https://github.com/apache/spark/pull/39415] > Add tests for streaming UI with RocksDB backend > --- > > Key: SPARK-41895 > URL: https://issues.apache.org/jira/browse/SPARK-41895 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Commented] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1767#comment-1767 ] Apache Spark commented on SPARK-41925: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39434 > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1766#comment-1766 ] Apache Spark commented on SPARK-41925: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39434 > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41925: Assignee: (was: Apache Spark) > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41925: Assignee: Apache Spark > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
Dongjoon Hyun created SPARK-41925: - Summary: Enable spark.sql.orc.enableNestedColumnVectorizedReader by default Key: SPARK-41925 URL: https://issues.apache.org/jira/browse/SPARK-41925 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Dongjoon Hyun Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
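Until a release ships the proposed default, the flag discussed in SPARK-41925 can be enabled explicitly. A minimal `spark-defaults.conf` fragment (property names are taken from the issue title and description; `true` is the proposed new default):

```
spark.sql.orc.enableNestedColumnVectorizedReader      true
spark.sql.parquet.enableNestedColumnVectorizedReader  true
```

Both settings can equally be passed per-session, e.g. via `--conf` on `spark-submit`.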
[jira] [Commented] (SPARK-41918) Refine the naming in proto messages
[ https://issues.apache.org/jira/browse/SPARK-41918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1764#comment-1764 ] Rui Wang commented on SPARK-41918: -- I did some tests locally and find something as the below: If I rename a field, of course the code that access the field must be updated. Then in terms of backwards compatibility, the client uses old named field can talk to the server uses the new named field without a problem. also in terms of forwards compatibility, it works nicely. So now probably I know it better: renaming fields only require to recompile the code after that binaries are supposed to work as before. > Refine the naming in proto messages > --- > > Key: SPARK-41918 > URL: https://issues.apache.org/jira/browse/SPARK-41918 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > normally, we name the fields after the corresponding LogiclalPlan or > DataFrame API, but they are not consistent in protos, for example, the column > name: > {code:java} > message UnresolvedRegex { > // (Required) The column name used to extract column with regex. > string col_name = 1; > } > {code} > {code:java} > message Alias { > // (Required) The expression that alias will be added on. > Expression expr = 1; > // (Required) a list of name parts for the alias. > // > // Scalar columns only has one name that presents. > repeated string name = 2; > // (Optional) Alias metadata expressed as a JSON map. > optional string metadata = 3; > } > {code} > {code:java} > // Relation of type [[Deduplicate]] which have duplicate rows removed, could > consider either only > // the subset of columns or all the columns. > message Deduplicate { > // (Required) Input relation for a Deduplicate. > Relation input = 1; > // (Optional) Deduplicate based on a list of column names. > // > // This field does not co-use with `all_columns_as_keys`. 
> repeated string column_names = 2; > // (Optional) Deduplicate based on all the columns of the input relation. > // > // This field does not co-use with `column_names`. > optional bool all_columns_as_keys = 3; > } > {code} > {code:java} > // Computes basic statistics for numeric and string columns, including count, > mean, stddev, min, > // and max. If no columns are given, this function computes statistics for > all numerical or > // string columns. > message StatDescribe { > // (Required) The input relation. > Relation input = 1; > // (Optional) Columns to compute statistics on. > repeated string cols = 2; > } > {code} > we probably should unify the naming: > single column -> `column` > multi columns -> `columns` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41820) DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement failed
[ https://issues.apache.org/jira/browse/SPARK-41820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41820: -- Description: {code:java} >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", >>> "name"]) >>> df.createOrReplaceGlobalTempView("people") {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1292, in pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView Failed example: df2.createOrReplaceGlobalTempView("people") Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df2.createOrReplaceGlobalTempView("people") File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1192, in createOrReplaceGlobalTempView self._session.client.execute_command(command) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 459, in execute_command self._execute(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 547, in _execute self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 625, in _handle_error raise SparkConnectException(status.message) from None pyspark.sql.connect.client.SparkConnectException: requirement failed {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1292, in pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView Failed example: df2.createOrReplaceGlobalTempView("people") Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, 
"single", File "", line 1, in df2.createOrReplaceGlobalTempView("people") File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1192, in createOrReplaceGlobalTempView self._session.client.execute_command(command) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 459, in execute_command self._execute(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 547, in _execute self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 625, in _handle_error raise SparkConnectException(status.message) from None pyspark.sql.connect.client.SparkConnectException: requirement failed {code} > DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement > failed > --- > > Key: SPARK-41820 > URL: https://issues.apache.org/jira/browse/SPARK-41820 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", > >>> "name"]) > >>> df.createOrReplaceGlobalTempView("people") {code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1292, in > pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView > Failed example: > df2.createOrReplaceGlobalTempView("people") > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView[3]>", > line 1, in > df2.createOrReplaceGlobalTempView("people") > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1192, in createOrReplaceGlobalTempView > 
self._session.client.execute_command(command) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 459, in execute_command > self._execute(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 547, in _execute > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 625, in _handle_error > raise SparkConnectException(statu
[jira] [Comment Edited] (SPARK-41918) Refine the naming in proto messages
[ https://issues.apache.org/jira/browse/SPARK-41918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655545#comment-17655545 ] Rui Wang edited comment on SPARK-41918 at 1/6/23 6:35 PM: -- [~grundprinzip-db] I am a bit confused on the renaming and what compatibility it offers: ``` message Foo { int a = 1; } ``` On the receiver side it access the a val t = foo.a + 1 then we allow rename field ``` message Foo { int b = 1; } ``` Any renaming will break the receiver side's code? Do I misunderstand `WIRE compatibility` that the receiver should be able to read the output after the wire? was (Author: amaliujia): [~grundprinzip-db] I am a bit confused on the renaming and what compatibility it offers: ``` message Foo { int a = 1; } ``` On the receiver side it access the a val t = foo.a + 1 Any renaming will break the receiver side's code? Do I misunderstand `WIRE compatibility` that the receiver should be able to read the output after the wire? > Refine the naming in proto messages > --- > > Key: SPARK-41918 > URL: https://issues.apache.org/jira/browse/SPARK-41918 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > normally, we name the fields after the corresponding LogiclalPlan or > DataFrame API, but they are not consistent in protos, for example, the column > name: > {code:java} > message UnresolvedRegex { > // (Required) The column name used to extract column with regex. > string col_name = 1; > } > {code} > {code:java} > message Alias { > // (Required) The expression that alias will be added on. > Expression expr = 1; > // (Required) a list of name parts for the alias. > // > // Scalar columns only has one name that presents. > repeated string name = 2; > // (Optional) Alias metadata expressed as a JSON map. 
> optional string metadata = 3; > } > {code} > {code:java} > // Relation of type [[Deduplicate]] which have duplicate rows removed, could > consider either only > // the subset of columns or all the columns. > message Deduplicate { > // (Required) Input relation for a Deduplicate. > Relation input = 1; > // (Optional) Deduplicate based on a list of column names. > // > // This field does not co-use with `all_columns_as_keys`. > repeated string column_names = 2; > // (Optional) Deduplicate based on all the columns of the input relation. > // > // This field does not co-use with `column_names`. > optional bool all_columns_as_keys = 3; > } > {code} > {code:java} > // Computes basic statistics for numeric and string columns, including count, > mean, stddev, min, > // and max. If no columns are given, this function computes statistics for > all numerical or > // string columns. > message StatDescribe { > // (Required) The input relation. > Relation input = 1; > // (Optional) Columns to compute statistics on. > repeated string cols = 2; > } > {code} > we probably should unify the naming: > single column -> `column` > multi columns -> `columns` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41918) Refine the naming in proto messages
[ https://issues.apache.org/jira/browse/SPARK-41918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655545#comment-17655545 ] Rui Wang commented on SPARK-41918: -- [~grundprinzip-db] I am a bit confused on the renaming and what compatibility it offers: ``` message Foo { int a = 1; } ``` On the receiver side it access the a val t = foo.a + 1 Any renaming will break the receiver side's code? Do I misunderstand `WIRE compatibility` that the receiver should be able to read the output after the wire? > Refine the naming in proto messages > --- > > Key: SPARK-41918 > URL: https://issues.apache.org/jira/browse/SPARK-41918 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > normally, we name the fields after the corresponding LogiclalPlan or > DataFrame API, but they are not consistent in protos, for example, the column > name: > {code:java} > message UnresolvedRegex { > // (Required) The column name used to extract column with regex. > string col_name = 1; > } > {code} > {code:java} > message Alias { > // (Required) The expression that alias will be added on. > Expression expr = 1; > // (Required) a list of name parts for the alias. > // > // Scalar columns only has one name that presents. > repeated string name = 2; > // (Optional) Alias metadata expressed as a JSON map. > optional string metadata = 3; > } > {code} > {code:java} > // Relation of type [[Deduplicate]] which have duplicate rows removed, could > consider either only > // the subset of columns or all the columns. > message Deduplicate { > // (Required) Input relation for a Deduplicate. > Relation input = 1; > // (Optional) Deduplicate based on a list of column names. > // > // This field does not co-use with `all_columns_as_keys`. > repeated string column_names = 2; > // (Optional) Deduplicate based on all the columns of the input relation. 
> // > // This field does not co-use with `column_names`. > optional bool all_columns_as_keys = 3; > } > {code} > {code:java} > // Computes basic statistics for numeric and string columns, including count, > mean, stddev, min, > // and max. If no columns are given, this function computes statistics for > all numerical or > // string columns. > message StatDescribe { > // (Required) The input relation. > Relation input = 1; > // (Optional) Columns to compute statistics on. > repeated string cols = 2; > } > {code} > we probably should unify the naming: > single column -> `column` > multi columns -> `columns` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
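The wire-compatibility question above comes down to how protobuf identifies fields: only the field *number* travels on the wire, never the name, so renaming `a` to `b` is wire-compatible but breaks generated-code callers until they are recompiled. A hypothetical pure-Python sketch of that keyed-by-number encoding (a toy model, not real protobuf code):

```python
# Toy model of protobuf's wire format: a message is serialized as
# (field_number, value) pairs, so the field *name* never reaches the wire.

def serialize(message: dict) -> list:
    # `message` maps field_number -> value, mirroring `int a = 1;`
    return sorted(message.items())

def deserialize(wire: list) -> dict:
    return dict(wire)

# Old schema: `message Foo { int a = 1; }` -- sender keys the value by number 1.
wire_bytes = serialize({1: 41})

# New schema: `message Foo { int b = 1; }` -- receiver also looks up number 1,
# so the payload round-trips; only the source-level accessor name changed
# (`foo.a + 1` must be rewritten as `foo.b + 1` and recompiled).
foo = deserialize(wire_bytes)
t = foo[1] + 1

assert t == 42
```

This is why the naming cleanup proposed in the issue description (`column` for a single column, `columns` for many) is safe on the wire as long as field numbers are left untouched.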
[jira] [Assigned] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41898: Assignee: (was: Apache Spark) > Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as > argument > > > Key: SPARK-41898 > URL: https://issues.apache.org/jira/browse/SPARK-41898 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], > ["key", "value"]) > w = Window.partitionBy("value").orderBy("key") > from pyspark.sql import functions as F > sel = df.select( > df.value, > df.key, > F.max("key").over(w.rowsBetween(0, 1)), > F.min("key").over(w.rowsBetween(0, 1)), > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), > F.row_number().over(w), > F.rank().over(w), > F.dense_rank().over(w), > F.ntile(2).over(w), > ) > rs = sorted(sel.collect()){code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 821, in test_window_functions > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", > line 152, in rowsBetween raise TypeError(f"start must be a int, but got > {type(start).__name__}") TypeError: start must be a int, but got float {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41898: Assignee: Apache Spark > Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as > argument > > > Key: SPARK-41898 > URL: https://issues.apache.org/jira/browse/SPARK-41898 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > > {code:java} > df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], > ["key", "value"]) > w = Window.partitionBy("value").orderBy("key") > from pyspark.sql import functions as F > sel = df.select( > df.value, > df.key, > F.max("key").over(w.rowsBetween(0, 1)), > F.min("key").over(w.rowsBetween(0, 1)), > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), > F.row_number().over(w), > F.rank().over(w), > F.dense_rank().over(w), > F.ntile(2).over(w), > ) > rs = sorted(sel.collect()){code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 821, in test_window_functions > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", > line 152, in rowsBetween raise TypeError(f"start must be a int, but got > {type(start).__name__}") TypeError: start must be a int, but got float {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655539#comment-17655539 ] Apache Spark commented on SPARK-41898: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39433 > Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as > argument > > > Key: SPARK-41898 > URL: https://issues.apache.org/jira/browse/SPARK-41898 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], > ["key", "value"]) > w = Window.partitionBy("value").orderBy("key") > from pyspark.sql import functions as F > sel = df.select( > df.value, > df.key, > F.max("key").over(w.rowsBetween(0, 1)), > F.min("key").over(w.rowsBetween(0, 1)), > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), > F.row_number().over(w), > F.rank().over(w), > F.dense_rank().over(w), > F.ntile(2).over(w), > ) > rs = sorted(sel.collect()){code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 821, in test_window_functions > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", > line 152, in rowsBetween raise TypeError(f"start must be a int, but got > {type(start).__name__}") TypeError: start must be a int, but got float {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
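One way for the Connect implementation to match classic PySpark behavior is to normalize infinite float bounds to the unbounded sentinels before the integer type check. The helper below is a hypothetical sketch, not the actual patch; the sentinel constants assume JVM long min/max, mirroring `Window.unboundedPreceding`/`Window.unboundedFollowing`:

```python
# Hypothetical sketch of normalizing window-frame bounds so that
# float("-inf")/float("inf") are accepted as unbounded, as the classic
# PySpark Window API tolerates. Sentinel values assume JVM long min/max.

JVM_LONG_MIN = -(1 << 63)     # stands in for Window.unboundedPreceding
JVM_LONG_MAX = (1 << 63) - 1  # stands in for Window.unboundedFollowing

def normalize_bound(bound):
    """Map infinite floats (and out-of-range ints) to the unbounded sentinels."""
    if bound == float("-inf") or bound <= JVM_LONG_MIN:
        return JVM_LONG_MIN
    if bound == float("inf") or bound >= JVM_LONG_MAX:
        return JVM_LONG_MAX
    if not isinstance(bound, int):
        # Finite floats are still rejected, as in the reported error message.
        raise TypeError(f"start must be a int, but got {type(bound).__name__}")
    return bound

assert normalize_bound(float("-inf")) == JVM_LONG_MIN
assert normalize_bound(float("inf")) == JVM_LONG_MAX
assert normalize_bound(3) == 3
```

With such a normalization applied first, `w.rowsBetween(float("-inf"), float("inf"))` from the reproduction above would translate to an unbounded frame instead of raising `TypeError`.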
[jira] [Commented] (SPARK-41911) Add version fields to Connect proto
[ https://issues.apache.org/jira/browse/SPARK-41911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655537#comment-17655537 ] Rui Wang commented on SPARK-41911: -- We will know better where we need the versions during the process of defining the compatibility requirement. > Add version fields to Connect proto > --- > > Key: SPARK-41911 > URL: https://issues.apache.org/jira/browse/SPARK-41911 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > > We may need this to help maintain compatibility. Depending on the concrete > protocol design, we may use field number 1 for version fields thus may cause > breaking changes on existing proto messages.
[jira] [Commented] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
[ https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655532#comment-17655532 ] L. C. Hsieh commented on SPARK-41049: - For a correctness bug, I think we should backport it, though the patch is a kind of refactoring work. > Nondeterministic expressions have unstable values if they are children of > CodegenFallback expressions > - > > Key: SPARK-41049 > URL: https://issues.apache.org/jira/browse/SPARK-41049 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Guy Boo >Assignee: Wenchen Fan >Priority: Major > Labels: correctness > Fix For: 3.4.0 > > > h2. Expectation > For a given row, Nondeterministic expressions are expected to have stable > values. > {code:scala} > import org.apache.spark.sql.functions._ > val df = sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > df.select(v1, v1).collect{code} > Returns a set like this: > |8777|8777| > |1357|1357| > |3435|3435| > |9204|9204| > |3870|3870| > where both columns always have the same value, but what that value is changes > from row to row. This is different from the following: > {code:scala} > df.select(rand(), rand()).collect{code} > In this case, because the rand() calls are distinct, the values in both > columns should be different. > h2. Problem > This expectation does not appear to be stable in the event that any > subsequent expression is a CodegenFallback. 
This program: > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > val sparkSession = SparkSession.builder().getOrCreate() > val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback > df.select(v1, v1, v2, v2).collect {code} > produces output like this: > |8159|8159|8159|{color:#ff}2028{color}| > |8320|8320|8320|{color:#ff}1640{color}| > |7937|7937|7937|{color:#ff}769{color}| > |436|436|436|{color:#ff}8924{color}| > |8924|8924|2827|{color:#ff}2731{color}| > Not sure why the first call via the CodegenFallback path should be correct > while subsequent calls aren't. > h2. Workaround > If the Nondeterministic expression is moved to a separate, earlier select() > call, so the CodegenFallback instead only refers to a column reference, then > the problem seems to go away. But this workaround may not be reliable if > optimization is ever able to restructure adjacent select()s. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
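The workaround in the report above can be illustrated with a minimal plain-Python model. This is a hedged sketch, not Spark code: `rand_expr`, `buggy_row`, and `fixed_row` are hypothetical names, and ordinary function calls stand in for Spark's per-row expression evaluation. Materializing the nondeterministic value once and only referencing the result afterward mirrors projecting it in an earlier select(), so the CodegenFallback expression sees a column reference rather than the expression itself.

```python
import random

# Stand-in for a nondeterministic expression such as rand():
# every evaluation produces a fresh value.
def rand_expr():
    return random.randint(0, 9999)

# Bug analogue: a fallback path re-evaluates the expression for each
# reference, so two references to "the same" column can disagree.
def buggy_row():
    return (rand_expr(), rand_expr())

# Workaround analogue: materialize the value once (the earlier select()),
# then have every downstream reference read the materialized result.
def fixed_row():
    v1 = rand_expr()   # evaluated exactly once per "row"
    return (v1, v1)    # all references observe the same value

a, b = fixed_row()
assert a == b          # stable within a row, as the issue expects
```

As the reporter notes, this workaround leans on the two select()s staying separate; an optimizer that collapses adjacent projections could reintroduce the re-evaluation.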
[jira] [Commented] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
[ https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655516#comment-17655516 ] Erik Krogen commented on SPARK-41049: - Thanks! [~cloud_fan] [~viirya] shall we backport this to branch-3.3 and branch-3.2, given it is a correctness bug? > Nondeterministic expressions have unstable values if they are children of > CodegenFallback expressions > - > > Key: SPARK-41049 > URL: https://issues.apache.org/jira/browse/SPARK-41049 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Guy Boo >Assignee: Wenchen Fan >Priority: Major > Labels: correctness > Fix For: 3.4.0 > > > h2. Expectation > For a given row, Nondeterministic expressions are expected to have stable > values. > {code:scala} > import org.apache.spark.sql.functions._ > val df = sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > df.select(v1, v1).collect{code} > Returns a set like this: > |8777|8777| > |1357|1357| > |3435|3435| > |9204|9204| > |3870|3870| > where both columns always have the same value, but what that value is changes > from row to row. This is different from the following: > {code:scala} > df.select(rand(), rand()).collect{code} > In this case, because the rand() calls are distinct, the values in both > columns should be different. > h2. Problem > This expectation does not appear to be stable in the event that any > subsequent expression is a CodegenFallback. 
This program: > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > val sparkSession = SparkSession.builder().getOrCreate() > val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback > df.select(v1, v1, v2, v2).collect {code} > produces output like this: > |8159|8159|8159|{color:#ff}2028{color}| > |8320|8320|8320|{color:#ff}1640{color}| > |7937|7937|7937|{color:#ff}769{color}| > |436|436|436|{color:#ff}8924{color}| > |8924|8924|2827|{color:#ff}2731{color}| > Not sure why the first call via the CodegenFallback path should be correct > while subsequent calls aren't. > h2. Workaround > If the Nondeterministic expression is moved to a separate, earlier select() > call, so the CodegenFallback instead only refers to a column reference, then > the problem seems to go away. But this workaround may not be reliable if > optimization is ever able to restructure adjacent select()s. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41890) Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/`SparkPlanGraphWrapperSerializer` for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-41890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-41890. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39399 [https://github.com/apache/spark/pull/39399] > Reduce `toSeq` in > `RDDOperationGraphWrapperSerializer`/`SparkPlanGraphWrapperSerializer` for > Scala 2.13 > -- > > Key: SPARK-41890 > URL: https://issues.apache.org/jira/browse/SPARK-41890 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL, Web UI >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > Similar work to SPARK-41709 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41800) Upgrade commons-compress to 1.22
[ https://issues.apache.org/jira/browse/SPARK-41800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-41800: Assignee: BingKun Pan > Upgrade commons-compress to 1.22 > > > Key: SPARK-41800 > URL: https://issues.apache.org/jira/browse/SPARK-41800 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41800) Upgrade commons-compress to 1.22
[ https://issues.apache.org/jira/browse/SPARK-41800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-41800. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39326 [https://github.com/apache/spark/pull/39326] > Upgrade commons-compress to 1.22 > > > Key: SPARK-41800 > URL: https://issues.apache.org/jira/browse/SPARK-41800 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41890) Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/`SparkPlanGraphWrapperSerializer` for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-41890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-41890: Assignee: Yang Jie > Reduce `toSeq` in > `RDDOperationGraphWrapperSerializer`/`SparkPlanGraphWrapperSerializer` for > Scala 2.13 > -- > > Key: SPARK-41890 > URL: https://issues.apache.org/jira/browse/SPARK-41890 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL, Web UI >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > Similar work to SPARK-41709 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655442#comment-17655442 ] Hyukjin Kwon commented on SPARK-41824: -- It's actually an implementation detail in PySpark. It would be difficult to make them match. Let's either fix the test to be compatible with both cases, or simply skip it with {{doctest: +SKIP}} > Implement DataFrame.explain format to be similar to PySpark > --- > > Key: SPARK-41824 > URL: https://issues.apache.org/jira/browse/SPARK-41824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", > "name"]) > df.explain() > df.explain(True) > df.explain(mode="formatted") > df.explain("cost"){code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain() > Expected: > == Physical Plan == > *(1) Scan ExistingRDD[age...,name...] > Got: > == Physical Plan == > LocalTableScan [age#1148L, name#1149] > > > ** > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain(mode="formatted") > Expected: > == Physical Plan == > * Scan ExistingRDD (...) > (1) Scan ExistingRDD [codegen id : ...] > Output [2]: [age..., name...] > ... > Got: > == Physical Plan == > LocalTableScan (1) > > > (1) LocalTableScan > Output [2]: [age#1170L, name#1171] > Arguments: [age#1170L, name#1171] > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
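The {{doctest: +SKIP}} suggestion above can be demonstrated with the standard-library doctest module alone. This is a hedged sketch: `explain_demo` is a made-up function, not part of PySpark, and its docstring example stands in for the environment-dependent explain() output. The +SKIP directive tells the runner not to execute the example at all, so its expected output is never compared.

```python
import doctest

def explain_demo():
    """A hypothetical doctest whose output varies by environment.

    >>> print("== Physical Plan ==")  # doctest: +SKIP
    output that differs between PySpark and Spark Connect
    """

# Run just this docstring: the SKIP-ed example is neither executed nor
# counted, so it can never fail regardless of what plan a backend prints.
runner = doctest.DocTestRunner()
for test in doctest.DocTestFinder().find(explain_demo):
    runner.run(test)
assert runner.failures == 0 and runner.tries == 0
```

The alternative the comment mentions, loosening the expected output with `...` under `ELLIPSIS`, keeps some checking; `+SKIP` trades that coverage for stability.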
[jira] [Assigned] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
[ https://issues.apache.org/jira/browse/SPARK-41924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41924: Assignee: Apache Spark > Make StructType support metadata and Implement `DataFrame.withMetadata` > --- > > Key: SPARK-41924 > URL: https://issues.apache.org/jira/browse/SPARK-41924 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
[ https://issues.apache.org/jira/browse/SPARK-41924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41924: Assignee: (was: Apache Spark) > Make StructType support metadata and Implement `DataFrame.withMetadata` > --- > > Key: SPARK-41924 > URL: https://issues.apache.org/jira/browse/SPARK-41924 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
[ https://issues.apache.org/jira/browse/SPARK-41924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655436#comment-17655436 ] Apache Spark commented on SPARK-41924: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39432 > Make StructType support metadata and Implement `DataFrame.withMetadata` > --- > > Key: SPARK-41924 > URL: https://issues.apache.org/jira/browse/SPARK-41924 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
Ruifeng Zheng created SPARK-41924: - Summary: Make StructType support metadata and Implement `DataFrame.withMetadata` Key: SPARK-41924 URL: https://issues.apache.org/jira/browse/SPARK-41924 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655434#comment-17655434 ] jiaan.geng edited comment on SPARK-41824 at 1/6/23 12:56 PM: - I found the Scala API Dataset.explain prints the same output as connect.dataframe. {code:java} == Physical Plan == *(1) Project [_1#x AS age#x, _2#x AS name#x] +- *(1) LocalTableScan [_1#x, _2#x] {code} So, do we need to follow the behavior of the PySpark or the Scala API? was (Author: beliefer): I found the Scala API Dataset.explain prints the same output as connect.dataframe. {code:java} == Physical Plan == *(1) Project [_1#x AS age#x, _2#x AS name#x] +- *(1) LocalTableScan [_1#x, _2#x] {code} > Implement DataFrame.explain format to be similar to PySpark > --- > > Key: SPARK-41824 > URL: https://issues.apache.org/jira/browse/SPARK-41824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", > "name"]) > df.explain() > df.explain(True) > df.explain(mode="formatted") > df.explain("cost"){code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain() > Expected: > == Physical Plan == > *(1) Scan ExistingRDD[age...,name...] > Got: > == Physical Plan == > LocalTableScan [age#1148L, name#1149] > > > ** > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain(mode="formatted") > Expected: > == Physical Plan == > * Scan ExistingRDD (...) > (1) Scan ExistingRDD [codegen id : ...] > Output [2]: [age..., name...] > ... 
> Got: > == Physical Plan == > LocalTableScan (1) > > > (1) LocalTableScan > Output [2]: [age#1170L, name#1171] > Arguments: [age#1170L, name#1171] > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655434#comment-17655434 ] jiaan.geng edited comment on SPARK-41824 at 1/6/23 12:56 PM: - I found the Scala API Dataset.explain prints the same output as connect.dataframe. {code:java} == Physical Plan == *(1) Project [_1#x AS age#x, _2#x AS name#x] +- *(1) LocalTableScan [_1#x, _2#x] {code} So, do we need to follow the behavior of the PySpark or the Scala API? cc [~gurwls223] was (Author: beliefer): I found the Scala API Dataset.explain prints the same output as connect.dataframe. {code:java} == Physical Plan == *(1) Project [_1#x AS age#x, _2#x AS name#x] +- *(1) LocalTableScan [_1#x, _2#x] {code} So, do we need to follow the behavior of the PySpark or the Scala API? > Implement DataFrame.explain format to be similar to PySpark > --- > > Key: SPARK-41824 > URL: https://issues.apache.org/jira/browse/SPARK-41824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", > "name"]) > df.explain() > df.explain(True) > df.explain(mode="formatted") > df.explain("cost"){code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain() > Expected: > == Physical Plan == > *(1) Scan ExistingRDD[age...,name...] > Got: > == Physical Plan == > LocalTableScan [age#1148L, name#1149] > > > ** > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain(mode="formatted") > Expected: > == Physical Plan == > * Scan ExistingRDD (...) > (1) Scan ExistingRDD [codegen id : ...] > Output [2]: [age..., name...] > ... 
> Got: > == Physical Plan == > LocalTableScan (1) > > > (1) LocalTableScan > Output [2]: [age#1170L, name#1171] > Arguments: [age#1170L, name#1171] > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org