[jira] [Resolved] (SPARK-41935) Skip snapshot check and transfer progress log during publishing snapshots
[ https://issues.apache.org/jira/browse/SPARK-41935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-41935.
-----------------------------------
    Resolution: Invalid

> Skip snapshot check and transfer progress log during publishing snapshots
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41935
>                 URL: https://issues.apache.org/jira/browse/SPARK-41935
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41935) Skip snapshot check and transfer progress log during publishing snapshots
[ https://issues.apache.org/jira/browse/SPARK-41935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41935:
------------------------------------

    Assignee:     (was: Apache Spark)

> Skip snapshot check and transfer progress log during publishing snapshots
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41935
>                 URL: https://issues.apache.org/jira/browse/SPARK-41935
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Assigned] (SPARK-41935) Skip snapshot check and transfer progress log during publishing snapshots
[ https://issues.apache.org/jira/browse/SPARK-41935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41935:
------------------------------------

    Assignee: Apache Spark

> Skip snapshot check and transfer progress log during publishing snapshots
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41935
>                 URL: https://issues.apache.org/jira/browse/SPARK-41935
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Apache Spark
>            Priority: Minor
>
[jira] [Commented] (SPARK-41935) Skip snapshot check and transfer progress log during publishing snapshots
[ https://issues.apache.org/jira/browse/SPARK-41935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655659#comment-17655659 ]

Apache Spark commented on SPARK-41935:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39443

> Skip snapshot check and transfer progress log during publishing snapshots
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41935
>                 URL: https://issues.apache.org/jira/browse/SPARK-41935
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Created] (SPARK-41935) Skip snapshot check and transfer progress log in release-build.sh
Dongjoon Hyun created SPARK-41935:
-------------------------------------

             Summary: Skip snapshot check and transfer progress log in release-build.sh
                 Key: SPARK-41935
                 URL: https://issues.apache.org/jira/browse/SPARK-41935
             Project: Spark
          Issue Type: Task
          Components: Project Infra
    Affects Versions: 3.4.0
            Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-41935) Skip snapshot check and transfer progress log during publishing snapshots
[ https://issues.apache.org/jira/browse/SPARK-41935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-41935:
----------------------------------
    Summary: Skip snapshot check and transfer progress log during publishing snapshots  (was: Skip snapshot check and transfer progress log in release-build.sh)

> Skip snapshot check and transfer progress log during publishing snapshots
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41935
>                 URL: https://issues.apache.org/jira/browse/SPARK-41935
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Assigned] (SPARK-41934) Add the unsupported function list for `session`
[ https://issues.apache.org/jira/browse/SPARK-41934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41934:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add the unsupported function list for `session`
> -----------------------------------------------
>
>                 Key: SPARK-41934
>                 URL: https://issues.apache.org/jira/browse/SPARK-41934
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41934) Add the unsupported function list for `session`
[ https://issues.apache.org/jira/browse/SPARK-41934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41934:
------------------------------------

    Assignee: Apache Spark

> Add the unsupported function list for `session`
> -----------------------------------------------
>
>                 Key: SPARK-41934
>                 URL: https://issues.apache.org/jira/browse/SPARK-41934
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Apache Spark
>            Priority: Major
>
[jira] [Assigned] (SPARK-41934) Add the unsupported function list for `session`
[ https://issues.apache.org/jira/browse/SPARK-41934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41934:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add the unsupported function list for `session`
> -----------------------------------------------
>
>                 Key: SPARK-41934
>                 URL: https://issues.apache.org/jira/browse/SPARK-41934
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Commented] (SPARK-41934) Add the unsupported function list for `session`
[ https://issues.apache.org/jira/browse/SPARK-41934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655658#comment-17655658 ]

Apache Spark commented on SPARK-41934:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39442

> Add the unsupported function list for `session`
> -----------------------------------------------
>
>                 Key: SPARK-41934
>                 URL: https://issues.apache.org/jira/browse/SPARK-41934
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41933) Provide local mode that automatically starts the server
[ https://issues.apache.org/jira/browse/SPARK-41933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41933:
------------------------------------

    Assignee:     (was: Apache Spark)

> Provide local mode that automatically starts the server
> -------------------------------------------------------
>
>                 Key: SPARK-41933
>                 URL: https://issues.apache.org/jira/browse/SPARK-41933
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> Currently the Spark Connect server has to be started manually which is
> troublesome for end users and developers to try Spark Connect out.
[jira] [Assigned] (SPARK-41933) Provide local mode that automatically starts the server
[ https://issues.apache.org/jira/browse/SPARK-41933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41933:
------------------------------------

    Assignee: Apache Spark

> Provide local mode that automatically starts the server
> -------------------------------------------------------
>
>                 Key: SPARK-41933
>                 URL: https://issues.apache.org/jira/browse/SPARK-41933
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently the Spark Connect server has to be started manually which is
> troublesome for end users and developers to try Spark Connect out.
[jira] [Commented] (SPARK-41933) Provide local mode that automatically starts the server
[ https://issues.apache.org/jira/browse/SPARK-41933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655657#comment-17655657 ]

Apache Spark commented on SPARK-41933:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39441

> Provide local mode that automatically starts the server
> -------------------------------------------------------
>
>                 Key: SPARK-41933
>                 URL: https://issues.apache.org/jira/browse/SPARK-41933
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> Currently the Spark Connect server has to be started manually which is
> troublesome for end users and developers to try Spark Connect out.
[jira] [Created] (SPARK-41934) Add the unsupported function list for `session`
Ruifeng Zheng created SPARK-41934:
-------------------------------------

             Summary: Add the unsupported function list for `session`
                 Key: SPARK-41934
                 URL: https://issues.apache.org/jira/browse/SPARK-41934
             Project: Spark
          Issue Type: Sub-task
          Components: Connect, PySpark
    Affects Versions: 3.4.0
            Reporter: Ruifeng Zheng
[jira] [Created] (SPARK-41933) Provide local mode that automatically starts the server
Hyukjin Kwon created SPARK-41933:
------------------------------------

             Summary: Provide local mode that automatically starts the server
                 Key: SPARK-41933
                 URL: https://issues.apache.org/jira/browse/SPARK-41933
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Hyukjin Kwon


Currently the Spark Connect server has to be started manually which is
troublesome for end users and developers to try Spark Connect out.
[jira] [Created] (SPARK-41932) Bootstrapping Spark Connect
Hyukjin Kwon created SPARK-41932:
------------------------------------

             Summary: Bootstrapping Spark Connect
                 Key: SPARK-41932
                 URL: https://issues.apache.org/jira/browse/SPARK-41932
             Project: Spark
          Issue Type: Test
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Hyukjin Kwon
            Assignee: Hyukjin Kwon


We should:
1. Have an easy way to start the server. Like sbin/start-thriftserver
2. Provide an easy way to run the PySpark shell without manual server start. Like spark-sql script.
[jira] [Commented] (SPARK-40451) Type annotations for Spark Connect Python client
[ https://issues.apache.org/jira/browse/SPARK-40451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655647#comment-17655647 ]

Hyukjin Kwon commented on SPARK-40451:
--------------------------------------

I believe this is done.

> Type annotations for Spark Connect Python client
> ------------------------------------------------
>
>                 Key: SPARK-40451
>                 URL: https://issues.apache.org/jira/browse/SPARK-40451
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Priority: Major
>
> The mypy checks for the Spark Connect client have been disabled to make
> quicker progress with the merge of the code. The goal for this task is to
> address the failing checks and re-enable mypy.
[jira] [Resolved] (SPARK-40451) Type annotations for Spark Connect Python client
[ https://issues.apache.org/jira/browse/SPARK-40451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-40451.
----------------------------------
      Assignee: Hyukjin Kwon
    Resolution: Done

> Type annotations for Spark Connect Python client
> ------------------------------------------------
>
>                 Key: SPARK-40451
>                 URL: https://issues.apache.org/jira/browse/SPARK-40451
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> The mypy checks for the Spark Connect client have been disabled to make
> quicker progress with the merge of the code. The goal for this task is to
> address the failing checks and re-enable mypy.
[jira] [Resolved] (SPARK-41927) Add the unsupported list for `GroupedData`
[ https://issues.apache.org/jira/browse/SPARK-41927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41927.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39437
[https://github.com/apache/spark/pull/39437]

> Add the unsupported list for `GroupedData`
> ------------------------------------------
>
>                 Key: SPARK-41927
>                 URL: https://issues.apache.org/jira/browse/SPARK-41927
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>             Fix For: 3.4.0
>
[jira] [Assigned] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-41824:
------------------------------------

    Assignee: jiaan.geng

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
>                 Key: SPARK-41824
>                 URL: https://issues.apache.org/jira/browse/SPARK-41824
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Assignee: jiaan.geng
>            Priority: Major
>
> {code:java}
> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
> df.explain()
> df.explain(True)
> df.explain(mode="formatted")
> df.explain("cost"){code}
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
>
> **********************************************************************
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
> {code}
[jira] [Resolved] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41824.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39436
[https://github.com/apache/spark/pull/39436]

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
>                 Key: SPARK-41824
>                 URL: https://issues.apache.org/jira/browse/SPARK-41824
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Assignee: jiaan.geng
>            Priority: Major
>             Fix For: 3.4.0
>
> {code:java}
> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
> df.explain()
> df.explain(True)
> df.explain(mode="formatted")
> df.explain("cost"){code}
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
>
> **********************************************************************
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
> {code}
[jira] [Assigned] (SPARK-41927) Add the unsupported list for `GroupedData`
[ https://issues.apache.org/jira/browse/SPARK-41927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-41927:
------------------------------------

    Assignee: Ruifeng Zheng

> Add the unsupported list for `GroupedData`
> ------------------------------------------
>
>                 Key: SPARK-41927
>                 URL: https://issues.apache.org/jira/browse/SPARK-41927
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>
[jira] [Resolved] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41928.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39438
[https://github.com/apache/spark/pull/39438]

> Add the unsupported list for functions
> --------------------------------------
>
>                 Key: SPARK-41928
>                 URL: https://issues.apache.org/jira/browse/SPARK-41928
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>             Fix For: 3.4.0
>
[jira] [Assigned] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-41928:
------------------------------------

    Assignee: Ruifeng Zheng

> Add the unsupported list for functions
> --------------------------------------
>
>                 Key: SPARK-41928
>                 URL: https://issues.apache.org/jira/browse/SPARK-41928
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-41929:
------------------------------------

    Assignee: Ruifeng Zheng

> Add function array_compact
> --------------------------
>
>                 Key: SPARK-41929
>                 URL: https://issues.apache.org/jira/browse/SPARK-41929
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
[ https://issues.apache.org/jira/browse/SPARK-41930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-41930:
-------------------------------------

    Assignee: Dongjoon Hyun

> Remove `branch-3.1` from publish_snapshot job
> ---------------------------------------------
>
>                 Key: SPARK-41930
>                 URL: https://issues.apache.org/jira/browse/SPARK-41930
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Resolved] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
[ https://issues.apache.org/jira/browse/SPARK-41930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-41930.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39440
[https://github.com/apache/spark/pull/39440]

> Remove `branch-3.1` from publish_snapshot job
> ---------------------------------------------
>
>                 Key: SPARK-41930
>                 URL: https://issues.apache.org/jira/browse/SPARK-41930
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 3.4.0
>
[jira] [Resolved] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41929.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39439
[https://github.com/apache/spark/pull/39439]

> Add function array_compact
> --------------------------
>
>                 Key: SPARK-41929
>                 URL: https://issues.apache.org/jira/browse/SPARK-41929
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>             Fix For: 3.4.0
>
[jira] [Created] (SPARK-41931) Improve UNSUPPORTED_DATA_TYPE message for complex types
Serge Rielau created SPARK-41931:
------------------------------------

             Summary: Improve UNSUPPORTED_DATA_TYPE message for complex types
                 Key: SPARK-41931
                 URL: https://issues.apache.org/jira/browse/SPARK-41931
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.4.0
            Reporter: Serge Rielau


spark-sql> SELECT CAST(array(1, 2, 3) AS ARRAY);
[UNSUPPORTED_DATATYPE] Unsupported data type "ARRAY" (line 1, pos 30)

== SQL ==
SELECT CAST(array(1, 2, 3) AS ARRAY)
------------------------------^^^

This error message is confusing. We support ARRAY. We just require it to be typed.
We should have an error like:
[INCOMPLETE_TYPE_DEFINITION.ARRAY] The definition of type `ARRAY` is incomplete. You must provide an element type. For example: `ARRAY<INT>`.
Similarly for STRUCT and MAP.
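The proposal above can be sketched in plain Python. This is a hypothetical illustration only, not Spark's actual parser or error framework: the `INCOMPLETE_TYPE_HINTS` table and the `type_error_message` helper are invented here to show the intended shape of the improved message.

```python
# Hypothetical sketch of the error-message improvement proposed in
# SPARK-41931: when a complex type keyword appears without type
# parameters, point at the missing element/key/field types instead of
# raising a generic UNSUPPORTED_DATATYPE error.

INCOMPLETE_TYPE_HINTS = {
    "ARRAY": "You must provide an element type. For example: `ARRAY<INT>`.",
    "MAP": "You must provide a key type and a value type. For example: `MAP<STRING, INT>`.",
    "STRUCT": "You must provide field names and types. For example: `STRUCT<a: INT>`.",
}


def type_error_message(type_text: str) -> str:
    """Return an error message for an unsupported or incomplete type string."""
    base = type_text.strip().upper()
    if base in INCOMPLETE_TYPE_HINTS:
        # Complex type written without parameters: incomplete, not unsupported.
        return (
            f"[INCOMPLETE_TYPE_DEFINITION.{base}] "
            f"The definition of type `{base}` is incomplete. "
            + INCOMPLETE_TYPE_HINTS[base]
        )
    # Genuinely unknown type name: keep the original error class.
    return f'[UNSUPPORTED_DATATYPE] Unsupported data type "{base}"'


print(type_error_message("ARRAY"))
print(type_error_message("FOOBAR"))
```

The point of the split is that `ARRAY`, `MAP`, and `STRUCT` are supported types that were merely written incompletely, so they deserve a different error class and a concrete fix-it hint.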
[jira] [Assigned] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
[ https://issues.apache.org/jira/browse/SPARK-41930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41930:
------------------------------------

    Assignee: Apache Spark

> Remove `branch-3.1` from publish_snapshot job
> ---------------------------------------------
>
>                 Key: SPARK-41930
>                 URL: https://issues.apache.org/jira/browse/SPARK-41930
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Apache Spark
>            Priority: Minor
>
[jira] [Commented] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
[ https://issues.apache.org/jira/browse/SPARK-41930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655636#comment-17655636 ]

Apache Spark commented on SPARK-41930:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39440

> Remove `branch-3.1` from publish_snapshot job
> ---------------------------------------------
>
>                 Key: SPARK-41930
>                 URL: https://issues.apache.org/jira/browse/SPARK-41930
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Assigned] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
[ https://issues.apache.org/jira/browse/SPARK-41930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41930:
------------------------------------

    Assignee:     (was: Apache Spark)

> Remove `branch-3.1` from publish_snapshot job
> ---------------------------------------------
>
>                 Key: SPARK-41930
>                 URL: https://issues.apache.org/jira/browse/SPARK-41930
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Created] (SPARK-41930) Remove `branch-3.1` from publish_snapshot job
Dongjoon Hyun created SPARK-41930:
-------------------------------------

             Summary: Remove `branch-3.1` from publish_snapshot job
                 Key: SPARK-41930
                 URL: https://issues.apache.org/jira/browse/SPARK-41930
             Project: Spark
          Issue Type: Task
          Components: Project Infra
    Affects Versions: 3.4.0
            Reporter: Dongjoon Hyun
[jira] [Commented] (SPARK-41904) Fix Function `nth_value` functions output
[ https://issues.apache.org/jira/browse/SPARK-41904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655629#comment-17655629 ]

jiaan.geng commented on SPARK-41904:
------------------------------------

[~techaddict] Could you tell me how to reproduce this issue? I want to take a look!

> Fix Function `nth_value` functions output
> -----------------------------------------
>
>                 Key: SPARK-41904
>                 URL: https://issues.apache.org/jira/browse/SPARK-41904
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>
> {code:java}
> from pyspark.sql import Window
> from pyspark.sql.functions import nth_value
>
> df = self.spark.createDataFrame(
>     [
>         ("a", 0, None),
>         ("a", 1, "x"),
>         ("a", 2, "y"),
>         ("a", 3, "z"),
>         ("a", 4, None),
>         ("b", 1, None),
>         ("b", 2, None),
>     ],
>     schema=("key", "order", "value"),
> )
> w = Window.partitionBy("key").orderBy("order")
> rs = df.select(
>     df.key,
>     df.order,
>     nth_value("value", 2).over(w),
>     nth_value("value", 2, False).over(w),
>     nth_value("value", 2, True).over(w),
> ).collect()
> expected = [
>     ("a", 0, None, None, None),
>     ("a", 1, "x", "x", None),
>     ("a", 2, "x", "x", "y"),
>     ("a", 3, "x", "x", "y"),
>     ("a", 4, "x", "x", "y"),
>     ("b", 1, None, None, None),
>     ("b", 2, None, None, None),
> ]
> for r, ex in zip(sorted(rs), sorted(expected)):
>     self.assertEqual(tuple(r), ex[: len(r)]){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 755, in test_nth_value
>     self.assertEqual(tuple(r), ex[: len(r)])
> AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')
>
> First differing element 3:
> None
> 'x'
>
> - ('a', 1, 'x', None)
> + ('a', 1, 'x', 'x')
> ?               ^^^
> {code}
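For context, the ignoreNulls semantics that the test above exercises can be modeled in plain Python. This is a sketch only: the `nth_value` helper below is a hypothetical stand-in for the SQL window function, applied to a running frame the way an ORDER BY window (unbounded preceding to current row) would be.

```python
def nth_value(frame, n, ignore_nulls=False):
    """nth_value over a window frame (1-based), mirroring SQL semantics.

    With ignore_nulls=True, null (None) values are skipped before
    counting to the n-th element; if fewer than n values qualify,
    the result is null.
    """
    values = [v for v in frame if v is not None] if ignore_nulls else list(frame)
    return values[n - 1] if len(values) >= n else None


# Partition "a" from the ticket, ordered by `order`. The frame for each
# row is every row up to and including the current one.
part_a = [None, "x", "y", "z", None]
rows = [nth_value(part_a[: i + 1], 2, ignore_nulls=True) for i in range(len(part_a))]
print(rows)  # [None, None, 'y', 'y', 'y']
```

This reproduces the fifth column of the ticket's `expected` list for partition "a": the first non-null is "x" at order 1, so the second non-null ("y") only becomes visible from order 2 onward, which is exactly why `nth_value("value", 2, True)` should yield None for orders 0 and 1.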
[jira] [Commented] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655628#comment-17655628 ]

Apache Spark commented on SPARK-41929:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39439

> Add function array_compact
> --------------------------
>
>                 Key: SPARK-41929
>                 URL: https://issues.apache.org/jira/browse/SPARK-41929
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Commented] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655627#comment-17655627 ]

Apache Spark commented on SPARK-41929:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39439

> Add function array_compact
> --------------------------
>
>                 Key: SPARK-41929
>                 URL: https://issues.apache.org/jira/browse/SPARK-41929
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41929:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add function array_compact
> --------------------------
>
>                 Key: SPARK-41929
>                 URL: https://issues.apache.org/jira/browse/SPARK-41929
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-41929) Add function array_compact
[ https://issues.apache.org/jira/browse/SPARK-41929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41929: Assignee: Apache Spark > Add function array_compact > -- > > Key: SPARK-41929 > URL: https://issues.apache.org/jira/browse/SPARK-41929 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-41929) Add function array_compact
Ruifeng Zheng created SPARK-41929: - Summary: Add function array_compact Key: SPARK-41929 URL: https://issues.apache.org/jira/browse/SPARK-41929 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
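The `array_compact` function tracked by SPARK-41929 removes null entries from an array column. As a rough illustration of the intended semantics, here is a plain-Python sketch; the helper is not the PySpark implementation, and its null-input behavior is an assumption:

```python
def array_compact(arr):
    """Sketch of array_compact semantics: drop null (None) entries,
    keeping order and duplicates. Illustrative only -- the real function
    operates on Spark array columns, not Python lists."""
    if arr is None:
        return None  # assumed: null input yields null output
    return [x for x in arr if x is not None]

print(array_compact([1, None, 2, None, 2]))  # [1, 2, 2]
```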
[jira] [Commented] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655626#comment-17655626 ] Apache Spark commented on SPARK-41928: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39438 > Add the unsupported list for functions > -- > > Key: SPARK-41928 > URL: https://issues.apache.org/jira/browse/SPARK-41928 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Assigned] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41928: Assignee: (was: Apache Spark) > Add the unsupported list for functions > -- > > Key: SPARK-41928 > URL: https://issues.apache.org/jira/browse/SPARK-41928 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Assigned] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41928: Assignee: Apache Spark > Add the unsupported list for functions > -- > > Key: SPARK-41928 > URL: https://issues.apache.org/jira/browse/SPARK-41928 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-41928) Add the unsupported list for functions
[ https://issues.apache.org/jira/browse/SPARK-41928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655625#comment-17655625 ] Apache Spark commented on SPARK-41928: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39438 > Add the unsupported list for functions > -- > > Key: SPARK-41928 > URL: https://issues.apache.org/jira/browse/SPARK-41928 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Created] (SPARK-41928) Add the unsupported list for functions
Ruifeng Zheng created SPARK-41928: - Summary: Add the unsupported list for functions Key: SPARK-41928 URL: https://issues.apache.org/jira/browse/SPARK-41928 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Assigned] (SPARK-41927) Add the unsupported list for `GroupedData`
[ https://issues.apache.org/jira/browse/SPARK-41927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41927: Assignee: (was: Apache Spark) > Add the unsupported list for `GroupedData` > -- > > Key: SPARK-41927 > URL: https://issues.apache.org/jira/browse/SPARK-41927 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Commented] (SPARK-41927) Add the unsupported list for `GroupedData`
[ https://issues.apache.org/jira/browse/SPARK-41927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655624#comment-17655624 ] Apache Spark commented on SPARK-41927: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39437 > Add the unsupported list for `GroupedData` > -- > > Key: SPARK-41927 > URL: https://issues.apache.org/jira/browse/SPARK-41927 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Assigned] (SPARK-41927) Add the unsupported list for `GroupedData`
[ https://issues.apache.org/jira/browse/SPARK-41927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41927: Assignee: Apache Spark > Add the unsupported list for `GroupedData` > -- > > Key: SPARK-41927 > URL: https://issues.apache.org/jira/browse/SPARK-41927 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41824:

    Assignee: (was: Apache Spark)

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
> Key: SPARK-41824
> URL: https://issues.apache.org/jira/browse/SPARK-41824
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:java}
> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
> df.explain()
> df.explain(True)
> df.explain(mode="formatted")
> df.explain("cost"){code}
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
> **
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
> {code}
[jira] [Commented] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655623#comment-17655623 ]

Apache Spark commented on SPARK-41824:
--------------------------------------

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/39436

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
> Key: SPARK-41824
> URL: https://issues.apache.org/jira/browse/SPARK-41824
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:java}
> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
> df.explain()
> df.explain(True)
> df.explain(mode="formatted")
> df.explain("cost"){code}
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
> **
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
> {code}
[jira] [Assigned] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41824:

    Assignee: Apache Spark

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
> Key: SPARK-41824
> URL: https://issues.apache.org/jira/browse/SPARK-41824
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Apache Spark
> Priority: Major
>
> {code:java}
> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
> df.explain()
> df.explain(True)
> df.explain(mode="formatted")
> df.explain("cost"){code}
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
> **
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
> {code}
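The failing doctests in SPARK-41824 compare plan text, but independent of the plan output, `DataFrame.explain` accepts either a boolean `extended` flag or a string `mode`. A hedged sketch of how such arguments might be normalized to a single explain-mode string; the mode names mirror the documented PySpark modes, but the helper itself is illustrative, not PySpark source:

```python
def normalize_explain_mode(extended=None, mode=None):
    """Illustrative sketch: map explain()-style arguments to one
    plan-explain mode string (assumed behavior, not PySpark code)."""
    if extended is not None and mode is not None:
        raise ValueError("extended and mode should not be set together")
    if mode is not None:
        allowed = {"simple", "extended", "codegen", "cost", "formatted"}
        if mode not in allowed:
            raise ValueError(f"unknown explain mode: {mode!r}")
        return mode
    return "extended" if extended else "simple"

print(normalize_explain_mode(mode="cost"))  # cost
```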
[jira] [Created] (SPARK-41927) Add the unsupported list for `GroupedData`
Ruifeng Zheng created SPARK-41927: - Summary: Add the unsupported list for `GroupedData` Key: SPARK-41927 URL: https://issues.apache.org/jira/browse/SPARK-41927 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Assigned] (SPARK-41926) Add Github action test job with RocksDB as UI backend
[ https://issues.apache.org/jira/browse/SPARK-41926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41926: Assignee: Gengliang Wang (was: Apache Spark) > Add Github action test job with RocksDB as UI backend > - > > Key: SPARK-41926 > URL: https://issues.apache.org/jira/browse/SPARK-41926 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major >
[jira] [Assigned] (SPARK-41926) Add Github action test job with RocksDB as UI backend
[ https://issues.apache.org/jira/browse/SPARK-41926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41926: Assignee: Apache Spark (was: Gengliang Wang) > Add Github action test job with RocksDB as UI backend > - > > Key: SPARK-41926 > URL: https://issues.apache.org/jira/browse/SPARK-41926 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-41926) Add Github action test job with RocksDB as UI backend
[ https://issues.apache.org/jira/browse/SPARK-41926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655622#comment-17655622 ] Apache Spark commented on SPARK-41926: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/39435 > Add Github action test job with RocksDB as UI backend > - > > Key: SPARK-41926 > URL: https://issues.apache.org/jira/browse/SPARK-41926 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major >
[jira] [Created] (SPARK-41926) Add Github action test job with RocksDB as UI backend
Gengliang Wang created SPARK-41926: -- Summary: Add Github action test job with RocksDB as UI backend Key: SPARK-41926 URL: https://issues.apache.org/jira/browse/SPARK-41926 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Gengliang Wang
[jira] [Resolved] (SPARK-41875) Throw proper errors in Dataset.to()
[ https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-41875.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39422
[https://github.com/apache/spark/pull/39422]

> Throw proper errors in Dataset.to()
> -----------------------------------
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: jiaan.geng
> Priority: Major
> Fix For: 3.4.0
>
> {code:java}
> schema = StructType(
>     [StructField("i", StringType(), True), StructField("j", IntegerType(), True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
>     AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by {code}
[jira] [Assigned] (SPARK-41875) Throw proper errors in Dataset.to()
[ https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-41875:
-------------------------------------
    Assignee: jiaan.geng

> Throw proper errors in Dataset.to()
> -----------------------------------
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: jiaan.geng
> Priority: Major
>
> {code:java}
> schema = StructType(
>     [StructField("i", StringType(), True), StructField("j", IntegerType(), True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
>     AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by {code}
[jira] [Resolved] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-41898.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39433
[https://github.com/apache/spark/pull/39433]

> Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as
> argument
> -----------------------------------------------------------------------
>
> Key: SPARK-41898
> URL: https://issues.apache.org/jira/browse/SPARK-41898
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Sandeep Singh
> Priority: Major
> Fix For: 3.4.0
>
> {code:java}
> df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"])
> w = Window.partitionBy("value").orderBy("key")
> from pyspark.sql import functions as F
> sel = df.select(
>     df.value,
>     df.key,
>     F.max("key").over(w.rowsBetween(0, 1)),
>     F.min("key").over(w.rowsBetween(0, 1)),
>     F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
>     F.row_number().over(w),
>     F.rank().over(w),
>     F.dense_rank().over(w),
>     F.ntile(2).over(w),
> )
> rs = sorted(sel.collect()){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 821, in test_window_functions
>     F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", line 152, in rowsBetween
>     raise TypeError(f"start must be a int, but got {type(start).__name__}")
> TypeError: start must be a int, but got float {code}
[jira] [Assigned] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-41898:
-------------------------------------
    Assignee: Sandeep Singh

> Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as
> argument
> -----------------------------------------------------------------------
>
> Key: SPARK-41898
> URL: https://issues.apache.org/jira/browse/SPARK-41898
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Sandeep Singh
> Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"])
> w = Window.partitionBy("value").orderBy("key")
> from pyspark.sql import functions as F
> sel = df.select(
>     df.value,
>     df.key,
>     F.max("key").over(w.rowsBetween(0, 1)),
>     F.min("key").over(w.rowsBetween(0, 1)),
>     F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
>     F.row_number().over(w),
>     F.rank().over(w),
>     F.dense_rank().over(w),
>     F.ntile(2).over(w),
> )
> rs = sorted(sel.collect()){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 821, in test_window_functions
>     F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", line 152, in rowsBetween
>     raise TypeError(f"start must be a int, but got {type(start).__name__}")
> TypeError: start must be a int, but got float {code}
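The traceback above shows Spark Connect's `Window.rowsBetween` rejecting `float("-inf")`/`float("inf")`, which classic PySpark accepts by clamping them to the unbounded-frame sentinels. A hedged plain-Python sketch of that kind of normalization; the sentinel values are illustrative (mirroring `Window.unboundedPreceding`/`Window.unboundedFollowing` in spirit, not taken from the actual fix):

```python
import sys

# Illustrative sentinels; PySpark exposes the real ones as
# Window.unboundedPreceding / Window.unboundedFollowing.
UNBOUNDED_PRECEDING = -sys.maxsize - 1
UNBOUNDED_FOLLOWING = sys.maxsize

def normalize_boundary(value):
    """Sketch: accept ints, and map +/-inf floats to the unbounded
    sentinels instead of raising TypeError (assumed behavior of the
    SPARK-41898 fix, not the actual implementation)."""
    if isinstance(value, float):
        if value == float("-inf"):
            return UNBOUNDED_PRECEDING
        if value == float("inf"):
            return UNBOUNDED_FOLLOWING
        raise TypeError(f"start/end must be an int, but got {value!r}")
    if isinstance(value, int):
        return value
    raise TypeError(f"start/end must be an int, but got {type(value).__name__}")
```

With this, a call like `rowsBetween(float("-inf"), float("inf"))` would resolve to a fully unbounded frame instead of failing.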
[jira] [Assigned] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
[ https://issues.apache.org/jira/browse/SPARK-41924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41924: - Assignee: Ruifeng Zheng > Make StructType support metadata and Implement `DataFrame.withMetadata` > --- > > Key: SPARK-41924 > URL: https://issues.apache.org/jira/browse/SPARK-41924 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major >
[jira] [Resolved] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
[ https://issues.apache.org/jira/browse/SPARK-41924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41924. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39432 [https://github.com/apache/spark/pull/39432] > Make StructType support metadata and Implement `DataFrame.withMetadata` > --- > > Key: SPARK-41924 > URL: https://issues.apache.org/jira/browse/SPARK-41924 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > >
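SPARK-41924 makes struct fields carry metadata and adds `DataFrame.withMetadata`, which returns a new DataFrame with one column's metadata replaced. A toy plain-Python model of that shape; the dict-based schema representation here is a deliberate simplification, not the StructType API:

```python
def with_metadata(schema, column, metadata):
    """Toy model of DataFrame.withMetadata: `schema` maps column name to
    a (datatype, metadata-dict) pair; return an updated copy without
    mutating the input, as the real API returns a new DataFrame."""
    if column not in schema:
        raise KeyError(f"no such column: {column}")
    datatype, _old_metadata = schema[column]
    new_schema = dict(schema)
    new_schema[column] = (datatype, dict(metadata))
    return new_schema

schema = {"age": ("long", {}), "name": ("string", {})}
updated = with_metadata(schema, "age", {"unit": "years"})
print(updated["age"])  # ('long', {'unit': 'years'})
```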
[jira] [Commented] (SPARK-35743) Improve Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655604#comment-17655604 ] Dongjoon Hyun commented on SPARK-35743: --- Thank you for updating! > Improve Parquet vectorized reader > - > > Key: SPARK-35743 > URL: https://issues.apache.org/jira/browse/SPARK-35743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: parquet, releasenotes > Fix For: 3.4.0 > > > This umbrella JIRA tracks efforts to improve vectorized Parquet reader.
[jira] [Updated] (SPARK-35743) Improve Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35743: -- Labels: parquet releasenotes (was: parquet) > Improve Parquet vectorized reader > - > > Key: SPARK-35743 > URL: https://issues.apache.org/jira/browse/SPARK-35743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: parquet, releasenotes > Fix For: 3.4.0 > > > This umbrella JIRA tracks efforts to improve vectorized Parquet reader.
[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-36529:
-----------------------------
    Parent: (was: SPARK-35743)
    Issue Type: Bug (was: Sub-task)

> Decouple CPU with IO work in vectorized Parquet reader
> ------------------------------------------------------
>
> Key: SPARK-36529
> URL: https://issues.apache.org/jira/browse/SPARK-36529
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Major
>
> Currently it seems the vectorized Parquet reader does almost everything in a sequential manner:
> 1. read the row group using the file system API (perhaps from remote storage like S3)
> 2. allocate buffers and store those row group bytes into them
> 3. decompress the data pages
> 4. in Spark, decode all the read columns one by one
> 5. read the next row group and repeat from 1.
> A lot of improvements can be done to decouple the IO- and CPU-intensive work. In addition, we could parallelize the row group loading and column decoding, utilizing all the cores available for a Spark task.
[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-36529:
-----------------------------
    Issue Type: Improvement (was: Bug)

> Decouple CPU with IO work in vectorized Parquet reader
> ------------------------------------------------------
>
> Key: SPARK-36529
> URL: https://issues.apache.org/jira/browse/SPARK-36529
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Major
>
> Currently it seems the vectorized Parquet reader does almost everything in a sequential manner:
> 1. read the row group using the file system API (perhaps from remote storage like S3)
> 2. allocate buffers and store those row group bytes into them
> 3. decompress the data pages
> 4. in Spark, decode all the read columns one by one
> 5. read the next row group and repeat from 1.
> A lot of improvements can be done to decouple the IO- and CPU-intensive work. In addition, we could parallelize the row group loading and column decoding, utilizing all the cores available for a Spark task.
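The sequential steps described in SPARK-36529 can be overlapped by prefetching the next row group's bytes while the current one is being decoded. A minimal sketch of that idea; `read_row_group` and `decode_columns` are made-up stand-ins for the IO and CPU steps, not Spark or Parquet APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def read_row_group(i):
    # Stand-in for the IO step (fetching row-group bytes, e.g. from S3).
    return f"bytes-of-row-group-{i}"

def decode_columns(raw):
    # Stand-in for the CPU step (decompress + decode column chunks).
    return raw.upper()

def scan(num_row_groups):
    """Sketch: kick off the read of row group i+1 on a worker thread
    while the current thread decodes row group i, so IO overlaps CPU."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_row_group, 0)
        for i in range(num_row_groups):
            raw = pending.result()  # wait for the prefetched bytes
            if i + 1 < num_row_groups:
                pending = pool.submit(read_row_group, i + 1)  # next IO starts now
            results.append(decode_columns(raw))  # CPU work overlaps that IO
    return results

print(scan(3))  # ['BYTES-OF-ROW-GROUP-0', 'BYTES-OF-ROW-GROUP-1', 'BYTES-OF-ROW-GROUP-2']
```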
[jira] [Resolved] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41925. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39434 [https://github.com/apache/spark/pull/39434] > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default.
[jira] [Updated] (SPARK-36528) Implement lazy decoding for the vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-36528:
-----------------------------
    Parent: (was: SPARK-35743)
    Issue Type: Bug (was: Sub-task)

> Implement lazy decoding for the vectorized Parquet reader
> ---------------------------------------------------------
>
> Key: SPARK-36528
> URL: https://issues.apache.org/jira/browse/SPARK-36528
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Major
>
> Currently Spark first decodes (e.g., RLE/bit-packed, PLAIN) into column vectors and then operates on the decoded data. However, it may be more efficient to directly operate on encoded data, for instance, performing filter or aggregation on RLE-encoded data, or performing comparisons over dictionary-encoded string data. This can also potentially work with encodings in the Parquet v2 format, such as DELTA_BYTE_ARRAY.
[jira] [Updated] (SPARK-36528) Implement lazy decoding for the vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36528: - Issue Type: New Feature (was: Bug) > Implement lazy decoding for the vectorized Parquet reader > - > > Key: SPARK-36528 > URL: https://issues.apache.org/jira/browse/SPARK-36528 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark first decode (e.g., RLE/bit-packed, PLAIN) into column vector > and then operate on the decoded data. However, it may be more efficient to > directly operate on encoded data, for instance, performing filter or > aggregation on RLE-encoded data, or performing comparison over > dictionary-encoded string data. This can also potentially work with encodings > in Parquet v2 format, such as DELTA_BYTE_ARRAY. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
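The "operate directly on encoded data" idea in the SPARK-36528 description can be illustrated outside Spark. The sketch below is a hypothetical pure-Python illustration (none of these names exist in Spark's vectorized reader) of why evaluating a predicate per RLE run can beat decode-then-filter:

```python
# Hypothetical illustration of the "operate on encoded data" idea from the
# issue description; none of these names exist in Spark itself.

def rle_encode(values):
    """Encode a list as (run_length, value) pairs, as RLE does in Parquet."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            runs.append((1, v))
    return runs

def count_matches_decoded(runs, predicate):
    """Baseline: decode every value, then filter (the current approach)."""
    decoded = [v for n, v in runs for _ in range(n)]
    return sum(1 for v in decoded if predicate(v))

def count_matches_encoded(runs, predicate):
    """Evaluate the predicate once per run instead of once per value."""
    return sum(n for n, v in runs if predicate(v))

runs = rle_encode([1, 1, 1, 2, 2, 3, 3, 3, 3])
assert count_matches_decoded(runs, lambda v: v == 3) == \
       count_matches_encoded(runs, lambda v: v == 3) == 4
```

The same run-at-a-time shortcut is what makes dictionary- and delta-encoded columns (e.g. DELTA_BYTE_ARRAY in Parquet v2) attractive targets: comparisons can be done on the small dictionary rather than on every row.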
[jira] [Assigned] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41925: - Assignee: Dongjoon Hyun > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35743) Improve Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-35743. -- Fix Version/s: 3.4.0 Resolution: Fixed > Improve Parquet vectorized reader > - > > Key: SPARK-35743 > URL: https://issues.apache.org/jira/browse/SPARK-35743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: parquet > Fix For: 3.4.0 > > > This umbrella JIRA tracks efforts to improve vectorized Parquet reader.
[jira] [Updated] (SPARK-36527) Implement lazy materialization for the vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36527: - Parent: (was: SPARK-35743) Issue Type: Improvement (was: Sub-task) > Implement lazy materialization for the vectorized Parquet reader > > > Key: SPARK-36527 > URL: https://issues.apache.org/jira/browse/SPARK-36527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > At the moment the Parquet vectorized reader will eagerly decode all the > columns that are in the read schema, before any filter has been applied to > them. This is costly. Instead it's better to only materialize these column > vectors when the data are actually needed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
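The deferral SPARK-36527 describes can be sketched in plain Python (hypothetical names, not Spark's ColumnVector API): a column is decoded only when a row that survived the filter actually reads it, so columns the filter eliminates every row from are never decoded at all.

```python
# Hypothetical sketch of lazy materialization: the "decode" work for a column
# is deferred until the first access instead of being done eagerly for every
# column in the read schema.

class LazyColumn:
    def __init__(self, encoded, decode):
        self._encoded = encoded
        self._decode = decode
        self._values = None
        self.decode_calls = 0  # instrumentation for the example

    def get(self, row):
        if self._values is None:  # materialize on first access only
            self.decode_calls += 1
            self._values = self._decode(self._encoded)
        return self._values[row]

# Two columns; the filter only ever touches `age`.
age = LazyColumn([b"\x21", b"\x2a"], lambda enc: [b[0] for b in enc])
name = LazyColumn([b"ann", b"bob"], lambda enc: [b.decode() for b in enc])

# Filter pass: no row satisfies age > 100, so `name` is never materialized.
survivors = [r for r in range(2) if age.get(r) > 100]
values = [name.get(r) for r in survivors]

assert age.decode_calls == 1 and name.decode_calls == 0
```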
[jira] [Commented] (SPARK-35743) Improve Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655560#comment-17655560 ] Dongjoon Hyun commented on SPARK-35743: --- Hi, [~csun]. Please update the JIRA's target version and label field if you want to have this at 3.4.0. > Improve Parquet vectorized reader > - > > Key: SPARK-35743 > URL: https://issues.apache.org/jira/browse/SPARK-35743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: parquet > > This umbrella JIRA tracks efforts to improve vectorized Parquet reader.
[jira] [Updated] (SPARK-35743) Improve Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35743: -- Target Version/s: (was: 3.3.0) > Improve Parquet vectorized reader > - > > Key: SPARK-35743 > URL: https://issues.apache.org/jira/browse/SPARK-35743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: parquet > > This umbrella JIRA tracks efforts to improve vectorized Parquet reader.
[jira] [Resolved] (SPARK-41895) Add tests for streaming UI with RocksDB backend
[ https://issues.apache.org/jira/browse/SPARK-41895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-41895. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39415 [https://github.com/apache/spark/pull/39415] > Add tests for streaming UI with RocksDB backend > --- > > Key: SPARK-41895 > URL: https://issues.apache.org/jira/browse/SPARK-41895 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Commented] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1767#comment-1767 ] Apache Spark commented on SPARK-41925: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39434 > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1766#comment-1766 ] Apache Spark commented on SPARK-41925: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39434 > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41925: Assignee: (was: Apache Spark) > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
[ https://issues.apache.org/jira/browse/SPARK-41925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41925: Assignee: Apache Spark > Enable spark.sql.orc.enableNestedColumnVectorizedReader by default > -- > > Key: SPARK-41925 > URL: https://issues.apache.org/jira/browse/SPARK-41925 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims > to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41925) Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
Dongjoon Hyun created SPARK-41925: - Summary: Enable spark.sql.orc.enableNestedColumnVectorizedReader by default Key: SPARK-41925 URL: https://issues.apache.org/jira/browse/SPARK-41925 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Dongjoon Hyun Like `spark.sql.parquet.enableNestedColumnVectorizedReader`, this issue aims to enable `spark.sql.orc.enableNestedColumnVectorizedReader` by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
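Until a release ships the proposed default, the flag discussed in SPARK-41925 can be enabled explicitly. A minimal `spark-defaults.conf` fragment (property names are taken from the issue title and description; `true` is the proposed new default):

```
spark.sql.orc.enableNestedColumnVectorizedReader      true
spark.sql.parquet.enableNestedColumnVectorizedReader  true
```

Both settings can equally be passed per-session, e.g. via `--conf` on `spark-submit`.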
[jira] [Commented] (SPARK-41918) Refine the naming in proto messages
[ https://issues.apache.org/jira/browse/SPARK-41918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1764#comment-1764 ] Rui Wang commented on SPARK-41918: -- I did some tests locally and find something as the below: If I rename a field, of course the code that access the field must be updated. Then in terms of backwards compatibility, the client uses old named field can talk to the server uses the new named field without a problem. also in terms of forwards compatibility, it works nicely. So now probably I know it better: renaming fields only require to recompile the code after that binaries are supposed to work as before. > Refine the naming in proto messages > --- > > Key: SPARK-41918 > URL: https://issues.apache.org/jira/browse/SPARK-41918 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > normally, we name the fields after the corresponding LogiclalPlan or > DataFrame API, but they are not consistent in protos, for example, the column > name: > {code:java} > message UnresolvedRegex { > // (Required) The column name used to extract column with regex. > string col_name = 1; > } > {code} > {code:java} > message Alias { > // (Required) The expression that alias will be added on. > Expression expr = 1; > // (Required) a list of name parts for the alias. > // > // Scalar columns only has one name that presents. > repeated string name = 2; > // (Optional) Alias metadata expressed as a JSON map. > optional string metadata = 3; > } > {code} > {code:java} > // Relation of type [[Deduplicate]] which have duplicate rows removed, could > consider either only > // the subset of columns or all the columns. > message Deduplicate { > // (Required) Input relation for a Deduplicate. > Relation input = 1; > // (Optional) Deduplicate based on a list of column names. > // > // This field does not co-use with `all_columns_as_keys`. 
> repeated string column_names = 2; > // (Optional) Deduplicate based on all the columns of the input relation. > // > // This field does not co-use with `column_names`. > optional bool all_columns_as_keys = 3; > } > {code} > {code:java} > // Computes basic statistics for numeric and string columns, including count, > mean, stddev, min, > // and max. If no columns are given, this function computes statistics for > all numerical or > // string columns. > message StatDescribe { > // (Required) The input relation. > Relation input = 1; > // (Optional) Columns to compute statistics on. > repeated string cols = 2; > } > {code} > we probably should unify the naming: > single column -> `column` > multi columns -> `columns` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41820) DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement failed
[ https://issues.apache.org/jira/browse/SPARK-41820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41820: -- Description: {code:java} >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", >>> "name"]) >>> df.createOrReplaceGlobalTempView("people") {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1292, in pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView Failed example: df2.createOrReplaceGlobalTempView("people") Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df2.createOrReplaceGlobalTempView("people") File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1192, in createOrReplaceGlobalTempView self._session.client.execute_command(command) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 459, in execute_command self._execute(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 547, in _execute self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 625, in _handle_error raise SparkConnectException(status.message) from None pyspark.sql.connect.client.SparkConnectException: requirement failed {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1292, in pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView Failed example: df2.createOrReplaceGlobalTempView("people") Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, 
"single", File "", line 1, in df2.createOrReplaceGlobalTempView("people") File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1192, in createOrReplaceGlobalTempView self._session.client.execute_command(command) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 459, in execute_command self._execute(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 547, in _execute self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 625, in _handle_error raise SparkConnectException(status.message) from None pyspark.sql.connect.client.SparkConnectException: requirement failed {code} > DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement > failed > --- > > Key: SPARK-41820 > URL: https://issues.apache.org/jira/browse/SPARK-41820 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", > >>> "name"]) > >>> df.createOrReplaceGlobalTempView("people") {code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1292, in > pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView > Failed example: > df2.createOrReplaceGlobalTempView("people") > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView[3]>", > line 1, in > df2.createOrReplaceGlobalTempView("people") > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1192, in createOrReplaceGlobalTempView > 
self._session.client.execute_command(command) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 459, in execute_command > self._execute(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 547, in _execute > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 625, in _handle_error > raise SparkConnectException(statu
[jira] [Comment Edited] (SPARK-41918) Refine the naming in proto messages
[ https://issues.apache.org/jira/browse/SPARK-41918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655545#comment-17655545 ] Rui Wang edited comment on SPARK-41918 at 1/6/23 6:35 PM: -- [~grundprinzip-db] I am a bit confused on the renaming and what compatibility it offers: ``` message Foo { int a = 1; } ``` On the receiver side it access the a val t = foo.a + 1 then we allow rename field ``` message Foo { int b = 1; } ``` Any renaming will break the receiver side's code? Do I misunderstand `WIRE compatibility` that the receiver should be able to read the output after the wire? was (Author: amaliujia): [~grundprinzip-db] I am a bit confused on the renaming and what compatibility it offers: ``` message Foo { int a = 1; } ``` On the receiver side it access the a val t = foo.a + 1 Any renaming will break the receiver side's code? Do I misunderstand `WIRE compatibility` that the receiver should be able to read the output after the wire? > Refine the naming in proto messages > --- > > Key: SPARK-41918 > URL: https://issues.apache.org/jira/browse/SPARK-41918 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > normally, we name the fields after the corresponding LogiclalPlan or > DataFrame API, but they are not consistent in protos, for example, the column > name: > {code:java} > message UnresolvedRegex { > // (Required) The column name used to extract column with regex. > string col_name = 1; > } > {code} > {code:java} > message Alias { > // (Required) The expression that alias will be added on. > Expression expr = 1; > // (Required) a list of name parts for the alias. > // > // Scalar columns only has one name that presents. > repeated string name = 2; > // (Optional) Alias metadata expressed as a JSON map. 
> optional string metadata = 3; > } > {code} > {code:java} > // Relation of type [[Deduplicate]] which have duplicate rows removed, could > consider either only > // the subset of columns or all the columns. > message Deduplicate { > // (Required) Input relation for a Deduplicate. > Relation input = 1; > // (Optional) Deduplicate based on a list of column names. > // > // This field does not co-use with `all_columns_as_keys`. > repeated string column_names = 2; > // (Optional) Deduplicate based on all the columns of the input relation. > // > // This field does not co-use with `column_names`. > optional bool all_columns_as_keys = 3; > } > {code} > {code:java} > // Computes basic statistics for numeric and string columns, including count, > mean, stddev, min, > // and max. If no columns are given, this function computes statistics for > all numerical or > // string columns. > message StatDescribe { > // (Required) The input relation. > Relation input = 1; > // (Optional) Columns to compute statistics on. > repeated string cols = 2; > } > {code} > we probably should unify the naming: > single column -> `column` > multi columns -> `columns` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41918) Refine the naming in proto messages
[ https://issues.apache.org/jira/browse/SPARK-41918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655545#comment-17655545 ] Rui Wang commented on SPARK-41918: -- [~grundprinzip-db] I am a bit confused on the renaming and what compatibility it offers: ``` message Foo { int a = 1; } ``` On the receiver side it access the a val t = foo.a + 1 Any renaming will break the receiver side's code? Do I misunderstand `WIRE compatibility` that the receiver should be able to read the output after the wire? > Refine the naming in proto messages > --- > > Key: SPARK-41918 > URL: https://issues.apache.org/jira/browse/SPARK-41918 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > normally, we name the fields after the corresponding LogiclalPlan or > DataFrame API, but they are not consistent in protos, for example, the column > name: > {code:java} > message UnresolvedRegex { > // (Required) The column name used to extract column with regex. > string col_name = 1; > } > {code} > {code:java} > message Alias { > // (Required) The expression that alias will be added on. > Expression expr = 1; > // (Required) a list of name parts for the alias. > // > // Scalar columns only has one name that presents. > repeated string name = 2; > // (Optional) Alias metadata expressed as a JSON map. > optional string metadata = 3; > } > {code} > {code:java} > // Relation of type [[Deduplicate]] which have duplicate rows removed, could > consider either only > // the subset of columns or all the columns. > message Deduplicate { > // (Required) Input relation for a Deduplicate. > Relation input = 1; > // (Optional) Deduplicate based on a list of column names. > // > // This field does not co-use with `all_columns_as_keys`. > repeated string column_names = 2; > // (Optional) Deduplicate based on all the columns of the input relation. 
> // > // This field does not co-use with `column_names`. > optional bool all_columns_as_keys = 3; > } > {code} > {code:java} > // Computes basic statistics for numeric and string columns, including count, > mean, stddev, min, > // and max. If no columns are given, this function computes statistics for > all numerical or > // string columns. > message StatDescribe { > // (Required) The input relation. > Relation input = 1; > // (Optional) Columns to compute statistics on. > repeated string cols = 2; > } > {code} > we probably should unify the naming: > single column -> `column` > multi columns -> `columns` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
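The wire-compatibility question above comes down to how protobuf identifies fields: only the field *number* travels on the wire, never the name, so renaming `a` to `b` is wire-compatible but breaks generated-code callers until they are recompiled. A hypothetical pure-Python sketch of that keyed-by-number encoding (a toy model, not real protobuf code):

```python
# Toy model of protobuf's wire format: a message is serialized as
# (field_number, value) pairs, so the field *name* never reaches the wire.

def serialize(message: dict) -> list:
    # `message` maps field_number -> value, mirroring `int a = 1;`
    return sorted(message.items())

def deserialize(wire: list) -> dict:
    return dict(wire)

# Old schema: `message Foo { int a = 1; }` -- sender keys the value by number 1.
wire_bytes = serialize({1: 41})

# New schema: `message Foo { int b = 1; }` -- receiver also looks up number 1,
# so the payload round-trips; only the source-level accessor name changed
# (`foo.a + 1` must be rewritten as `foo.b + 1` and recompiled).
foo = deserialize(wire_bytes)
t = foo[1] + 1

assert t == 42
```

This is why the naming cleanup proposed in the issue description (`column` for a single column, `columns` for many) is safe on the wire as long as field numbers are left untouched.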
[jira] [Assigned] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41898: Assignee: (was: Apache Spark) > Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as > argument > > > Key: SPARK-41898 > URL: https://issues.apache.org/jira/browse/SPARK-41898 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], > ["key", "value"]) > w = Window.partitionBy("value").orderBy("key") > from pyspark.sql import functions as F > sel = df.select( > df.value, > df.key, > F.max("key").over(w.rowsBetween(0, 1)), > F.min("key").over(w.rowsBetween(0, 1)), > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), > F.row_number().over(w), > F.rank().over(w), > F.dense_rank().over(w), > F.ntile(2).over(w), > ) > rs = sorted(sel.collect()){code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 821, in test_window_functions > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", > line 152, in rowsBetween raise TypeError(f"start must be a int, but got > {type(start).__name__}") TypeError: start must be a int, but got float {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41898: Assignee: Apache Spark > Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as > argument > > > Key: SPARK-41898 > URL: https://issues.apache.org/jira/browse/SPARK-41898 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > > {code:java} > df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], > ["key", "value"]) > w = Window.partitionBy("value").orderBy("key") > from pyspark.sql import functions as F > sel = df.select( > df.value, > df.key, > F.max("key").over(w.rowsBetween(0, 1)), > F.min("key").over(w.rowsBetween(0, 1)), > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), > F.row_number().over(w), > F.rank().over(w), > F.dense_rank().over(w), > F.ntile(2).over(w), > ) > rs = sorted(sel.collect()){code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 821, in test_window_functions > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", > line 152, in rowsBetween raise TypeError(f"start must be a int, but got > {type(start).__name__}") TypeError: start must be a int, but got float {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655539#comment-17655539 ] Apache Spark commented on SPARK-41898: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39433 > Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as > argument > > > Key: SPARK-41898 > URL: https://issues.apache.org/jira/browse/SPARK-41898 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], > ["key", "value"]) > w = Window.partitionBy("value").orderBy("key") > from pyspark.sql import functions as F > sel = df.select( > df.value, > df.key, > F.max("key").over(w.rowsBetween(0, 1)), > F.min("key").over(w.rowsBetween(0, 1)), > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), > F.row_number().over(w), > F.rank().over(w), > F.dense_rank().over(w), > F.ntile(2).over(w), > ) > rs = sorted(sel.collect()){code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 821, in test_window_functions > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", > line 152, in rowsBetween raise TypeError(f"start must be a int, but got > {type(start).__name__}") TypeError: start must be a int, but got float {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
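One way for the Connect implementation to match classic PySpark behavior is to normalize infinite float bounds to the unbounded sentinels before the integer type check. The helper below is a hypothetical sketch, not the actual patch; the sentinel constants assume JVM long min/max, mirroring `Window.unboundedPreceding`/`Window.unboundedFollowing`:

```python
# Hypothetical sketch of normalizing window-frame bounds so that
# float("-inf")/float("inf") are accepted as unbounded, as the classic
# PySpark Window API tolerates. Sentinel values assume JVM long min/max.

JVM_LONG_MIN = -(1 << 63)     # stands in for Window.unboundedPreceding
JVM_LONG_MAX = (1 << 63) - 1  # stands in for Window.unboundedFollowing

def normalize_bound(bound):
    """Map infinite floats (and out-of-range ints) to the unbounded sentinels."""
    if bound == float("-inf") or bound <= JVM_LONG_MIN:
        return JVM_LONG_MIN
    if bound == float("inf") or bound >= JVM_LONG_MAX:
        return JVM_LONG_MAX
    if not isinstance(bound, int):
        # Finite floats are still rejected, as in the reported error message.
        raise TypeError(f"start must be a int, but got {type(bound).__name__}")
    return bound

assert normalize_bound(float("-inf")) == JVM_LONG_MIN
assert normalize_bound(float("inf")) == JVM_LONG_MAX
assert normalize_bound(3) == 3
```

With such a normalization applied first, `w.rowsBetween(float("-inf"), float("inf"))` from the reproduction above would translate to an unbounded frame instead of raising `TypeError`.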
[jira] [Commented] (SPARK-41911) Add version fields to Connect proto
[ https://issues.apache.org/jira/browse/SPARK-41911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655537#comment-17655537 ] Rui Wang commented on SPARK-41911: -- We will know better where we need the versions during the process of defining the compatibility requirement. > Add version fields to Connect proto > --- > > Key: SPARK-41911 > URL: https://issues.apache.org/jira/browse/SPARK-41911 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > > We may need this to help maintain compatibility. Depending on the concrete > protocol design, we may use field number 1 for version fields thus may cause > breaking changes on existing proto messages.
[jira] [Commented] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
[ https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655532#comment-17655532 ] L. C. Hsieh commented on SPARK-41049: - For a correctness bug, I think we should backport it, though the patch is a kind of refactoring work. > Nondeterministic expressions have unstable values if they are children of > CodegenFallback expressions > - > > Key: SPARK-41049 > URL: https://issues.apache.org/jira/browse/SPARK-41049 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Guy Boo >Assignee: Wenchen Fan >Priority: Major > Labels: correctness > Fix For: 3.4.0 > > > h2. Expectation > For a given row, Nondeterministic expressions are expected to have stable > values. > {code:scala} > import org.apache.spark.sql.functions._ > val df = sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > df.select(v1, v1).collect{code} > Returns a set like this: > |8777|8777| > |1357|1357| > |3435|3435| > |9204|9204| > |3870|3870| > where both columns always have the same value, but what that value is changes > from row to row. This is different from the following: > {code:scala} > df.select(rand(), rand()).collect{code} > In this case, because the rand() calls are distinct, the values in both > columns should be different. > h2. Problem > This expectation does not appear to be stable in the event that any > subsequent expression is a CodegenFallback. 
This program: > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > val sparkSession = SparkSession.builder().getOrCreate() > val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback > df.select(v1, v1, v2, v2).collect {code} > produces output like this: > |8159|8159|8159|{color:#ff}2028{color}| > |8320|8320|8320|{color:#ff}1640{color}| > |7937|7937|7937|{color:#ff}769{color}| > |436|436|436|{color:#ff}8924{color}| > |8924|8924|2827|{color:#ff}2731{color}| > Not sure why the first call via the CodegenFallback path should be correct > while subsequent calls aren't. > h2. Workaround > If the Nondeterministic expression is moved to a separate, earlier select() > call, so the CodegenFallback instead only refers to a column reference, then > the problem seems to go away. But this workaround may not be reliable if > optimization is ever able to restructure adjacent select()s. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
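The workaround in the report above can be illustrated with a minimal plain-Python model. This is a hedged sketch, not Spark code: `rand_expr`, `buggy_row`, and `fixed_row` are hypothetical names, and ordinary function calls stand in for Spark's per-row expression evaluation. Materializing the nondeterministic value once and only referencing the result afterward mirrors projecting it in an earlier select(), so the CodegenFallback expression sees a column reference rather than the expression itself.

```python
import random

# Stand-in for a nondeterministic expression such as rand():
# every evaluation produces a fresh value.
def rand_expr():
    return random.randint(0, 9999)

# Bug analogue: a fallback path re-evaluates the expression for each
# reference, so two references to "the same" column can disagree.
def buggy_row():
    return (rand_expr(), rand_expr())

# Workaround analogue: materialize the value once (the earlier select()),
# then have every downstream reference read the materialized result.
def fixed_row():
    v1 = rand_expr()   # evaluated exactly once per "row"
    return (v1, v1)    # all references observe the same value

a, b = fixed_row()
assert a == b          # stable within a row, as the issue expects
```

As the reporter notes, this workaround leans on the two select()s staying separate; an optimizer that collapses adjacent projections could reintroduce the re-evaluation.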
[jira] [Commented] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
[ https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655516#comment-17655516 ] Erik Krogen commented on SPARK-41049: - Thanks! [~cloud_fan] [~viirya] shall we backport this to branch-3.3 and branch-3.2, given it is a correctness bug? > Nondeterministic expressions have unstable values if they are children of > CodegenFallback expressions > - > > Key: SPARK-41049 > URL: https://issues.apache.org/jira/browse/SPARK-41049 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Guy Boo >Assignee: Wenchen Fan >Priority: Major > Labels: correctness > Fix For: 3.4.0 > > > h2. Expectation > For a given row, Nondeterministic expressions are expected to have stable > values. > {code:scala} > import org.apache.spark.sql.functions._ > val df = sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > df.select(v1, v1).collect{code} > Returns a set like this: > |8777|8777| > |1357|1357| > |3435|3435| > |9204|9204| > |3870|3870| > where both columns always have the same value, but what that value is changes > from row to row. This is different from the following: > {code:scala} > df.select(rand(), rand()).collect{code} > In this case, because the rand() calls are distinct, the values in both > columns should be different. > h2. Problem > This expectation does not appear to be stable in the event that any > subsequent expression is a CodegenFallback. 
This program: > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > val sparkSession = SparkSession.builder().getOrCreate() > val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback > df.select(v1, v1, v2, v2).collect {code} > produces output like this: > |8159|8159|8159|{color:#ff}2028{color}| > |8320|8320|8320|{color:#ff}1640{color}| > |7937|7937|7937|{color:#ff}769{color}| > |436|436|436|{color:#ff}8924{color}| > |8924|8924|2827|{color:#ff}2731{color}| > Not sure why the first call via the CodegenFallback path should be correct > while subsequent calls aren't. > h2. Workaround > If the Nondeterministic expression is moved to a separate, earlier select() > call, so the CodegenFallback instead only refers to a column reference, then > the problem seems to go away. But this workaround may not be reliable if > optimization is ever able to restructure adjacent select()s. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41890) Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/`SparkPlanGraphWrapperSerializer` for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-41890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-41890. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39399 [https://github.com/apache/spark/pull/39399] > Reduce `toSeq` in > `RDDOperationGraphWrapperSerializer`/`SparkPlanGraphWrapperSerializer` for > Scala 2.13 > -- > > Key: SPARK-41890 > URL: https://issues.apache.org/jira/browse/SPARK-41890 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL, Web UI >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > Similar work to SPARK-41709 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41800) Upgrade commons-compress to 1.22
[ https://issues.apache.org/jira/browse/SPARK-41800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-41800: Assignee: BingKun Pan > Upgrade commons-compress to 1.22 > > > Key: SPARK-41800 > URL: https://issues.apache.org/jira/browse/SPARK-41800 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41800) Upgrade commons-compress to 1.22
[ https://issues.apache.org/jira/browse/SPARK-41800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-41800. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39326 [https://github.com/apache/spark/pull/39326] > Upgrade commons-compress to 1.22 > > > Key: SPARK-41800 > URL: https://issues.apache.org/jira/browse/SPARK-41800 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41890) Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/`SparkPlanGraphWrapperSerializer` for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-41890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-41890: Assignee: Yang Jie > Reduce `toSeq` in > `RDDOperationGraphWrapperSerializer`/`SparkPlanGraphWrapperSerializer` for > Scala 2.13 > -- > > Key: SPARK-41890 > URL: https://issues.apache.org/jira/browse/SPARK-41890 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL, Web UI >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > Similar work to SPARK-41709 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655442#comment-17655442 ] Hyukjin Kwon commented on SPARK-41824: -- It's actually an implementation detail in PySpark. It would be difficult to make them match. Let's either fix the test to be compatible with both cases, or simply skip it with {{doctest: +SKIP}} > Implement DataFrame.explain format to be similar to PySpark > --- > > Key: SPARK-41824 > URL: https://issues.apache.org/jira/browse/SPARK-41824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", > "name"]) > df.explain() > df.explain(True) > df.explain(mode="formatted") > df.explain("cost"){code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain() > Expected: > == Physical Plan == > *(1) Scan ExistingRDD[age...,name...] > Got: > == Physical Plan == > LocalTableScan [age#1148L, name#1149] > > > ** > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain(mode="formatted") > Expected: > == Physical Plan == > * Scan ExistingRDD (...) > (1) Scan ExistingRDD [codegen id : ...] > Output [2]: [age..., name...] > ... > Got: > == Physical Plan == > LocalTableScan (1) > > > (1) LocalTableScan > Output [2]: [age#1170L, name#1171] > Arguments: [age#1170L, name#1171] > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
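The {{doctest: +SKIP}} suggestion above can be demonstrated with the standard-library doctest module alone. This is a hedged sketch: `explain_demo` is a made-up function, not part of PySpark, and its docstring example stands in for the environment-dependent explain() output. The +SKIP directive tells the runner not to execute the example at all, so its expected output is never compared.

```python
import doctest

def explain_demo():
    """A hypothetical doctest whose output varies by environment.

    >>> print("== Physical Plan ==")  # doctest: +SKIP
    output that differs between PySpark and Spark Connect
    """

# Run just this docstring: the SKIP-ed example is neither executed nor
# counted, so it can never fail regardless of what plan a backend prints.
runner = doctest.DocTestRunner()
for test in doctest.DocTestFinder().find(explain_demo):
    runner.run(test)
assert runner.failures == 0 and runner.tries == 0
```

The alternative the comment mentions, loosening the expected output with `...` under `ELLIPSIS`, keeps some checking; `+SKIP` trades that coverage for stability.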
[jira] [Assigned] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
[ https://issues.apache.org/jira/browse/SPARK-41924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41924: Assignee: Apache Spark > Make StructType support metadata and Implement `DataFrame.withMetadata` > --- > > Key: SPARK-41924 > URL: https://issues.apache.org/jira/browse/SPARK-41924 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
[ https://issues.apache.org/jira/browse/SPARK-41924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41924: Assignee: (was: Apache Spark) > Make StructType support metadata and Implement `DataFrame.withMetadata` > --- > > Key: SPARK-41924 > URL: https://issues.apache.org/jira/browse/SPARK-41924 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
[ https://issues.apache.org/jira/browse/SPARK-41924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655436#comment-17655436 ] Apache Spark commented on SPARK-41924: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39432 > Make StructType support metadata and Implement `DataFrame.withMetadata` > --- > > Key: SPARK-41924 > URL: https://issues.apache.org/jira/browse/SPARK-41924 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41924) Make StructType support metadata and Implement `DataFrame.withMetadata`
Ruifeng Zheng created SPARK-41924: - Summary: Make StructType support metadata and Implement `DataFrame.withMetadata` Key: SPARK-41924 URL: https://issues.apache.org/jira/browse/SPARK-41924 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655434#comment-17655434 ] jiaan.geng edited comment on SPARK-41824 at 1/6/23 12:56 PM: - I found the Scala API Dataset.explain prints the same output as connect.dataframe. {code:java} == Physical Plan == *(1) Project [_1#x AS age#x, _2#x AS name#x] +- *(1) LocalTableScan [_1#x, _2#x] {code} So, do we need to follow the behavior of the PySpark or the Scala API? was (Author: beliefer): I found the Scala API Dataset.explain prints the same output as connect.dataframe. {code:java} == Physical Plan == *(1) Project [_1#x AS age#x, _2#x AS name#x] +- *(1) LocalTableScan [_1#x, _2#x] {code} > Implement DataFrame.explain format to be similar to PySpark > --- > > Key: SPARK-41824 > URL: https://issues.apache.org/jira/browse/SPARK-41824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", > "name"]) > df.explain() > df.explain(True) > df.explain(mode="formatted") > df.explain("cost"){code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain() > Expected: > == Physical Plan == > *(1) Scan ExistingRDD[age...,name...] > Got: > == Physical Plan == > LocalTableScan [age#1148L, name#1149] > > > ** > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain(mode="formatted") > Expected: > == Physical Plan == > * Scan ExistingRDD (...) > (1) Scan ExistingRDD [codegen id : ...] > Output [2]: [age..., name...] > ... 
> Got: > == Physical Plan == > LocalTableScan (1) > > > (1) LocalTableScan > Output [2]: [age#1170L, name#1171] > Arguments: [age#1170L, name#1171] > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655434#comment-17655434 ] jiaan.geng edited comment on SPARK-41824 at 1/6/23 12:56 PM: - I found the Scala API Dataset.explain prints the same output as connect.dataframe. {code:java} == Physical Plan == *(1) Project [_1#x AS age#x, _2#x AS name#x] +- *(1) LocalTableScan [_1#x, _2#x] {code} So, do we need to follow the behavior of the PySpark or the Scala API? cc [~gurwls223] was (Author: beliefer): I found the Scala API Dataset.explain prints the same output as connect.dataframe. {code:java} == Physical Plan == *(1) Project [_1#x AS age#x, _2#x AS name#x] +- *(1) LocalTableScan [_1#x, _2#x] {code} So, do we need to follow the behavior of the PySpark or the Scala API? > Implement DataFrame.explain format to be similar to PySpark > --- > > Key: SPARK-41824 > URL: https://issues.apache.org/jira/browse/SPARK-41824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", > "name"]) > df.explain() > df.explain(True) > df.explain(mode="formatted") > df.explain("cost"){code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain() > Expected: > == Physical Plan == > *(1) Scan ExistingRDD[age...,name...] > Got: > == Physical Plan == > LocalTableScan [age#1148L, name#1149] > > > ** > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain > Failed example: > df.explain(mode="formatted") > Expected: > == Physical Plan == > * Scan ExistingRDD (...) > (1) Scan ExistingRDD [codegen id : ...] > Output [2]: [age..., name...] > ... 
> Got: > == Physical Plan == > LocalTableScan (1) > > > (1) LocalTableScan > Output [2]: [age#1170L, name#1171] > Arguments: [age#1170L, name#1171] > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org