[jira] [Commented] (SPARK-40016) Remove unnecessary TryEval in TrySum
[ https://issues.apache.org/jira/browse/SPARK-40016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577203#comment-17577203 ]

Apache Spark commented on SPARK-40016:
--------------------------------------

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37446

> Remove unnecessary TryEval in TrySum
> ------------------------------------
>
> Key: SPARK-40016
> URL: https://issues.apache.org/jira/browse/SPARK-40016
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Minor
>
> Remove unnecessary TryEval in TrySum for simplicity.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40016) Remove unnecessary TryEval in TrySum
[ https://issues.apache.org/jira/browse/SPARK-40016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40016:
------------------------------------

    Assignee: Gengliang Wang  (was: Apache Spark)

> Remove unnecessary TryEval in TrySum
> ------------------------------------
>
> Key: SPARK-40016
> URL: https://issues.apache.org/jira/browse/SPARK-40016
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Minor
>
> Remove unnecessary TryEval in TrySum for simplicity.
[jira] [Assigned] (SPARK-40016) Remove unnecessary TryEval in TrySum
[ https://issues.apache.org/jira/browse/SPARK-40016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40016:
------------------------------------

    Assignee: Apache Spark  (was: Gengliang Wang)

> Remove unnecessary TryEval in TrySum
> ------------------------------------
>
> Key: SPARK-40016
> URL: https://issues.apache.org/jira/browse/SPARK-40016
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Assignee: Apache Spark
> Priority: Minor
>
> Remove unnecessary TryEval in TrySum for simplicity.
[jira] [Created] (SPARK-40016) Remove unnecessary TryEval in TrySum
Gengliang Wang created SPARK-40016:
-----------------------------------

             Summary: Remove unnecessary TryEval in TrySum
                 Key: SPARK-40016
                 URL: https://issues.apache.org/jira/browse/SPARK-40016
             Project: Spark
          Issue Type: Task
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Gengliang Wang
            Assignee: Gengliang Wang

Remove unnecessary TryEval in TrySum for simplicity.
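For context, TrySum backs the SQL `try_sum` aggregate, which returns NULL instead of raising an error when the sum overflows under ANSI mode. A minimal plain-Python sketch of that semantics (this is an illustration only, not Spark's implementation; the per-step 64-bit bound check is an assumption made for the sketch):

```python
# Illustrative semantics of try_sum: ignore NULLs, return None (SQL NULL)
# instead of raising when the running total leaves the signed 64-bit range.
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1

def try_sum(values):
    total, seen = 0, False
    for v in values:
        if v is None:            # SUM skips SQL NULLs
            continue
        seen = True
        total += v
        if not (LONG_MIN <= total <= LONG_MAX):
            return None          # overflow -> NULL rather than an error
    return total if seen else None  # SUM over no rows is NULL
```

For example, `try_sum([2**63 - 1, 1])` yields `None` where an ANSI-mode `sum` would raise an arithmetic overflow error.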
[jira] [Assigned] (SPARK-40015) Add sc.listArchives and sc.listFiles to PySpark
[ https://issues.apache.org/jira/browse/SPARK-40015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40015:
------------------------------------

    Assignee: Apache Spark

> Add sc.listArchives and sc.listFiles to PySpark
> -----------------------------------------------
>
> Key: SPARK-40015
> URL: https://issues.apache.org/jira/browse/SPARK-40015
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Apache Spark
> Priority: Major
[jira] [Assigned] (SPARK-40015) Add sc.listArchives and sc.listFiles to PySpark
[ https://issues.apache.org/jira/browse/SPARK-40015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40015:
------------------------------------

    Assignee: (was: Apache Spark)

> Add sc.listArchives and sc.listFiles to PySpark
> -----------------------------------------------
>
> Key: SPARK-40015
> URL: https://issues.apache.org/jira/browse/SPARK-40015
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Priority: Major
[jira] [Commented] (SPARK-40015) Add sc.listArchives and sc.listFiles to PySpark
[ https://issues.apache.org/jira/browse/SPARK-40015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577188#comment-17577188 ]

Apache Spark commented on SPARK-40015:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37445

> Add sc.listArchives and sc.listFiles to PySpark
> -----------------------------------------------
>
> Key: SPARK-40015
> URL: https://issues.apache.org/jira/browse/SPARK-40015
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Priority: Major
[jira] [Created] (SPARK-40015) Add sc.listArchives and sc.listFiles to PySpark
Ruifeng Zheng created SPARK-40015:
----------------------------------

             Summary: Add sc.listArchives and sc.listFiles to PySpark
                 Key: SPARK-40015
                 URL: https://issues.apache.org/jira/browse/SPARK-40015
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Ruifeng Zheng
[jira] [Updated] (SPARK-39752) Spark job failed with 10M rows data with Broken pipe error
[ https://issues.apache.org/jira/browse/SPARK-39752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHOBHIT SHUKLA updated SPARK-39752:
-----------------------------------

    Attachment: (was: Failed_spark_job_3.0.3.txt)

> Spark job failed with 10M rows data with Broken pipe error
> ----------------------------------------------------------
>
> Key: SPARK-39752
> URL: https://issues.apache.org/jira/browse/SPARK-39752
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 3.0.3, 3.2.1
> Reporter: SHOBHIT SHUKLA
> Priority: Major
> Fix For: 3.0.2
>
> A Spark job over 10M rows of data failed with a Broken pipe error. The same
> job previously worked with the settings "executor_cores": 1,
> "executor_memory": 1, "driver_cores": 1, "driver_memory": 1, whereas it now
> fails with those settings on 3.0.3 and 3.2.1.
> Major symptoms (slowness, timeout, out of memory as examples): the job fails
> with java.net.SocketException: Broken pipe (Write failed).
> Here are the settings that work on Spark 3.0.3 and 3.2.1: "executor_cores": 4,
> "executor_memory": 4, "driver_cores": 4, "driver_memory": 4. Even with these
> settings the job does not work consistently; sometimes the cores and memory
> need to be increased further.
[jira] [Updated] (SPARK-39752) Spark job failed with 10M rows data with Broken pipe error
[ https://issues.apache.org/jira/browse/SPARK-39752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHOBHIT SHUKLA updated SPARK-39752:
-----------------------------------

    Attachment: (was: spark_job_success_3.0.2.txt)

> Spark job failed with 10M rows data with Broken pipe error
> ----------------------------------------------------------
>
> Key: SPARK-39752
> URL: https://issues.apache.org/jira/browse/SPARK-39752
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 3.0.3, 3.2.1
> Reporter: SHOBHIT SHUKLA
> Priority: Major
> Fix For: 3.0.2
>
> A Spark job over 10M rows of data failed with a Broken pipe error. The same
> job previously worked with the settings "executor_cores": 1,
> "executor_memory": 1, "driver_cores": 1, "driver_memory": 1, whereas it now
> fails with those settings on 3.0.3 and 3.2.1.
> Major symptoms (slowness, timeout, out of memory as examples): the job fails
> with java.net.SocketException: Broken pipe (Write failed).
> Here are the settings that work on Spark 3.0.3 and 3.2.1: "executor_cores": 4,
> "executor_memory": 4, "driver_cores": 4, "driver_memory": 4. Even with these
> settings the job does not work consistently; sometimes the cores and memory
> need to be increased further.
[jira] [Commented] (SPARK-38699) Use error classes in the execution errors of dictionary encoding
[ https://issues.apache.org/jira/browse/SPARK-38699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577170#comment-17577170 ]

Goutam Ghosh commented on SPARK-38699:
--------------------------------------

[~maxgekk] can you please review the comments on pull request
https://github.com/apache/spark/pull/37065 and advise whether I should remove
the assertion and use the error classes for this change?

> Use error classes in the execution errors of dictionary encoding
> ----------------------------------------------------------------
>
> Key: SPARK-38699
> URL: https://issues.apache.org/jira/browse/SPARK-38699
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * useDictionaryEncodingWhenDictionaryOverflowError
> onto error classes. Throw an implementation of SparkThrowable. Also write
> a test for every error in QueryExecutionErrorsSuite.
[jira] [Resolved] (SPARK-39863) Upgrade Hadoop to 3.3.4
[ https://issues.apache.org/jira/browse/SPARK-39863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-39863.
-----------------------------------

    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37281
[https://github.com/apache/spark/pull/37281]

> Upgrade Hadoop to 3.3.4
> ------------------------
>
> Key: SPARK-39863
> URL: https://issues.apache.org/jira/browse/SPARK-39863
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.4.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Fix For: 3.4.0
>
> This JIRA tracks the progress of upgrading the Hadoop dependency to 3.3.4
[jira] [Assigned] (SPARK-39863) Upgrade Hadoop to 3.3.4
[ https://issues.apache.org/jira/browse/SPARK-39863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-39863:
-------------------------------------

    Assignee: Chao Sun

> Upgrade Hadoop to 3.3.4
> ------------------------
>
> Key: SPARK-39863
> URL: https://issues.apache.org/jira/browse/SPARK-39863
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.4.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
>
> This JIRA tracks the progress of upgrading the Hadoop dependency to 3.3.4
[jira] [Updated] (SPARK-40014) Support cast of decimals to ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-40014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk updated SPARK-40014:
-----------------------------

    Description: Support casts of decimal to ANSI intervals, and preserve the
fractional parts of seconds in the casts.  (was: Support casts of ANSI
intervals to decimal, and preserve the fractional parts of seconds in the
casts.)

> Support cast of decimals to ANSI intervals
> ------------------------------------------
>
> Key: SPARK-40014
> URL: https://issues.apache.org/jira/browse/SPARK-40014
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Fix For: 3.4.0
>
> Support casts of decimal to ANSI intervals, and preserve the fractional parts
> of seconds in the casts.
[jira] [Created] (SPARK-40014) Support cast of decimals to ANSI intervals
Max Gekk created SPARK-40014:
-----------------------------

             Summary: Support cast of decimals to ANSI intervals
                 Key: SPARK-40014
                 URL: https://issues.apache.org/jira/browse/SPARK-40014
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Max Gekk
            Assignee: Max Gekk
             Fix For: 3.4.0

Support casts of ANSI intervals to decimal, and preserve the fractional parts
of seconds in the casts.
[jira] [Assigned] (SPARK-40014) Support cast of decimals to ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-40014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk reassigned SPARK-40014:
--------------------------------

    Assignee: (was: Max Gekk)

> Support cast of decimals to ANSI intervals
> ------------------------------------------
>
> Key: SPARK-40014
> URL: https://issues.apache.org/jira/browse/SPARK-40014
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Priority: Major
> Fix For: 3.4.0
>
> Support casts of decimal to ANSI intervals, and preserve the fractional parts
> of seconds in the casts.
[jira] [Updated] (SPARK-39470) Support cast of ANSI intervals to decimals
[ https://issues.apache.org/jira/browse/SPARK-39470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk updated SPARK-39470:
-----------------------------

        Parent: SPARK-27790
    Issue Type: Sub-task  (was: New Feature)

> Support cast of ANSI intervals to decimals
> ------------------------------------------
>
> Key: SPARK-39470
> URL: https://issues.apache.org/jira/browse/SPARK-39470
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Fix For: 3.4.0
>
> Support casts of ANSI intervals to decimal, and preserve the fractional parts
> of seconds in the casts.
[jira] [Updated] (SPARK-39451) Support casting intervals to integrals in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-39451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk updated SPARK-39451:
-----------------------------

        Parent: SPARK-27790
    Issue Type: Sub-task  (was: New Feature)

> Support casting intervals to integrals in ANSI mode
> ---------------------------------------------------
>
> Key: SPARK-39451
> URL: https://issues.apache.org/jira/browse/SPARK-39451
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2022-06-12 at 13.04.44.png
>
> To conform to the SQL standard, support casting of interval types to *INT;
> see the attached screenshot.
[jira] [Updated] (SPARK-40008) Support casting integrals to intervals in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk updated SPARK-40008:
-----------------------------

        Parent: SPARK-27790
    Issue Type: Sub-task  (was: New Feature)

> Support casting integrals to intervals in ANSI mode
> ---------------------------------------------------
>
> Key: SPARK-40008
> URL: https://issues.apache.org/jira/browse/SPARK-40008
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2022-06-12 at 13.04.44.png
>
> To conform to the SQL standard, support casting of interval types to *INT;
> see the attached screenshot.
[jira] [Resolved] (SPARK-40008) Support casting integrals to intervals in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk resolved SPARK-40008.
------------------------------

    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37442
[https://github.com/apache/spark/pull/37442]

> Support casting integrals to intervals in ANSI mode
> ---------------------------------------------------
>
> Key: SPARK-40008
> URL: https://issues.apache.org/jira/browse/SPARK-40008
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2022-06-12 at 13.04.44.png
>
> To conform to the SQL standard, support casting of interval types to *INT;
> see the attached screenshot.
[jira] [Resolved] (SPARK-40013) DS V2 expressions should have the default implementation of toString
[ https://issues.apache.org/jira/browse/SPARK-40013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jiaan.geng resolved SPARK-40013.
--------------------------------

    Resolution: Won't Fix

> DS V2 expressions should have the default implementation of toString
> --------------------------------------------------------------------
>
> Key: SPARK-40013
> URL: https://issues.apache.org/jira/browse/SPARK-40013
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: jiaan.geng
> Priority: Major
>
> Currently, V2 expressions are missing a default toString, which leads to
> unexpected results.
> We should add a default implementation in the base class Expression using
> ToStringSQLBuilder.
[jira] [Assigned] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40012:
------------------------------------

    Assignee: Apache Spark

> Make pyspark.sql.dataframe examples self-contained
> --------------------------------------------------
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 3.4.0
> Reporter: William Zijie Zhang
> Assignee: Apache Spark
> Priority: Major
[jira] [Commented] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577127#comment-17577127 ]

Apache Spark commented on SPARK-40012:
--------------------------------------

User 'Transurgeon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37444

> Make pyspark.sql.dataframe examples self-contained
> --------------------------------------------------
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 3.4.0
> Reporter: William Zijie Zhang
> Priority: Major
[jira] [Assigned] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40012:
------------------------------------

    Assignee: (was: Apache Spark)

> Make pyspark.sql.dataframe examples self-contained
> --------------------------------------------------
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 3.4.0
> Reporter: William Zijie Zhang
> Priority: Major
[jira] [Created] (SPARK-40013) DS V2 expressions should have the default implementation of toString
jiaan.geng created SPARK-40013:
-------------------------------

             Summary: DS V2 expressions should have the default implementation of toString
                 Key: SPARK-40013
                 URL: https://issues.apache.org/jira/browse/SPARK-40013
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: jiaan.geng

Currently, V2 expressions are missing a default toString, which leads to
unexpected results.

We should add a default implementation in the base class Expression using
ToStringSQLBuilder.
[jira] [Updated] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Zijie Zhang updated SPARK-40012:
----------------------------------------

    Priority: Major  (was: Minor)

> Make pyspark.sql.dataframe examples self-contained
> --------------------------------------------------
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 3.4.0
> Reporter: William Zijie Zhang
> Priority: Major
[jira] [Updated] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Zijie Zhang updated SPARK-40012:
----------------------------------------

    Component/s: PySpark

> Make pyspark.sql.dataframe examples self-contained
> --------------------------------------------------
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 3.4.0
> Reporter: William Zijie Zhang
> Priority: Minor
[jira] [Updated] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Zijie Zhang updated SPARK-40012:
----------------------------------------

    Affects Version/s: 3.4.0  (was: 3.3.0)

> Make pyspark.sql.dataframe examples self-contained
> --------------------------------------------------
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 3.4.0
> Reporter: William Zijie Zhang
> Priority: Minor
[jira] [Created] (SPARK-40012) Make pyspark.sql.group examples self-contained
William Zijie Zhang created SPARK-40012:
----------------------------------------

             Summary: Make pyspark.sql.group examples self-contained
                 Key: SPARK-40012
                 URL: https://issues.apache.org/jira/browse/SPARK-40012
             Project: Spark
          Issue Type: Sub-task
          Components: Documentation
    Affects Versions: 3.3.0
            Reporter: William Zijie Zhang
[jira] [Updated] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Zijie Zhang updated SPARK-40012:
----------------------------------------

    Summary: Make pyspark.sql.dataframe examples self-contained  (was: Make pyspark.sql.group examples self-contained)

> Make pyspark.sql.dataframe examples self-contained
> --------------------------------------------------
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 3.3.0
> Reporter: William Zijie Zhang
> Priority: Minor
[jira] [Resolved] (SPARK-39819) DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
[ https://issues.apache.org/jira/browse/SPARK-39819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-39819.
---------------------------------

    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37320
[https://github.com/apache/spark/pull/37320]

> DS V2 aggregate push down can work with Top N or Paging (Sort with group
> expressions)
> -------------------------------------------------------------------------
>
> Key: SPARK-39819
> URL: https://issues.apache.org/jira/browse/SPARK-39819
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: jiaan.geng
> Assignee: jiaan.geng
> Priority: Major
> Fix For: 3.4.0
>
> Currently, DS V2 aggregate push-down cannot work with Top N (order by ...
> limit ...) or Paging (order by ... limit ... offset ...).
> If it could work with Top N or Paging, performance would be better.
[jira] [Assigned] (SPARK-39819) DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
[ https://issues.apache.org/jira/browse/SPARK-39819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-39819:
-----------------------------------

    Assignee: jiaan.geng

> DS V2 aggregate push down can work with Top N or Paging (Sort with group
> expressions)
> -------------------------------------------------------------------------
>
> Key: SPARK-39819
> URL: https://issues.apache.org/jira/browse/SPARK-39819
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: jiaan.geng
> Assignee: jiaan.geng
> Priority: Major
>
> Currently, DS V2 aggregate push-down cannot work with Top N (order by ...
> limit ...) or Paging (order by ... limit ... offset ...).
> If it could work with Top N or Paging, performance would be better.
[jira] [Updated] (SPARK-40010) Make pyspark.sql.window examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-40010:
---------------------------------

    Summary: Make pyspark.sql.window examples self-contained  (was: Make pyspark.sql.windown examples self-contained)

> Make pyspark.sql.window examples self-contained
> -----------------------------------------------
>
> Key: SPARK-40010
> URL: https://issues.apache.org/jira/browse/SPARK-40010
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 3.4.0
> Reporter: Qian Sun
> Priority: Major
[jira] [Created] (SPARK-40011) Pandas API on Spark requires Pandas
Daniel Oakley created SPARK-40011:
----------------------------------

             Summary: Pandas API on Spark requires Pandas
                 Key: SPARK-40011
                 URL: https://issues.apache.org/jira/browse/SPARK-40011
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark
    Affects Versions: 3.3.0
            Reporter: Daniel Oakley
             Fix For: 3.3.1

Pandas API on Spark includes code like:

> import pandas as pd
> from pandas.api.types import is_hashable, is_list_like  # type: ignore[attr-defined]

This breaks if you don't have pandas installed on your Spark cluster. The
Pandas API on Spark was supposed to be an API, not a pandas integration, so
why does it require pandas to be installed? In many places Spark jobs may be
run on various Spark clusters with no assurance of particular Python packages
being installed at a root level.

Can this dependency be removed? Or can the required version of pandas be
bundled with the Spark distribution? The same applies to numpy and other
dependencies. If not, the docs should clearly state that it is not merely a
Spark API that mirrors the pandas API, but something quite different.
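One way a job can cope with the situation the ticket describes is to probe for the optional dependency before importing anything that pulls it in. A minimal, hedged sketch of that guard pattern (generic Python; `pyspark.pandas` is only mentioned in the comment as the module one would gate on):

```python
# Check whether an optional dependency (e.g. pandas, needed before importing
# pyspark.pandas) is installed, without actually importing it.
from importlib import util

def has_module(name: str) -> bool:
    """Return True if `name` is importable in this environment."""
    return util.find_spec(name) is not None
```

A caller would then branch on `has_module("pandas")` and skip the Pandas-API code path, or fail with a clear message, when the cluster lacks the package.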
[jira] [Resolved] (SPARK-40007) Add Mode to PySpark
[ https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-40007.
-----------------------------------

    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37438
[https://github.com/apache/spark/pull/37438]

> Add Mode to PySpark
> --------------------
>
> Key: SPARK-40007
> URL: https://issues.apache.org/jira/browse/SPARK-40007
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-40007) Add Mode to PySpark
[ https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-40007:
-------------------------------------

    Assignee: Ruifeng Zheng

> Add Mode to PySpark
> --------------------
>
> Key: SPARK-40007
> URL: https://issues.apache.org/jira/browse/SPARK-40007
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
[jira] [Created] (SPARK-40010) Make pyspark.sql.windown examples self-contained
Qian Sun created SPARK-40010:
-----------------------------

             Summary: Make pyspark.sql.windown examples self-contained
                 Key: SPARK-40010
                 URL: https://issues.apache.org/jira/browse/SPARK-40010
             Project: Spark
          Issue Type: Sub-task
          Components: Documentation, PySpark
    Affects Versions: 3.4.0
            Reporter: Qian Sun
[jira] [Assigned] (SPARK-40006) Make pyspark.sql.group examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-40006:
------------------------------------

    Assignee: Hyukjin Kwon  (was: Apache Spark)

> Make pyspark.sql.group examples self-contained
> ----------------------------------------------
>
> Key: SPARK-40006
> URL: https://issues.apache.org/jira/browse/SPARK-40006
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
[jira] [Resolved] (SPARK-40006) Make pyspark.sql.group examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40006. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37437 [https://github.com/apache/spark/pull/37437] > Make pyspark.sql.group examples self-contained > -- > > Key: SPARK-40006 > URL: https://issues.apache.org/jira/browse/SPARK-40006 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40006) Make pyspark.sql.group examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40006: Assignee: Apache Spark > Make pyspark.sql.group examples self-contained > -- > > Key: SPARK-40006 > URL: https://issues.apache.org/jira/browse/SPARK-40006 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40002) Limit improperly pushed down through window using ntile function
[ https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40002. -- Fix Version/s: 3.3.1 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 37443 [https://github.com/apache/spark/pull/37443] > Limit improperly pushed down through window using ntile function > > > Key: SPARK-40002 > URL: https://issues.apache.org/jira/browse/SPARK-40002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Labels: correctness > Fix For: 3.3.1, 3.2.3, 3.4.0 > > > Limit is pushed down through a window using the ntile function, which causes > results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions > of Spark (e.g., 3.1.3). > Assume this data: > {noformat} > create table t1 stored as parquet as > select * > from range(101); > {noformat} > Also assume this query: > {noformat} > select id, ntile(10) over (order by id) as nt > from t1 > limit 10; > {noformat} > Spark 3.2.2, Spark 3.3.0, and master produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |2 | > |2 |3 | > |3 |4 | > |4 |5 | > |5 |6 | > |6 |7 | > |7 |8 | > |8 |9 | > |9 |10 | > +---+---+ > {noformat} > However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |1 | > |2 |1 | > |3 |1 | > |4 |1 | > |5 |1 | > |6 |1 | > |7 |1 | > |8 |1 | > |9 |1 | > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
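The divergence above follows from ntile's definition: rows are split into n buckets whose sizes differ by at most one, with the leftover rows going to the earliest buckets. With 101 rows and 10 buckets, bucket 1 holds 11 rows, so the first 10 rows should all report bucket 1 — which is why pushing the limit below the window changes the answer. A pure-Python sketch of that bucket assignment (an illustration of the SQL semantics, not Spark's implementation):

```python
def ntile_assignments(num_rows, n):
    """Assign 1-based bucket numbers the way SQL ntile(n) does:
    the first (num_rows % n) buckets each get one extra row."""
    base, extra = divmod(num_rows, n)
    out = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        out.extend([bucket] * size)
    return out

nt = ntile_assignments(101, 10)
print(nt[:10])  # first ten rows: all bucket 1, matching Spark 3.1.3 / Hive
```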
[jira] [Commented] (SPARK-39994) How to write (save) PySpark dataframe containing vector column?
[ https://issues.apache.org/jira/browse/SPARK-39994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577096#comment-17577096 ] Hyukjin Kwon commented on SPARK-39994: -- It has to be included hadoop instead of spark. > How to write (save) PySpark dataframe containing vector column? > --- > > Key: SPARK-39994 > URL: https://issues.apache.org/jira/browse/SPARK-39994 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Muhammad Kaleem Ullah >Priority: Major > Attachments: df.PNG, error.PNG > > Original Estimate: 168h > Remaining Estimate: 168h > > I'm trying to same the PySpark dataframe after transforming it using ML > Pipeline. But when I save it the weird error is triggered every time. Here > are the columns of this dataframe: > |-- label: integer (nullable = true) > |-- dest_index: double (nullable = false) > |-- dest_fact: vector (nullable = true) > |-- carrier_index: double (nullable = false) > |-- carrier_fact: vector (nullable = true) > |-- features: vector (nullable = true) > And the following error occurs when trying to save this dataframe that > contains vector data: > {code:java} > // training.write.parquet("training_files.parquet", mode = "overwrite") {code} > {noformat} > Py4JJavaError: An error occurred while calling o440.parquet. : > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:278) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > ... > {noformat} > > I tried to use differently available {{winutils}} for Hadoop from [this > GitHub repository|https://github.com/cdarlint/winutils] but with not much > luck. Please help me in this regard. How can I save this dataframe so that I > can read it in any other jupyter notebook file? Feel free to ask any > questions. Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40002) Limit improperly pushed down through window using ntile function
[ https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40002: Assignee: Bruce Robbins > Limit improperly pushed down through window using ntile function > > > Key: SPARK-40002 > URL: https://issues.apache.org/jira/browse/SPARK-40002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Labels: correctness > > Limit is pushed down through a window using the ntile function, which causes > results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions > of Spark (e.g., 3.1.3). > Assume this data: > {noformat} > create table t1 stored as parquet as > select * > from range(101); > {noformat} > Also assume this query: > {noformat} > select id, ntile(10) over (order by id) as nt > from t1 > limit 10; > {noformat} > Spark 3.2.2, Spark 3.3.0, and master produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |2 | > |2 |3 | > |3 |4 | > |4 |5 | > |5 |6 | > |6 |7 | > |7 |8 | > |8 |9 | > |9 |10 | > +---+---+ > {noformat} > However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |1 | > |2 |1 | > |3 |1 | > |4 |1 | > |5 |1 | > |6 |1 | > |7 |1 | > |8 |1 | > |9 |1 | > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries
[ https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577057#comment-17577057 ] Haejoon Lee edited comment on SPARK-39995 at 8/9/22 12:37 AM: -- Let me take a look was (Author: itholic): Let ma take a look > PySpark installation doesn't support Scala 2.13 binaries > > > Key: SPARK-39995 > URL: https://issues.apache.org/jira/browse/SPARK-39995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Oleksandr Shevchenko >Priority: Major > > [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary > [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi] > for Scala 2.13. > Currently, the setup > [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py] > allows to set versions of Spark, Hadoop (PYSPARK_HADOOP_VERSION), and mirror > (PYSPARK_RELEASE_MIRROR) to download needed Spark binaries, but it's always > Scala 2.12 compatible binaries. There isn't any parameter to download > "spark-3.3.0-bin-hadoop3-scala2.13.tgz". > It's possible to download Spark manually and set the needed SPARK_HOME, but > it's hard to use with pip or Poetry. > Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI > but not possible with package managers like Poetry. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries
[ https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577057#comment-17577057 ] Haejoon Lee commented on SPARK-39995: - Let ma take a look > PySpark installation doesn't support Scala 2.13 binaries > > > Key: SPARK-39995 > URL: https://issues.apache.org/jira/browse/SPARK-39995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Oleksandr Shevchenko >Priority: Major > > [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary > [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi] > for Scala 2.13. > Currently, the setup > [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py] > allows to set versions of Spark, Hadoop (PYSPARK_HADOOP_VERSION), and mirror > (PYSPARK_RELEASE_MIRROR) to download needed Spark binaries, but it's always > Scala 2.12 compatible binaries. There isn't any parameter to download > "spark-3.3.0-bin-hadoop3-scala2.13.tgz". > It's possible to download Spark manually and set the needed SPARK_HOME, but > it's hard to use with pip or Poetry. > Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI > but not possible with package managers like Poetry. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40003) Add median to PySpark
[ https://issues.apache.org/jira/browse/SPARK-40003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-40003: - Assignee: Ruifeng Zheng > Add median to PySpark > - > > Key: SPARK-40003 > URL: https://issues.apache.org/jira/browse/SPARK-40003 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor >
[jira] [Resolved] (SPARK-40003) Add median to PySpark
[ https://issues.apache.org/jira/browse/SPARK-40003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-40003. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37434 [https://github.com/apache/spark/pull/37434] > Add median to PySpark > - > > Key: SPARK-40003 > URL: https://issues.apache.org/jira/browse/SPARK-40003 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > >
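As context for the new aggregate: the conventional median of an even-length input is the mean of the two middle values. Python's statistics module shows the expected semantics (a reference point only — PySpark's percentile-based implementation may differ in type handling and null behavior):

```python
import statistics

# Odd-length input: the single middle value after sorting ([1, 1, 3, 4, 5]).
print(statistics.median([3, 1, 4, 1, 5]))  # -> 3
# Even-length input: mean of the two middle values.
print(statistics.median([1, 2, 3, 4]))     # -> 2.5
```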
[jira] [Commented] (SPARK-40002) Limit improperly pushed down through window using ntile function
[ https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576973#comment-17576973 ] Apache Spark commented on SPARK-40002: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/37443 > Limit improperly pushed down through window using ntile function > > > Key: SPARK-40002 > URL: https://issues.apache.org/jira/browse/SPARK-40002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Limit is pushed down through a window using the ntile function, which causes > results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions > of Spark (e.g., 3.1.3). > Assume this data: > {noformat} > create table t1 stored as parquet as > select * > from range(101); > {noformat} > Also assume this query: > {noformat} > select id, ntile(10) over (order by id) as nt > from t1 > limit 10; > {noformat} > Spark 3.2.2, Spark 3.3.0, and master produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |2 | > |2 |3 | > |3 |4 | > |4 |5 | > |5 |6 | > |6 |7 | > |7 |8 | > |8 |9 | > |9 |10 | > +---+---+ > {noformat} > However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |1 | > |2 |1 | > |3 |1 | > |4 |1 | > |5 |1 | > |6 |1 | > |7 |1 | > |8 |1 | > |9 |1 | > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40002) Limit improperly pushed down through window using ntile function
[ https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40002: Assignee: Apache Spark > Limit improperly pushed down through window using ntile function > > > Key: SPARK-40002 > URL: https://issues.apache.org/jira/browse/SPARK-40002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Major > Labels: correctness > > Limit is pushed down through a window using the ntile function, which causes > results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions > of Spark (e.g., 3.1.3). > Assume this data: > {noformat} > create table t1 stored as parquet as > select * > from range(101); > {noformat} > Also assume this query: > {noformat} > select id, ntile(10) over (order by id) as nt > from t1 > limit 10; > {noformat} > Spark 3.2.2, Spark 3.3.0, and master produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |2 | > |2 |3 | > |3 |4 | > |4 |5 | > |5 |6 | > |6 |7 | > |7 |8 | > |8 |9 | > |9 |10 | > +---+---+ > {noformat} > However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |1 | > |2 |1 | > |3 |1 | > |4 |1 | > |5 |1 | > |6 |1 | > |7 |1 | > |8 |1 | > |9 |1 | > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40002) Limit improperly pushed down through window using ntile function
[ https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40002: Assignee: (was: Apache Spark) > Limit improperly pushed down through window using ntile function > > > Key: SPARK-40002 > URL: https://issues.apache.org/jira/browse/SPARK-40002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Limit is pushed down through a window using the ntile function, which causes > results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions > of Spark (e.g., 3.1.3). > Assume this data: > {noformat} > create table t1 stored as parquet as > select * > from range(101); > {noformat} > Also assume this query: > {noformat} > select id, ntile(10) over (order by id) as nt > from t1 > limit 10; > {noformat} > Spark 3.2.2, Spark 3.3.0, and master produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |2 | > |2 |3 | > |3 |4 | > |4 |5 | > |5 |6 | > |6 |7 | > |7 |8 | > |8 |9 | > |9 |10 | > +---+---+ > {noformat} > However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |1 | > |2 |1 | > |3 |1 | > |4 |1 | > |5 |1 | > |6 |1 | > |7 |1 | > |8 |1 | > |9 |1 | > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40002) Limit improperly pushed down through window using ntile function
[ https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-40002: -- Labels: correctness (was: ) > Limit improperly pushed down through window using ntile function > > > Key: SPARK-40002 > URL: https://issues.apache.org/jira/browse/SPARK-40002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Limit is pushed down through a window using the ntile function, which causes > results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions > of Spark (e.g., 3.1.3). > Assume this data: > {noformat} > create table t1 stored as parquet as > select * > from range(101); > {noformat} > Also assume this query: > {noformat} > select id, ntile(10) over (order by id) as nt > from t1 > limit 10; > {noformat} > Spark 3.2.2, Spark 3.3.0, and master produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |2 | > |2 |3 | > |3 |4 | > |4 |5 | > |5 |6 | > |6 |7 | > |7 |8 | > |8 |9 | > |9 |10 | > +---+---+ > {noformat} > However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |1 | > |2 |1 | > |3 |1 | > |4 |1 | > |5 |1 | > |6 |1 | > |7 |1 | > |8 |1 | > |9 |1 | > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40002) Limit improperly pushed down through window using ntile function
[ https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-40002: -- Summary: Limit improperly pushed down through window using ntile function (was: Limit pushed down through window using ntile function) > Limit improperly pushed down through window using ntile function > > > Key: SPARK-40002 > URL: https://issues.apache.org/jira/browse/SPARK-40002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bruce Robbins >Priority: Major > > Limit is pushed down through a window using the ntile function, which causes > results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions > of Spark (e.g., 3.1.3). > Assume this data: > {noformat} > create table t1 stored as parquet as > select * > from range(101); > {noformat} > Also assume this query: > {noformat} > select id, ntile(10) over (order by id) as nt > from t1 > limit 10; > {noformat} > Spark 3.2.2, Spark 3.3.0, and master produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |2 | > |2 |3 | > |3 |4 | > |4 |5 | > |5 |6 | > |6 |7 | > |7 |8 | > |8 |9 | > |9 |10 | > +---+---+ > {noformat} > However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following: > {noformat} > +---+---+ > |id |nt | > +---+---+ > |0 |1 | > |1 |1 | > |2 |1 | > |3 |1 | > |4 |1 | > |5 |1 | > |6 |1 | > |7 |1 | > |8 |1 | > |9 |1 | > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40004) Redundant `LevelDB.get` in `RemoteBlockPushResolver`
[ https://issues.apache.org/jira/browse/SPARK-40004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-40004. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37435 [https://github.com/apache/spark/pull/37435] > Redundant `LevelDB.get` in `RemoteBlockPushResolver` > > > Key: SPARK-40004 > URL: https://issues.apache.org/jira/browse/SPARK-40004 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > {code:java} > void removeAppAttemptPathInfoFromDB(String appId, int attemptId) { > AppAttemptId appAttemptId = new AppAttemptId(appId, attemptId); > if (db != null) { > try { > byte[] key = getDbAppAttemptPathsKey(appAttemptId); > if (db.get(key) != null) { > db.delete(key); > } > } catch (Exception e) { > logger.error("Failed to remove the application attempt {} local path in > DB", > appAttemptId, e); > } > } > } > {code} > No need to check `db.get(key) != null` before delete. LevelDB will handle > this scene. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
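The simplification rests on LevelDB's delete being a no-op for an absent key, so the `db.get(key) != null` pre-check pays for a lookup without changing behavior. A pure-Python analogue of the before/after patterns (a dict stands in for the DB; this sketches the idea, not Spark's code):

```python
def remove_with_check(db, key):
    # Original pattern: an extra lookup before the delete.
    if db.get(key) is not None:
        del db[key]

def remove_direct(db, key):
    # Simplified pattern: deleting an absent key is simply a no-op,
    # mirroring LevelDB's delete() semantics.
    db.pop(key, None)

store = {"a": 1}
remove_direct(store, "a")        # removes the entry
remove_direct(store, "missing")  # no-op, no error
print(store)  # -> {}
```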
[jira] [Assigned] (SPARK-40004) Redundant `LevelDB.get` in `RemoteBlockPushResolver`
[ https://issues.apache.org/jira/browse/SPARK-40004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-40004: --- Assignee: Yang Jie > Redundant `LevelDB.get` in `RemoteBlockPushResolver` > > > Key: SPARK-40004 > URL: https://issues.apache.org/jira/browse/SPARK-40004 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > {code:java} > void removeAppAttemptPathInfoFromDB(String appId, int attemptId) { > AppAttemptId appAttemptId = new AppAttemptId(appId, attemptId); > if (db != null) { > try { > byte[] key = getDbAppAttemptPathsKey(appAttemptId); > if (db.get(key) != null) { > db.delete(key); > } > } catch (Exception e) { > logger.error("Failed to remove the application attempt {} local path in > DB", > appAttemptId, e); > } > } > } > {code} > No need to check `db.get(key) != null` before delete. LevelDB will handle > this scene. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39993) Spark on Kubernetes doesn't filter data by date
[ https://issues.apache.org/jira/browse/SPARK-39993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hanna Liashchuk updated SPARK-39993: Description: I'm creating a Dataset with type date and saving it into s3. When I read it and try to use where() clause, I've noticed it doesn't return data even though it's there Below is the code snippet I'm running {code:java} from pyspark.sql.types import Row from pyspark.sql.functions import * ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", col("date").cast("date")) ds.where("date = '2022-01-01'").show() ds.write.mode("overwrite").parquet("s3a://bucket/test") df = spark.read.format("parquet").load("s3a://bucket/test") df.where("date = '2022-01-01'").show() {code} The first show() returns data, while the second one - no. I've noticed that it's Kubernetes master related, as the same code snipped works ok with master "local" UPD: if the column is used as a partition and has the type "date" or is de facto date but has the type "string", there is no filtering problem. was: I'm creating a Dataset with type date and saving it into s3. When I read it and try to use where() clause, I've noticed it doesn't return data even though it's there Below is the code snippet I'm running {code:java} from pyspark.sql.types import Row from pyspark.sql.functions import * ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", col("date").cast("date")) ds.where("date = '2022-01-01'").show() ds.write.mode("overwrite").parquet("s3a://bucket/test") df = spark.read.format("parquet").load("s3a://bucket/test") df.where("date = '2022-01-01'").show() {code} The first show() returns data, while the second one - no. I've noticed that it's Kubernetes master related, as the same code snipped works ok with master "local" UPD: if the column is used as a partition and has the type "date", there is no filtering problem. 
> Spark on Kubernetes doesn't filter data by date > --- > > Key: SPARK-39993 > URL: https://issues.apache.org/jira/browse/SPARK-39993 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.2 > Environment: Kubernetes v1.23.6 > Spark 3.2.2 > Java 1.8.0_312 > Python 3.9.13 > Aws dependencies: > aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar >Reporter: Hanna Liashchuk >Priority: Major > Labels: kubernetes > > I'm creating a Dataset with type date and saving it into s3. When I read it > and try to use where() clause, I've noticed it doesn't return data even > though it's there > Below is the code snippet I'm running > > {code:java} > from pyspark.sql.types import Row > from pyspark.sql.functions import * > ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", > col("date").cast("date")) > ds.where("date = '2022-01-01'").show() > ds.write.mode("overwrite").parquet("s3a://bucket/test") > df = spark.read.format("parquet").load("s3a://bucket/test") > df.where("date = '2022-01-01'").show() > {code} > The first show() returns data, while the second one - no. > I've noticed that it's Kubernetes master related, as the same code snipped > works ok with master "local" > UPD: if the column is used as a partition and has the type "date" or is de > facto date but has the type "string", there is no filtering problem. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40009) Add doc string to DataFrame union and unionAll
[ https://issues.apache.org/jira/browse/SPARK-40009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576933#comment-17576933 ] Apache Spark commented on SPARK-40009: -- User 'khalidmammadov' has created a pull request for this issue: https://github.com/apache/spark/pull/37441 > Add doc string to DataFrame union and unionAll > -- > > Key: SPARK-40009 > URL: https://issues.apache.org/jira/browse/SPARK-40009 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Priority: Minor > > Provide examples for DataFrame union and unionAll functions for PySpark. Also > document parameters -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40009) Add doc string to DataFrame union and unionAll
[ https://issues.apache.org/jira/browse/SPARK-40009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40009: Assignee: Apache Spark > Add doc string to DataFrame union and unionAll > -- > > Key: SPARK-40009 > URL: https://issues.apache.org/jira/browse/SPARK-40009 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Assignee: Apache Spark >Priority: Minor > > Provide examples for DataFrame union and unionAll functions for PySpark. Also > document parameters -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40009) Add doc string to DataFrame union and unionAll
[ https://issues.apache.org/jira/browse/SPARK-40009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40009: Assignee: (was: Apache Spark) > Add doc string to DataFrame union and unionAll > -- > > Key: SPARK-40009 > URL: https://issues.apache.org/jira/browse/SPARK-40009 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Priority: Minor > > Provide examples for DataFrame union and unionAll functions for PySpark. Also > document parameters -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40009) Add doc string to DataFrame union and unionAll
Khalid Mammadov created SPARK-40009: --- Summary: Add doc string to DataFrame union and unionAll Key: SPARK-40009 URL: https://issues.apache.org/jira/browse/SPARK-40009 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.4.0 Reporter: Khalid Mammadov Provide examples for DataFrame union and unionAll functions for PySpark. Also document parameters
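The point worth documenting is that `union` and its alias `unionAll` are position-based and keep duplicates (SQL UNION ALL semantics); deduplication requires a separate distinct step. A pure-Python model of those semantics, with lists of row tuples standing in for DataFrames (an illustration, not the PySpark code):

```python
def union(rows1, rows2):
    """Position-based union that keeps duplicates, like DataFrame.union
    (and its alias unionAll): columns are matched by position, not name."""
    return rows1 + rows2

def union_distinct(rows1, rows2):
    """union followed by duplicate removal, preserving first-seen order."""
    return list(dict.fromkeys(union(rows1, rows2)))

a = [(1, "x"), (2, "y")]
b = [(2, "y"), (3, "z")]
print(union(a, b))           # duplicates kept: 4 rows
print(union_distinct(a, b))  # -> [(1, 'x'), (2, 'y'), (3, 'z')]
```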
[jira] [Updated] (SPARK-39993) Spark on Kubernetes doesn't filter data by date
[ https://issues.apache.org/jira/browse/SPARK-39993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hanna Liashchuk updated SPARK-39993: Description: I'm creating a Dataset with type date and saving it into s3. When I read it and try to use where() clause, I've noticed it doesn't return data even though it's there Below is the code snippet I'm running {code:java} from pyspark.sql.types import Row from pyspark.sql.functions import * ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", col("date").cast("date")) ds.where("date = '2022-01-01'").show() ds.write.mode("overwrite").parquet("s3a://bucket/test") df = spark.read.format("parquet").load("s3a://bucket/test") df.where("date = '2022-01-01'").show() {code} The first show() returns data, while the second one - no. I've noticed that it's Kubernetes master related, as the same code snipped works ok with master "local" UPD: if the column is used as a partition and has the type "date", there is no filtering problem. was: I'm creating a Dataset with type date and saving it into s3. When I read it and try to use where() clause, I've noticed it doesn't return data even though it's there Below is the code snippet I'm running {code:java} from pyspark.sql.types import Row from pyspark.sql.functions import * ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", col("date").cast("date")) ds.where("date = '2022-01-01'").show() ds.write.mode("overwrite").parquet("s3a://bucket/test") df = spark.read.format("parquet").load("s3a://bucket/test") df.where("date = '2022-01-01'").show() {code} The first show() returns data, while the second one - no. 
I've noticed that it's Kubernetes-master related, as the same code snippet works fine with master "local". > Spark on Kubernetes doesn't filter data by date > --- > > Key: SPARK-39993 > URL: https://issues.apache.org/jira/browse/SPARK-39993 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.2 > Environment: Kubernetes v1.23.6 > Spark 3.2.2 > Java 1.8.0_312 > Python 3.9.13 > Aws dependencies: > aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar >Reporter: Hanna Liashchuk >Priority: Major > Labels: kubernetes > > I'm creating a Dataset with a date-typed column and saving it to S3. When I read it > back and use a where() clause, I've noticed it doesn't return data even > though the data is there. > Below is the code snippet I'm running > > {code:java} > from pyspark.sql.types import Row > from pyspark.sql.functions import * > ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", > col("date").cast("date")) > ds.where("date = '2022-01-01'").show() > ds.write.mode("overwrite").parquet("s3a://bucket/test") > df = spark.read.format("parquet").load("s3a://bucket/test") > df.where("date = '2022-01-01'").show() > {code} > The first show() returns data, while the second one does not. > I've noticed that it's Kubernetes-master related, as the same code snippet > works fine with master "local". > UPD: if the column is used as a partition and has the type "date", there is > no filtering problem. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39965) Skip PVC cleanup when driver doesn't own PVCs
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39965. --- Fix Version/s: 3.3.1 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 37433 [https://github.com/apache/spark/pull/37433] > Skip PVC cleanup when driver doesn't own PVCs > - > > Key: SPARK-39965 > URL: https://issues.apache.org/jira/browse/SPARK-39965 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Trivial > Fix For: 3.3.1, 3.2.3, 3.4.0 > > > From Spark32 . as a part of [https://github.com/apache/spark/pull/32288] , > functionality is added to delete PVC if the Spark driver died. > [https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144] > > However there are cases , where spark on K8s doesn't use PVC and use host > path for storage. > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > Now in those cases , > * it request to delete PVC (which is not required) . > * It also tries to delete in the case where driver doesn't own the PV (or > spark.kubernetes.driver.ownPersistentVolumeClaim is false) > * Moreover in the cluster , where Spark user doesn't have access to list or > delete PVC , it throws exception . > > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > GET at: > [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1]. > Message: Forbidden!Configured service account doesn't have access. Service > account may have been revoked. 
persistentvolumeclaims is forbidden: User > "system:serviceaccount:dpi-dev:spark" cannot list resource > "persistentvolumeclaims" in API group "" in the namespace "<>". > > *Solution* > Ideally there should be a configuration > spark.kubernetes.driver.pvc.deleteOnTermination, or > spark.kubernetes.driver.ownPersistentVolumeClaim should be checked > before calling the API to delete PVCs. If the user has not set up PVs, or the > driver doesn't own them, there is no need to call the API and delete PVCs. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
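The guard proposed in the ticket above can be sketched in a few lines. This is a hypothetical illustration in Python rather than Spark's actual Scala code; the function name and flag names are assumptions made for the sketch, not Spark's API:

```python
# Hypothetical sketch of the proposed guard: only attempt PVC cleanup when
# the driver actually owns PVCs, mirroring a check on something like
# spark.kubernetes.driver.ownPersistentVolumeClaim. With this check in
# place, clusters where the service account lacks list/delete permission on
# persistentvolumeclaims never issue the forbidden API requests.

def should_cleanup_pvcs(uses_pvcs: bool, driver_owns_pvcs: bool) -> bool:
    """Return True only when the driver created PVCs it is responsible for."""
    return uses_pvcs and driver_owns_pvcs

# Host-path-only job: no PVC API calls should be made.
assert should_cleanup_pvcs(uses_pvcs=False, driver_owns_pvcs=False) is False
# Driver-owned PVCs: cleanup proceeds as today.
assert should_cleanup_pvcs(uses_pvcs=True, driver_owns_pvcs=True) is True
```

The point of the sketch is that the decision is made before any Kubernetes API call, so the Forbidden exception quoted above cannot be thrown for jobs that never created PVCs.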
[jira] [Assigned] (SPARK-40008) Support casting integrals to intervals in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40008: Assignee: Max Gekk (was: Apache Spark) > Support casting integrals to intervals in ANSI mode > --- > > Key: SPARK-40008 > URL: https://issues.apache.org/jira/browse/SPARK-40008 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Attachments: Screenshot 2022-06-12 at 13.04.44.png > > > To conform the SQL standard, support casting of interval types to *INT, see > the attached screenshot. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40008) Support casting integrals to intervals in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576886#comment-17576886 ] Apache Spark commented on SPARK-40008: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37442 > Support casting integrals to intervals in ANSI mode > --- > > Key: SPARK-40008 > URL: https://issues.apache.org/jira/browse/SPARK-40008 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Attachments: Screenshot 2022-06-12 at 13.04.44.png > > > To conform the SQL standard, support casting of interval types to *INT, see > the attached screenshot. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40008) Support casting integrals to intervals in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40008: Assignee: Apache Spark (was: Max Gekk) > Support casting integrals to intervals in ANSI mode > --- > > Key: SPARK-40008 > URL: https://issues.apache.org/jira/browse/SPARK-40008 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > Attachments: Screenshot 2022-06-12 at 13.04.44.png > > > To conform the SQL standard, support casting of interval types to *INT, see > the attached screenshot. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40008) Support casting integrals to intervals in ANSI mode
Max Gekk created SPARK-40008: Summary: Support casting integrals to intervals in ANSI mode Key: SPARK-40008 URL: https://issues.apache.org/jira/browse/SPARK-40008 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Fix For: 3.4.0 Attachments: Screenshot 2022-06-12 at 13.04.44.png To conform the SQL standard, support casting of interval types to *INT, see the attached screenshot. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40008) Support casting integrals to intervals in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40008: - Fix Version/s: (was: 3.4.0) > Support casting integrals to intervals in ANSI mode > --- > > Key: SPARK-40008 > URL: https://issues.apache.org/jira/browse/SPARK-40008 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Attachments: Screenshot 2022-06-12 at 13.04.44.png > > > To conform the SQL standard, support casting of interval types to *INT, see > the attached screenshot. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
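For intuition, the cast the ticket title describes maps an integral value to that many units of the target interval qualifier. The sketch below is an illustrative model of that semantics in plain Python (not Spark code); the helper name and the unit table are assumptions made for the example:

```python
from datetime import timedelta

# Illustrative model of casting an integral to a day-time interval:
# CAST(5 AS INTERVAL DAY) would yield an interval of 5 days, i.e. the
# integer counts units of the interval's qualifier.

def integral_to_interval(n: int, unit: str) -> timedelta:
    units = {
        "day": timedelta(days=1),
        "hour": timedelta(hours=1),
        "minute": timedelta(minutes=1),
        "second": timedelta(seconds=1),
    }
    return n * units[unit]

assert integral_to_interval(5, "day") == timedelta(days=5)
assert integral_to_interval(90, "minute") == timedelta(hours=1, minutes=30)
```

Spark's actual implementation would of course use its own interval types (year-month and day-time intervals) and raise errors on overflow in ANSI mode; `timedelta` here only stands in for the day-time case.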
[jira] [Assigned] (SPARK-39828) Catalog.listTables() should respect currentCatalog
[ https://issues.apache.org/jira/browse/SPARK-39828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-39828: --- Assignee: Wenchen Fan > Catalog.listTables() should respect currentCatalog > -- > > Key: SPARK-39828 > URL: https://issues.apache.org/jira/browse/SPARK-39828 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39912) Refine CatalogImpl
[ https://issues.apache.org/jira/browse/SPARK-39912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-39912. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37287 [https://github.com/apache/spark/pull/37287] > Refine CatalogImpl > -- > > Key: SPARK-39912 > URL: https://issues.apache.org/jira/browse/SPARK-39912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39828) Catalog.listTables() should respect currentCatalog
[ https://issues.apache.org/jira/browse/SPARK-39828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-39828. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37287 [https://github.com/apache/spark/pull/37287] > Catalog.listTables() should respect currentCatalog > -- > > Key: SPARK-39828 > URL: https://issues.apache.org/jira/browse/SPARK-39828 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39912) Refine CatalogImpl
[ https://issues.apache.org/jira/browse/SPARK-39912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-39912: --- Assignee: Wenchen Fan > Refine CatalogImpl > -- > > Key: SPARK-39912 > URL: https://issues.apache.org/jira/browse/SPARK-39912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35304) [k8s] Though finishing a job, the driver pod is running infinitely
[ https://issues.apache.org/jira/browse/SPARK-35304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576866#comment-17576866 ] Emilie Lin commented on SPARK-35304: Hi [~ocworld] do you have any updates for this issue? > [k8s] Though finishing a job, the driver pod is running infinitely > -- > > Key: SPARK-35304 > URL: https://issues.apache.org/jira/browse/SPARK-35304 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1, 3.0.2, 3.1.1 >Reporter: Keunhyun Oh >Priority: Major > > Though finishing a job, the driver pod is running infinitely. > Executors are terminated. However, the driver status is not changed to > succeeded. > It is not experienced in spark 2 on k8s. > It is only appeared on spark 3. > > my jvm dump is that > {code:java} > 2021-05-04 15:11:37 > Full thread dump OpenJDK 64-Bit Server VM (25.252-b09 mixed mode): > "Attach Listener" #182 daemon prio=9 os_prio=0 tid=0x7f02bc001000 > nid=0x106 waiting on condition [0x] >java.lang.Thread.State: RUNNABLE >Locked ownable synchronizers: > - None > "DestroyJavaVM" #179 prio=5 os_prio=0 tid=0x7f0fe0017000 nid=0x35 waiting > on condition [0x] >java.lang.Thread.State: RUNNABLE >Locked ownable synchronizers: > - None > "s3a-transfer-unbounded-pool2-t1" #172 daemon prio=5 os_prio=0 > tid=0x7f025d98d000 nid=0xe5 waiting on condition [0x7f01f86f3000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7f0353681b38> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - None > "java-sdk-progress-listener-callback-thread" #169 daemon prio=5 os_prio=0 > tid=0x7f002000f000 nid=0xe2 waiting on condition [0x7f004f7f6000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7f0bdb1ba7c0> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - None > "pool-26-thread-1" #72 prio=5 os_prio=0 tid=0x7f025c829000 nid=0x80 > waiting on condition [0x7f01ba931000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7f0bfdeaa8f0> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - None > "java-sdk-http-connection-reaper" #56 daemon prio=5 os_prio=0 > tid=0x7f025d818000 nid=0x6e waiting on condition [0x7f01fb9fe000] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > com.amazonaws.http.IdleConnectionReaper.run(IdleConnectionReaper.java:188) >Locked ownable synchron
[jira] [Commented] (SPARK-39994) How to write (save) PySpark dataframe containing vector column?
[ https://issues.apache.org/jira/browse/SPARK-39994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576853#comment-17576853 ] Muhammad Kaleem Ullah commented on SPARK-39994: --- I would like to request that this functionality be included in the PySpark package, so that we get a full-fledged version of PySpark with all functionality (including the ability to write such dataframes) and don't have to spend days on it. It's a humble request. Thanks > How to write (save) PySpark dataframe containing vector column? > --- > > Key: SPARK-39994 > URL: https://issues.apache.org/jira/browse/SPARK-39994 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Muhammad Kaleem Ullah >Priority: Major > Attachments: df.PNG, error.PNG > > Original Estimate: 168h > Remaining Estimate: 168h > > I'm trying to save the PySpark dataframe after transforming it using an ML > Pipeline. But when I save it, a weird error is triggered every time. Here > are the columns of this dataframe: > |-- label: integer (nullable = true) > |-- dest_index: double (nullable = false) > |-- dest_fact: vector (nullable = true) > |-- carrier_index: double (nullable = false) > |-- carrier_fact: vector (nullable = true) > |-- features: vector (nullable = true) > And the following error occurs when trying to save this dataframe that > contains vector data: > {code:java} > // training.write.parquet("training_files.parquet", mode = "overwrite") {code} > {noformat} > Py4JJavaError: An error occurred while calling o440.parquet. : > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:278) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > ... > {noformat} > > I tried to use differently available {{winutils}} for Hadoop from [this > GitHub repository|https://github.com/cdarlint/winutils] but with not much > luck. Please help me in this regard. How can I save this dataframe so that I > can read it in any other jupyter notebook file? Feel free to ask any > questions. Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39994) How to write (save) PySpark dataframe containing vector column?
[ https://issues.apache.org/jira/browse/SPARK-39994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576851#comment-17576851 ] Muhammad Kaleem Ullah commented on SPARK-39994: --- Hi [~hyukjin.kwon], here is the full stack trace: {{--- Py4JJavaError Traceback (most recent call last) ~\AppData\Local\Temp\ipykernel_4448\2574092106.py in () > 1 training_df.write.format("parquet").mode("overwrite").save("training_data") ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\sql\readwriter.py in save(self, path, format, mode, partitionBy, **options)966 self._jwrite.save()967 else: --> 968 self._jwrite.save(path)969970 @since(1.4) ~\AppData\Local\Programs\Python\Python310\lib\site-packages\py4j\java_gateway.py in __call__(self, *args) 13191320 answer = self.gateway_client.send_command(command) -> 1321 return_value = get_return_value( 1322 answer, self.gateway_client, self.target_id, self.name) 1323 ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)188 def deco(*a: Any, **kw: Any) -> Any:189 try: --> 190 return f(*a, **kw)191 except Py4JJavaError as e:192 converted = convert_exception(e.java_exception) ~\AppData\Local\Programs\Python\Python310\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)325 if answer[1] == REFERENCE_TYPE: --> 326 raise Py4JJavaError(327 "An error occurred while calling \{0}{1}\{2}.\n".328 format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o357.save. : org.apache.spark.SparkException: Job aborted. 
at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:278) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94) at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81) at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79) at org.apache.spark.sql.execution.Q
[jira] [Commented] (SPARK-39965) Skip PVC cleanup when driver doesn't own PVCs
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576793#comment-17576793 ] pralabhkumar commented on SPARK-39965: -- [~dongjoon] Thx for taking this . This is really helpful > Skip PVC cleanup when driver doesn't own PVCs > - > > Key: SPARK-39965 > URL: https://issues.apache.org/jira/browse/SPARK-39965 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Trivial > > From Spark32 . as a part of [https://github.com/apache/spark/pull/32288] , > functionality is added to delete PVC if the Spark driver died. > [https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144] > > However there are cases , where spark on K8s doesn't use PVC and use host > path for storage. > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > Now in those cases , > * it request to delete PVC (which is not required) . > * It also tries to delete in the case where driver doesn't own the PV (or > spark.kubernetes.driver.ownPersistentVolumeClaim is false) > * Moreover in the cluster , where Spark user doesn't have access to list or > delete PVC , it throws exception . > > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > GET at: > [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1]. > Message: Forbidden!Configured service account doesn't have access. Service > account may have been revoked. 
persistentvolumeclaims is forbidden: User > "system:serviceaccount:dpi-dev:spark" cannot list resource > "persistentvolumeclaims" in API group "" in the namespace "<>". > > *Solution* > Ideally there should be a configuration > spark.kubernetes.driver.pvc.deleteOnTermination, or > spark.kubernetes.driver.ownPersistentVolumeClaim should be checked > before calling the API to delete PVCs. If the user has not set up PVs, or the > driver doesn't own them, there is no need to call the API and delete PVCs. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param
[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-39976: -- Labels: correctness (was: ) > NULL check in ArrayIntersect adds extraneous null from first param > -- > > Key: SPARK-39976 > URL: https://issues.apache.org/jira/browse/SPARK-39976 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Navin Kumar >Priority: Major > Labels: correctness > > This is very likely a regression from SPARK-36829. > When using {{array_intersect(a, b)}}, if the first parameter contains a > {{NULL}} value and the second one does not, an extraneous {{NULL}} is present > in the output. This also leads to {{array_intersect(a, b) != > array_intersect(b, a)}}, which is incorrect, as set intersection should be > commutative. > Example using PySpark: > {code:python} > >>> a = [1, 2, 3] > >>> b = [3, None, 5] > >>> df = spark.sparkContext.parallelize([(a, b)]).toDF(["a","b"]) > >>> df.show() > +---------+------------+ > |        a|           b| > +---------+------------+ > |[1, 2, 3]|[3, null, 5]| > +---------+------------+ > >>> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |                  [3]| > +---------------------+ > >>> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |            [3, null]| > +---------------------+ > {code} > Note that in the first case, {{a}} does not contain a {{NULL}}, and the final > output is correct: {{[3]}}. In the second case, since {{b}} does contain > {{NULL}} and is now the first parameter, the extraneous {{NULL}} appears in the output. 
> The same behavior occurs in Scala when writing to Parquet: > {code:scala} > scala> val a = Array[java.lang.Integer](1, 2, null, 4) > a: Array[Integer] = Array(1, 2, null, 4) > scala> val b = Array[java.lang.Integer](4, 5, 6, 7) > b: Array[Integer] = Array(4, 5, 6, 7) > scala> val df = Seq((a, b)).toDF("a","b") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.write.parquet("/tmp/simple.parquet") > scala> val df = spark.read.parquet("/tmp/simple.parquet") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.show() > +---------------+------------+ > |              a|           b| > +---------------+------------+ > |[1, 2, null, 4]|[4, 5, 6, 7]| > +---------------+------------+ > scala> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |            [null, 4]| > +---------------------+ > scala> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |                  [4]| > +---------------------+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
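The commutative semantics the report argues for can be pinned down with a small reference model. This is a pure-Python illustration, not Spark's implementation: NULL should appear in the result only when both inputs contain it, so as a set the result is the same regardless of argument order:

```python
# Pure-Python reference model (not Spark code) of the semantics
# array_intersect should have: keep elements of the first array, in order
# and deduplicated, that also occur in the second array. None (SQL NULL) is
# treated as equal only to None, so null shows up in the result only when
# BOTH inputs contain it.

def array_intersect(a, b):
    # Use a hashable sentinel key for None so it can live in sets.
    key = lambda x: ("null",) if x is None else x
    b_keys = {key(x) for x in b}
    seen, out = set(), []
    for x in a:
        k = key(x)
        if k in b_keys and k not in seen:
            seen.add(k)
            out.append(x)
    return out

# The two orderings from the bug report: neither produces an extra null.
assert array_intersect([1, 2, 3], [3, None, 5]) == [3]
assert array_intersect([3, None, 5], [1, 2, 3]) == [3]
# null appears only when both sides contain it.
assert array_intersect([1, None, 4], [4, None, 7]) == [None, 4]
```

Under this model `array_intersect(a, b)` and `array_intersect(b, a)` always contain the same elements (only the order follows the first argument), which is exactly the commutativity the report says the Spark 3.3.0 behavior violates.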
[jira] [Commented] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
[ https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576761#comment-17576761 ] Apache Spark commented on SPARK-36663: -- User 'mcdull-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/37440 > When the existing field name is a number, an error will be reported when > reading the orc file > - > > Key: SPARK-36663 > URL: https://issues.apache.org/jira/browse/SPARK-36663 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: mcdull_zhang >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0 > > Attachments: image-2021-09-03-20-56-28-846.png > > > You can use the following methods to reproduce the problem: > {quote}val path = "file:///tmp/test_orc" > spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) > spark.read.orc(path) > {quote} > The error message is like this: > {quote}org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '100' expecting {'ADD', 'AFTER' > == SQL == > struct<100:bigint> > ---^^^ > {quote} > The error is actually issued by this line of code: > {quote}CatalystSqlParser.parseDataType("100:bigint") > {quote} > > The specific background is that spark calls the above code in the process of > converting the schema of the orc file into the catalyst schema. 
> {quote}// code in OrcUtils > private def toCatalystSchema(schema: TypeDescription): StructType = { > CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) > }{quote} > There are two solutions I can currently think of: > # Modify the syntax analysis of Spark SQL to recognize this kind of schema > # The TypeDescription.toString method should add quote symbols to > numeric column names, because the following syntax is supported: > {quote}CatalystSqlParser.parseDataType("`100`:bigint") > {quote} > But currently TypeDescription does not support changing the UNQUOTED_NAMES > variable, so should we first submit a PR to the ORC project to support > configuring this variable? > !image-2021-09-03-20-56-28-846.png! > > What do Spark members think about this issue? > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
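The second proposed solution (quoting numeric column names) can be illustrated with a small helper. This is a hypothetical sketch in Python, not ORC or Spark code, and it deliberately handles only flat, single-level struct strings; nested structs would need a real parser:

```python
import re

# Hypothetical helper illustrating solution 2 above: back-quote any field
# name that is not a valid unquoted identifier before handing a struct
# string like "struct<100:bigint>" to the SQL parser, producing
# "struct<`100`:bigint>", which the parser accepts.

def quote_struct_fields(struct_str: str) -> str:
    inner = struct_str[len("struct<"):-1]   # strip "struct<" and ">"
    fields = []
    for field in inner.split(","):          # flat structs only
        name, _, dtype = field.partition(":")
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            name = f"`{name}`"              # quote e.g. purely numeric names
        fields.append(f"{name}:{dtype}")
    return "struct<" + ",".join(fields) + ">"

assert quote_struct_fields("struct<100:bigint>") == "struct<`100`:bigint>"
assert quote_struct_fields("struct<id:bigint>") == "struct<id:bigint>"
```

This mirrors the observation in the report: the parser rejects `100:bigint` but accepts the back-quoted form, so quoting at schema-string generation time would sidestep the ParseException.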
[jira] [Commented] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
[ https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576759#comment-17576759 ] Apache Spark commented on SPARK-36663: -- User 'mcdull-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/37440 > When the existing field name is a number, an error will be reported when > reading the orc file > - > > Key: SPARK-36663 > URL: https://issues.apache.org/jira/browse/SPARK-36663 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: mcdull_zhang >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0 > > Attachments: image-2021-09-03-20-56-28-846.png > > > You can use the following methods to reproduce the problem: > {quote}val path = "file:///tmp/test_orc" > spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) > spark.read.orc(path) > {quote} > The error message is like this: > {quote}org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '100' expecting {'ADD', 'AFTER' > == SQL == > struct<100:bigint> > ---^^^ > {quote} > The error is actually issued by this line of code: > {quote}CatalystSqlParser.parseDataType("100:bigint") > {quote} > > The specific background is that spark calls the above code in the process of > converting the schema of the orc file into the catalyst schema. 
> {quote}// code in OrcUtils > private def toCatalystSchema(schema: TypeDescription): StructType = { > CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) > }{quote} > There are two solutions I can currently think of: > # Modify the syntax analysis of Spark SQL to recognize this kind of schema > # Make the TypeDescription.toString method add quote symbols around numeric column names, because the following syntax is supported: > {quote}CatalystSqlParser.parseDataType("`100`:bigint") > {quote} > But TypeDescription currently does not support changing the UNQUOTED_NAMES variable, so should we first submit a PR to the ORC project to make this variable configurable? > !image-2021-09-03-20-56-28-846.png! > > What do Spark members think about this issue? > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
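Option 2 in the description (backtick-quoting numeric field names before the schema string reaches the SQL parser) can be illustrated outside Spark. A minimal pure-Python sketch, assuming a flat `name:type,name:type` schema string; the `quote_numeric_fields` helper is hypothetical and is not part of Spark or ORC:

```python
import re

def quote_numeric_fields(schema: str) -> str:
    """Backtick-quote purely numeric field names in a simple
    'name:type,name:type' schema string, so a SQL-style parser does not
    mistake them for numeric literals (the SPARK-36663 failure mode)."""
    def fix(match: re.Match) -> str:
        name = match.group(1)
        # Only digit-only names need quoting; identifiers pass through.
        return f"`{name}`:" if name.isdigit() else f"{name}:"
    # Match each unquoted field name immediately followed by a colon.
    return re.sub(r"(\w+):", fix, schema)

print(quote_numeric_fields("100:bigint"))         # `100`:bigint
print(quote_numeric_fields("id:bigint,100:int"))  # id:bigint,`100`:int
```

With quoting applied, the equivalent of `CatalystSqlParser.parseDataType("`100`:bigint")` parses cleanly, which is the behavior the reporter points out is already supported.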
[jira] [Assigned] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison
[ https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39896: Assignee: (was: Apache Spark) > The structural integrity of the plan is broken after > UnwrapCastInBinaryComparison > - > > Key: SPARK-39896 > URL: https://issues.apache.org/jira/browse/SPARK-39896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > sql("create table t1(a decimal(3, 0)) using parquet") > sql("insert into t1 values(100), (10), (1)") > sql("select * from t1 where a in(10, 10, 0, 1.00)").show > {code} > {noformat} > After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > java.lang.RuntimeException: After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison
[ https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39896: Assignee: Apache Spark > The structural integrity of the plan is broken after > UnwrapCastInBinaryComparison > - > > Key: SPARK-39896 > URL: https://issues.apache.org/jira/browse/SPARK-39896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {code:scala} > sql("create table t1(a decimal(3, 0)) using parquet") > sql("insert into t1 values(100), (10), (1)") > sql("select * from t1 where a in(10, 10, 0, 1.00)").show > {code} > {noformat} > After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > java.lang.RuntimeException: After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison
[ https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576754#comment-17576754 ] Apache Spark commented on SPARK-39896: -- User 'cfmcgrady' has created a pull request for this issue: https://github.com/apache/spark/pull/37439 > The structural integrity of the plan is broken after > UnwrapCastInBinaryComparison > - > > Key: SPARK-39896 > URL: https://issues.apache.org/jira/browse/SPARK-39896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > sql("create table t1(a decimal(3, 0)) using parquet") > sql("insert into t1 values(100), (10), (1)") > sql("select * from t1 where a in(10, 10, 0, 1.00)").show > {code} > {noformat} > After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > java.lang.RuntimeException: After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
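The repro above hinges on in-list literals like 1.00 being cast-compared against a decimal(3, 0) column. A hedged pure-Python sketch of the care an unwrap-style rewrite needs: keep only literals exactly representable at the column's scale, deduplicated, so the rewritten predicate stays well-formed. `unwrap_in_list` is an illustrative helper and says nothing about Spark's actual UnwrapCastInBinaryComparison internals:

```python
from decimal import Decimal

def unwrap_in_list(values, scale):
    """Sketch: rewrite `cast(a as wider) IN (...)` into `a IN (...)` by
    keeping only literals exactly representable at the column's scale,
    deduplicated. E.g. for decimal(3, 0): 1.00 -> 1, 10 kept once."""
    quantum = Decimal(1).scaleb(-scale)  # 10**-scale, the column's step size
    kept = []
    for v in values:
        d = Decimal(str(v))
        q = d.quantize(quantum)
        # Literals that are not exact at this scale can never match; drop them.
        if d == q and q not in kept:
            kept.append(q)
    return kept

print(unwrap_in_list([10, 10, 0, "1.00"], scale=0))  # [Decimal('10'), Decimal('0'), Decimal('1')]
```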
[jira] [Commented] (SPARK-40007) Add Mode to PySpark
[ https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576747#comment-17576747 ] Apache Spark commented on SPARK-40007: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37438 > Add Mode to PySpark > --- > > Key: SPARK-40007 > URL: https://issues.apache.org/jira/browse/SPARK-40007 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40007) Add Mode to PySpark
[ https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40007: Assignee: (was: Apache Spark) > Add Mode to PySpark > --- > > Key: SPARK-40007 > URL: https://issues.apache.org/jira/browse/SPARK-40007 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40007) Add Mode to PySpark
[ https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40007: Assignee: Apache Spark > Add Mode to PySpark > --- > > Key: SPARK-40007 > URL: https://issues.apache.org/jira/browse/SPARK-40007 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40007) Add Mode to PySpark
[ https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576746#comment-17576746 ] Apache Spark commented on SPARK-40007: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37438 > Add Mode to PySpark > --- > > Key: SPARK-40007 > URL: https://issues.apache.org/jira/browse/SPARK-40007 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40007) Add Mode to PySpark
Ruifeng Zheng created SPARK-40007: - Summary: Add Mode to PySpark Key: SPARK-40007 URL: https://issues.apache.org/jira/browse/SPARK-40007 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
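For context, the requested Mode is the most-frequent-value aggregate. A minimal pure-Python sketch of the usual semantics (NULLs ignored, ties broken arbitrarily); this illustrates the behavior only and is not the PySpark API:

```python
from collections import Counter

def mode(values):
    """Return the most frequent non-null value, or None for an empty or
    all-null input. Mirrors the usual SQL aggregate rule of ignoring NULLs;
    ties are broken arbitrarily."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(mode([0, 1, 1, None, 2]))  # 1
```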
[jira] [Assigned] (SPARK-40006) Make pyspark.sql.group examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40006: Assignee: (was: Apache Spark) > Make pyspark.sql.group examples self-contained > -- > > Key: SPARK-40006 > URL: https://issues.apache.org/jira/browse/SPARK-40006 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40006) Make pyspark.sql.group examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576744#comment-17576744 ] Apache Spark commented on SPARK-40006: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37437 > Make pyspark.sql.group examples self-contained > -- > > Key: SPARK-40006 > URL: https://issues.apache.org/jira/browse/SPARK-40006 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40006) Make pyspark.sql.group examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40006: Assignee: Apache Spark > Make pyspark.sql.group examples self-contained > -- > > Key: SPARK-40006 > URL: https://issues.apache.org/jira/browse/SPARK-40006 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40005) Self-contained examples with parameter descriptions in PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-40005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40005: - Target Version/s: 3.4.0 > Self-contained examples with parameter descriptions in PySpark documentation > ---------------------------------------------------------------------------- > > Key: SPARK-40005 > URL: https://issues.apache.org/jira/browse/SPARK-40005 > Project: Spark > Issue Type: Umbrella > Components: Documentation, PySpark > Affects Versions: 3.4.0 > Reporter: Hyukjin Kwon > Priority: Major > > This JIRA aims to improve PySpark documentation in: > - {{pyspark}} > - {{pyspark.ml}} > - {{pyspark.sql}} > - {{pyspark.sql.streaming}} > We should: > - Make the examples self-contained, e.g., https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html > - Document {{Parameters}}, e.g., https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot. Many PySpark APIs are missing parameter descriptions, e.g., [DataFrame.union|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.union.html#pyspark.sql.DataFrame.union] > If a file is large, e.g., dataframe.py, we should split the work into subtasks and improve the documentation per subtask. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
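To make the umbrella's goal concrete, here is a sketch of what a self-contained, parameter-documented docstring looks like in numpydoc style; `union` below is a toy stand-in for illustration, not the real DataFrame.union:

```python
def union(df_rows, other_rows):
    """Return the union of two row lists, preserving duplicates.

    A toy stand-in for ``DataFrame.union`` illustrating the documentation
    style this umbrella asks for: every parameter is described, and the
    example below runs on its own.

    Parameters
    ----------
    df_rows : list
        Rows of the first relation.
    other_rows : list
        Rows of the second relation, resolved by position.

    Examples
    --------
    >>> union([(1,), (2,)], [(2,), (3,)])
    [(1,), (2,), (2,), (3,)]
    """
    return list(df_rows) + list(other_rows)
```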
[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries
[ https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576731#comment-17576731 ] Oleksandr Shevchenko commented on SPARK-39995: -- Thanks [~hyukjin.kwon] for your reply. What do you think about support for package managers like [Poetry|https://python-poetry.org/] ? Is it possible to add parameters or add scala version into the package name to be able to install Spark with 2.13 since package managers don't support using env vars to configure it? > PySpark installation doesn't support Scala 2.13 binaries > > > Key: SPARK-39995 > URL: https://issues.apache.org/jira/browse/SPARK-39995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Oleksandr Shevchenko >Priority: Major > > [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary > [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi] > for Scala 2.13. > Currently, the setup > [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py] > allows to set versions of Spark, Hadoop (PYSPARK_HADOOP_VERSION), and mirror > (PYSPARK_RELEASE_MIRROR) to download needed Spark binaries, but it's always > Scala 2.12 compatible binaries. There isn't any parameter to download > "spark-3.3.0-bin-hadoop3-scala2.13.tgz". > It's possible to download Spark manually and set the needed SPARK_HOME, but > it's hard to use with pip or Poetry. > Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI > but not possible with package managers like Poetry. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
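The naming problem behind this request can be sketched in pure Python: the release archive name embeds the Scala version, so any installer needs a knob for it. `spark_archive_name` and its `scala_version` parameter are hypothetical; only the archive naming pattern (e.g. spark-3.3.0-bin-hadoop3-scala2.13.tgz) comes from the ticket:

```python
def spark_archive_name(spark_version, hadoop_version="hadoop3", scala_version=None):
    """Build a Spark release archive name. The scala_version parameter is
    the hypothetical knob this ticket asks for: default Scala 2.12 builds
    carry no suffix, while Scala 2.13 builds carry '-scala2.13'."""
    suffix = f"-scala{scala_version}" if scala_version and scala_version != "2.12" else ""
    return f"spark-{spark_version}-bin-{hadoop_version}{suffix}.tgz"

print(spark_archive_name("3.3.0"))                        # spark-3.3.0-bin-hadoop3.tgz
print(spark_archive_name("3.3.0", scala_version="2.13"))  # spark-3.3.0-bin-hadoop3-scala2.13.tgz
```

A parameter like this (or a suffixed package name) would let Poetry-style tools pin the Scala flavor declaratively instead of relying on environment variables at install time.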
[jira] [Assigned] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param
[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39976: Assignee: Apache Spark > NULL check in ArrayIntersect adds extraneous null from first param > ------------------------------------------------------------------ > > Key: SPARK-39976 > URL: https://issues.apache.org/jira/browse/SPARK-39976 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.3.0 > Reporter: Navin Kumar > Assignee: Apache Spark > Priority: Major > > This is very likely a regression from SPARK-36829. > When using {{array_intersect(a, b)}}, if the first parameter contains a {{NULL}} value and the second one does not, an extraneous {{NULL}} is present in the output. This also leads to {{array_intersect(a, b) != array_intersect(b, a)}}, which is incorrect, as set intersection should be commutative. > Example using PySpark: > {code:python} > >>> a = [1, 2, 3] > >>> b = [3, None, 5] > >>> df = spark.sparkContext.parallelize([(a, b)]).toDF(["a","b"]) > >>> df.show() > +---------+------------+ > |        a|           b| > +---------+------------+ > |[1, 2, 3]|[3, null, 5]| > +---------+------------+ > >>> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |                  [3]| > +---------------------+ > >>> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |            [3, null]| > +---------------------+ > {code} > Note that in the first case, {{a}} does not contain a {{NULL}} and the final output is correct: {{[3]}}. In the second case, {{b}} does contain a {{NULL}} and is now the first parameter, so the extraneous {{null}} shows up in the output.
> The same behavior occurs in Scala when writing to Parquet: > {code:scala} > scala> val a = Array[java.lang.Integer](1, 2, null, 4) > a: Array[Integer] = Array(1, 2, null, 4) > scala> val b = Array[java.lang.Integer](4, 5, 6, 7) > b: Array[Integer] = Array(4, 5, 6, 7) > scala> val df = Seq((a, b)).toDF("a","b") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.write.parquet("/tmp/simple.parquet") > scala> val df = spark.read.parquet("/tmp/simple.parquet") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.show() > +---------------+------------+ > |              a|           b| > +---------------+------------+ > |[1, 2, null, 4]|[4, 5, 6, 7]| > +---------------+------------+ > scala> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |            [null, 4]| > +---------------------+ > scala> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |                  [4]| > +---------------------+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param
[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576730#comment-17576730 ] Apache Spark commented on SPARK-39976: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/37436 > NULL check in ArrayIntersect adds extraneous null from first param > ------------------------------------------------------------------ > > Key: SPARK-39976 > URL: https://issues.apache.org/jira/browse/SPARK-39976 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.3.0 > Reporter: Navin Kumar > Priority: Major > > This is very likely a regression from SPARK-36829. > When using {{array_intersect(a, b)}}, if the first parameter contains a {{NULL}} value and the second one does not, an extraneous {{NULL}} is present in the output. This also leads to {{array_intersect(a, b) != array_intersect(b, a)}}, which is incorrect, as set intersection should be commutative. > Example using PySpark: > {code:python} > >>> a = [1, 2, 3] > >>> b = [3, None, 5] > >>> df = spark.sparkContext.parallelize([(a, b)]).toDF(["a","b"]) > >>> df.show() > +---------+------------+ > |        a|           b| > +---------+------------+ > |[1, 2, 3]|[3, null, 5]| > +---------+------------+ > >>> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |                  [3]| > +---------------------+ > >>> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |            [3, null]| > +---------------------+ > {code} > Note that in the first case, {{a}} does not contain a {{NULL}} and the final output is correct: {{[3]}}. In the second case, {{b}} does contain a {{NULL}} and is now the first parameter, so the extraneous {{null}} shows up in the output.
> The same behavior occurs in Scala when writing to Parquet: > {code:scala} > scala> val a = Array[java.lang.Integer](1, 2, null, 4) > a: Array[Integer] = Array(1, 2, null, 4) > scala> val b = Array[java.lang.Integer](4, 5, 6, 7) > b: Array[Integer] = Array(4, 5, 6, 7) > scala> val df = Seq((a, b)).toDF("a","b") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.write.parquet("/tmp/simple.parquet") > scala> val df = spark.read.parquet("/tmp/simple.parquet") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.show() > +---------------+------------+ > |              a|           b| > +---------------+------------+ > |[1, 2, null, 4]|[4, 5, 6, 7]| > +---------------+------------+ > scala> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |            [null, 4]| > +---------------------+ > scala> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |                  [4]| > +---------------------+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param
[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39976: Assignee: (was: Apache Spark) > NULL check in ArrayIntersect adds extraneous null from first param > ------------------------------------------------------------------ > > Key: SPARK-39976 > URL: https://issues.apache.org/jira/browse/SPARK-39976 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.3.0 > Reporter: Navin Kumar > Priority: Major > > This is very likely a regression from SPARK-36829. > When using {{array_intersect(a, b)}}, if the first parameter contains a {{NULL}} value and the second one does not, an extraneous {{NULL}} is present in the output. This also leads to {{array_intersect(a, b) != array_intersect(b, a)}}, which is incorrect, as set intersection should be commutative. > Example using PySpark: > {code:python} > >>> a = [1, 2, 3] > >>> b = [3, None, 5] > >>> df = spark.sparkContext.parallelize([(a, b)]).toDF(["a","b"]) > >>> df.show() > +---------+------------+ > |        a|           b| > +---------+------------+ > |[1, 2, 3]|[3, null, 5]| > +---------+------------+ > >>> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |                  [3]| > +---------------------+ > >>> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |            [3, null]| > +---------------------+ > {code} > Note that in the first case, {{a}} does not contain a {{NULL}} and the final output is correct: {{[3]}}. In the second case, {{b}} does contain a {{NULL}} and is now the first parameter, so the extraneous {{null}} shows up in the output.
> The same behavior occurs in Scala when writing to Parquet: > {code:scala} > scala> val a = Array[java.lang.Integer](1, 2, null, 4) > a: Array[Integer] = Array(1, 2, null, 4) > scala> val b = Array[java.lang.Integer](4, 5, 6, 7) > b: Array[Integer] = Array(4, 5, 6, 7) > scala> val df = Seq((a, b)).toDF("a","b") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.write.parquet("/tmp/simple.parquet") > scala> val df = spark.read.parquet("/tmp/simple.parquet") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.show() > +---------------+------------+ > |              a|           b| > +---------------+------------+ > |[1, 2, null, 4]|[4, 5, 6, 7]| > +---------------+------------+ > scala> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |            [null, 4]| > +---------------------+ > scala> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |                  [4]| > +---------------------+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param
[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576729#comment-17576729 ] Apache Spark commented on SPARK-39976: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/37436 > NULL check in ArrayIntersect adds extraneous null from first param > ------------------------------------------------------------------ > > Key: SPARK-39976 > URL: https://issues.apache.org/jira/browse/SPARK-39976 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.3.0 > Reporter: Navin Kumar > Priority: Major > > This is very likely a regression from SPARK-36829. > When using {{array_intersect(a, b)}}, if the first parameter contains a {{NULL}} value and the second one does not, an extraneous {{NULL}} is present in the output. This also leads to {{array_intersect(a, b) != array_intersect(b, a)}}, which is incorrect, as set intersection should be commutative. > Example using PySpark: > {code:python} > >>> a = [1, 2, 3] > >>> b = [3, None, 5] > >>> df = spark.sparkContext.parallelize([(a, b)]).toDF(["a","b"]) > >>> df.show() > +---------+------------+ > |        a|           b| > +---------+------------+ > |[1, 2, 3]|[3, null, 5]| > +---------+------------+ > >>> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |                  [3]| > +---------------------+ > >>> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |            [3, null]| > +---------------------+ > {code} > Note that in the first case, {{a}} does not contain a {{NULL}} and the final output is correct: {{[3]}}. In the second case, {{b}} does contain a {{NULL}} and is now the first parameter, so the extraneous {{null}} shows up in the output.
> The same behavior occurs in Scala when writing to Parquet: > {code:scala} > scala> val a = Array[java.lang.Integer](1, 2, null, 4) > a: Array[Integer] = Array(1, 2, null, 4) > scala> val b = Array[java.lang.Integer](4, 5, 6, 7) > b: Array[Integer] = Array(4, 5, 6, 7) > scala> val df = Seq((a, b)).toDF("a","b") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.write.parquet("/tmp/simple.parquet") > scala> val df = spark.read.parquet("/tmp/simple.parquet") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.show() > +---------------+------------+ > |              a|           b| > +---------------+------------+ > |[1, 2, null, 4]|[4, 5, 6, 7]| > +---------------+------------+ > scala> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |            [null, 4]| > +---------------------+ > scala> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |                  [4]| > +---------------------+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
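A commutative fix for the semantics described above can be sketched in pure Python: keep a null in the result only when both inputs contain one. This illustrates the intended behavior only and is not Spark's ArrayIntersect implementation:

```python
def array_intersect(a, b):
    """Intersection that keeps a null in the result only when BOTH inputs
    contain one, so array_intersect(a, b) equals array_intersect(b, a) as
    a set. Order follows the first argument; first occurrence wins."""
    b_values = {v for v in b if v is not None}
    b_has_null = any(v is None for v in b)
    out, seen, seen_null = [], set(), False
    for v in a:
        if v is None:
            # Emit null once, and only if the other side also has a null.
            if b_has_null and not seen_null:
                out.append(None)
                seen_null = True
        elif v in b_values and v not in seen:
            out.append(v)
            seen.add(v)
    return out

print(array_intersect([1, 2, 3], [3, None, 5]))  # [3]
print(array_intersect([3, None, 5], [1, 2, 3]))  # [3]
```

With this rule, the asymmetric `[3]` vs `[3, null]` results from the report can no longer occur.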
[jira] [Created] (SPARK-40006) Make pyspark.sql.group examples self-contained
Hyukjin Kwon created SPARK-40006: Summary: Make pyspark.sql.group examples self-contained Key: SPARK-40006 URL: https://issues.apache.org/jira/browse/SPARK-40006 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40005) Self-contained examples with parameter descriptions in PySpark documentation
Hyukjin Kwon created SPARK-40005: - Summary: Self-contained examples with parameter descriptions in PySpark documentation Key: SPARK-40005 URL: https://issues.apache.org/jira/browse/SPARK-40005 Project: Spark Issue Type: Umbrella Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Hyukjin Kwon This JIRA aims to improve PySpark documentation in: - {{pyspark}} - {{pyspark.ml}} - {{pyspark.sql}} - {{pyspark.sql.streaming}} We should: - Make the examples self-contained, e.g., https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html - Document {{Parameters}}, e.g., https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot. Many PySpark APIs are missing parameter descriptions, e.g., [DataFrame.union|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.union.html#pyspark.sql.DataFrame.union] If a file is large, e.g., dataframe.py, we should split the work into subtasks and improve the documentation per subtask. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39973) Avoid noisy warnings logs when spark.scheduler.listenerbus.metrics.maxListenerClassesTimed = 0
[ https://issues.apache.org/jira/browse/SPARK-39973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39973. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37432 [https://github.com/apache/spark/pull/37432] > Avoid noisy warnings logs when > spark.scheduler.listenerbus.metrics.maxListenerClassesTimed = 0 > -- > > Key: SPARK-39973 > URL: https://issues.apache.org/jira/browse/SPARK-39973 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Josh Rosen >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > If {{spark.scheduler.listenerbus.metrics.maxListenerClassesTimed}} has been > set to {{0}} to disable listener timers then listener registration will > trigger noisy warnings like > {code:java} > LiveListenerBusMetrics: Not measuring processing time for listener class > org.apache.spark.sql.util.ExecutionListenerBus because a maximum of 0 > listener classes are already timed.{code} > warnings. > We should change the code to not print this warning when > maxListenerClassesTimed = 0. > I don't plan to work on this myself. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39973) Avoid noisy warnings logs when spark.scheduler.listenerbus.metrics.maxListenerClassesTimed = 0
[ https://issues.apache.org/jira/browse/SPARK-39973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39973: Assignee: Hyukjin Kwon > Avoid noisy warnings logs when > spark.scheduler.listenerbus.metrics.maxListenerClassesTimed = 0 > -- > > Key: SPARK-39973 > URL: https://issues.apache.org/jira/browse/SPARK-39973 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Josh Rosen >Assignee: Hyukjin Kwon >Priority: Minor > > If {{spark.scheduler.listenerbus.metrics.maxListenerClassesTimed}} has been > set to {{0}} to disable listener timers then listener registration will > trigger noisy warnings like > {code:java} > LiveListenerBusMetrics: Not measuring processing time for listener class > org.apache.spark.sql.util.ExecutionListenerBus because a maximum of 0 > listener classes are already timed.{code} > warnings. > We should change the code to not print this warning when > maxListenerClassesTimed = 0. > I don't plan to work on this myself. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
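The fix direction, treating a limit of 0 as "timing disabled" and skipping the warning entirely, can be sketched in pure Python. All names below (`maybe_time_listener`, the logger name) are hypothetical and not Spark's actual listener-bus code:

```python
import logging

def maybe_time_listener(timed_classes, max_timed, listener_class,
                        logger=logging.getLogger("LiveListenerBusMetrics")):
    """Decide whether to time a listener class; returns True when timing is
    enabled for it. A limit of 0 means the feature is off, so registration
    stays silent; the warning fires only when a positive limit is exceeded."""
    if listener_class in timed_classes:
        return True
    if len(timed_classes) < max_timed:
        timed_classes.add(listener_class)
        return True
    if max_timed > 0:  # the fix: no noisy warning when timing is disabled
        logger.warning(
            "Not measuring processing time for listener class %s because a "
            "maximum of %d listener classes are already timed.",
            listener_class, max_timed)
    return False

timed = set()
print(maybe_time_listener(timed, 0, "ExecutionListenerBus"))  # False
```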
[jira] [Commented] (SPARK-39753) Broadcast joins should pushdown join constraints as Filter to the larger relation
[ https://issues.apache.org/jira/browse/SPARK-39753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576643#comment-17576643 ] Nick Dimiduk commented on SPARK-39753: -- Linking to the original issue. > Broadcast joins should pushdown join constraints as Filter to the larger > relation > - > > Key: SPARK-39753 > URL: https://issues.apache.org/jira/browse/SPARK-39753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Victor Delépine >Priority: Major > > SPARK-19609 was bulk-closed a while ago, but not fixed. I've decided to > re-open it here for more visibility, since I believe this bug has a major > impact and that fixing it could drastically improve the performance of many > pipelines. > Allow me to paste the initial description again here: > _For broadcast inner-joins, where the smaller relation is known to be small > enough to materialize on a worker, the set of values for all join columns is > known and fits in memory. Spark should translate these values into a > {{Filter}} pushed down to the datasource. The common join condition of > equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} > clause. An example of pushing such filters is already present in the form of > {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks._ > _This optimization could even work when the smaller relation does not fit > entirely in memory. This could be done by partitioning the smaller relation > into N pieces, applying this predicate pushdown for each piece, and unioning > the results._ > > Essentially, when doing a Broadcast join, the smaller side can be used to > filter down the bigger side before performing the join. 
As of today, the join > will read all partitions of the bigger side, without pruning partitions.
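[Editorial note] The optimization the ticket describes can be illustrated with plain collections: collect the distinct join-key values from the small (broadcast) side, then apply them as an IN-style filter to the large side before the join runs. A hedged sketch under those assumptions — this is not Spark's planner code, and the class and method names are invented for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class BroadcastJoinFilterSketch {
    // Rows are int[] pairs: row[0] is the join key, row[1] a payload column.
    // In Spark, the key set from the broadcast side would become a pushed-down
    // Filter (effectively "a IN (...)") on the large relation's scan, letting
    // the datasource prune rows (and ideally whole partitions) before the join.
    static List<int[]> filterLargeSide(List<int[]> large, List<Integer> smallKeys) {
        Set<Integer> keys = new HashSet<>(smallKeys); // distinct broadcast keys
        return large.stream()
            .filter(row -> keys.contains(row[0])) // the pushed-down IN filter
            .collect(Collectors.toList());
    }
}
```

The point of the sketch is the ordering: the filter derived from the small side runs before any join work, so rows of the large side that cannot match are never read into the join at all.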
[jira] [Assigned] (SPARK-40004) Redundant `LevelDB.get` in `RemoteBlockPushResolver`
[ https://issues.apache.org/jira/browse/SPARK-40004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40004: Assignee: Apache Spark > Redundant `LevelDB.get` in `RemoteBlockPushResolver` > > > Key: SPARK-40004 > URL: https://issues.apache.org/jira/browse/SPARK-40004 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > {code:java} > void removeAppAttemptPathInfoFromDB(String appId, int attemptId) { > AppAttemptId appAttemptId = new AppAttemptId(appId, attemptId); > if (db != null) { > try { > byte[] key = getDbAppAttemptPathsKey(appAttemptId); > if (db.get(key) != null) { > db.delete(key); > } > } catch (Exception e) { > logger.error("Failed to remove the application attempt {} local path in > DB", > appAttemptId, e); > } > } > } > {code} > There is no need to check `db.get(key) != null` before the delete: LevelDB's > delete is a no-op when the key is absent, so the extra read is wasted work.
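[Editorial note] The simplification is just to call delete unconditionally and drop the preceding get. A sketch of the reduced shape of the method — here a plain Map stands in for the LevelDB handle so the no-op-on-missing-key behavior is checkable without the LevelDB library, and the method name mirrors the ticket's code rather than any real API:

```java
import java.util.HashMap;
import java.util.Map;

class RemoveSketch {
    // A Map stands in for the LevelDB handle in this sketch. Like LevelDB's
    // delete, Map.remove is simply a no-op when the key is absent, which is
    // why the db.get(key) != null pre-check in the original method is redundant.
    static void removeAppAttemptPathInfo(Map<String, byte[]> db, String key) {
        if (db != null) {
            db.remove(key); // delete unconditionally; missing keys are safe
        }
    }
}
```

The same structure applies to the real method: keep the null check on the handle and the try/catch around the delete, and remove only the get.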