[jira] [Commented] (SPARK-40016) Remove unnecessary TryEval in TrySum

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577203#comment-17577203
 ] 

Apache Spark commented on SPARK-40016:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37446

> Remove unnecessary TryEval in TrySum
> 
>
> Key: SPARK-40016
> URL: https://issues.apache.org/jira/browse/SPARK-40016
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> Remove unnecessary TryEval in TrySum for simplicity.






[jira] [Assigned] (SPARK-40016) Remove unnecessary TryEval in TrySum

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40016:


Assignee: Gengliang Wang  (was: Apache Spark)

> Remove unnecessary TryEval in TrySum
> 
>
> Key: SPARK-40016
> URL: https://issues.apache.org/jira/browse/SPARK-40016
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> Remove unnecessary TryEval in TrySum for simplicity.






[jira] [Assigned] (SPARK-40016) Remove unnecessary TryEval in TrySum

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40016:


Assignee: Apache Spark  (was: Gengliang Wang)

> Remove unnecessary TryEval in TrySum
> 
>
> Key: SPARK-40016
> URL: https://issues.apache.org/jira/browse/SPARK-40016
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Remove unnecessary TryEval in TrySum for simplicity.






[jira] [Created] (SPARK-40016) Remove unnecessary TryEval in TrySum

2022-08-08 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-40016:
--

 Summary: Remove unnecessary TryEval in TrySum
 Key: SPARK-40016
 URL: https://issues.apache.org/jira/browse/SPARK-40016
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Remove unnecessary TryEval in TrySum for simplicity.
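For context: TrySum and TryEval are internal Catalyst expression classes, so the refactoring is not user-visible. A minimal sketch of the try_sum behavior the change is expected to preserve (illustrative only):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# try_sum is the error-tolerant variant of sum: on long overflow it is
# expected to yield NULL instead of raising, and the internal TrySum/TryEval
# cleanup should not change this result.
spark.sql("""
    SELECT try_sum(col) AS total
    FROM VALUES (9223372036854775807L), (1L) AS t(col)
""").show()
{code}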






[jira] [Assigned] (SPARK-40015) Add sc.listArchives and sc.listFiles to PySpark

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40015:


Assignee: Apache Spark

> Add sc.listArchives and sc.listFiles to PySpark 
> 
>
> Key: SPARK-40015
> URL: https://issues.apache.org/jira/browse/SPARK-40015
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-40015) Add sc.listArchives and sc.listFiles to PySpark

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40015:


Assignee: (was: Apache Spark)

> Add sc.listArchives and sc.listFiles to PySpark 
> 
>
> Key: SPARK-40015
> URL: https://issues.apache.org/jira/browse/SPARK-40015
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-40015) Add sc.listArchives and sc.listFiles to PySpark

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577188#comment-17577188
 ] 

Apache Spark commented on SPARK-40015:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37445

> Add sc.listArchives and sc.listFiles to PySpark 
> 
>
> Key: SPARK-40015
> URL: https://issues.apache.org/jira/browse/SPARK-40015
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Created] (SPARK-40015) Add sc.listArchives and sc.listFiles to PySpark

2022-08-08 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40015:
-

 Summary: Add sc.listArchives and sc.listFiles to PySpark 
 Key: SPARK-40015
 URL: https://issues.apache.org/jira/browse/SPARK-40015
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng
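The Scala SparkContext already has listFiles and listArchives; a rough sketch of how the proposed PySpark counterparts might be used (the exact PySpark API shape is assumed from the issue title):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

sc.addFile("https://raw.githubusercontent.com/apache/spark/master/README.md")
sc.addArchive("data/archive.zip")  # hypothetical local archive

# Proposed additions: list everything registered via addFile / addArchive.
print(sc.listFiles)
print(sc.listArchives)
{code}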









[jira] [Updated] (SPARK-39752) Spark job failed with 10M rows data with Broken pipe error

2022-08-08 Thread SHOBHIT SHUKLA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHOBHIT SHUKLA updated SPARK-39752:
---
Attachment: (was: Failed_spark_job_3.0.3.txt)

> Spark job failed with 10M rows data with Broken pipe error
> --
>
> Key: SPARK-39752
> URL: https://issues.apache.org/jira/browse/SPARK-39752
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.3, 3.2.1
>Reporter: SHOBHIT SHUKLA
>Priority: Major
> Fix For: 3.0.2
>
>
> Spark job failed with 10M rows of data with a Broken pipe error. The same Spark 
> job was working previously with the settings "executor_cores": 1, 
> "executor_memory": 1, "driver_cores": 1, "driver_memory": 1, whereas the 
> same job is failing with those Spark settings on 3.0.3 and 3.2.1.
> Major symptoms (slowness, timeout, out of memory, for example): the Spark job is 
> failing with the error java.net.SocketException: Broken pipe (Write failed).
> Here are the Spark settings that work on Spark 3.0.3 and 
> 3.2.1: "executor_cores": 4, "executor_memory": 4, "driver_cores": 4, 
> "driver_memory": 4. The Spark job doesn't work consistently with the above 
> settings; sometimes the cores and memory need to be increased.
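Not part of the original report: a hedged illustration of how the quoted resource settings map onto standard Spark configuration keys (memory values assumed to be gigabytes):

{code:python}
from pyspark.sql import SparkSession

# "executor_cores": 4, "executor_memory": 4, ... roughly correspond to these
# standard Spark properties. Note that driver memory/cores normally have to be
# set before the driver JVM starts (spark-submit or spark-defaults.conf).
spark = (
    SparkSession.builder
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")
    .config("spark.driver.cores", "4")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
{code}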






[jira] [Updated] (SPARK-39752) Spark job failed with 10M rows data with Broken pipe error

2022-08-08 Thread SHOBHIT SHUKLA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHOBHIT SHUKLA updated SPARK-39752:
---
Attachment: (was: spark_job_success_3.0.2.txt)

> Spark job failed with 10M rows data with Broken pipe error
> --
>
> Key: SPARK-39752
> URL: https://issues.apache.org/jira/browse/SPARK-39752
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.3, 3.2.1
>Reporter: SHOBHIT SHUKLA
>Priority: Major
> Fix For: 3.0.2
>
>
> Spark job failed with 10M rows of data with a Broken pipe error. The same Spark 
> job was working previously with the settings "executor_cores": 1, 
> "executor_memory": 1, "driver_cores": 1, "driver_memory": 1, whereas the 
> same job is failing with those Spark settings on 3.0.3 and 3.2.1.
> Major symptoms (slowness, timeout, out of memory, for example): the Spark job is 
> failing with the error java.net.SocketException: Broken pipe (Write failed).
> Here are the Spark settings that work on Spark 3.0.3 and 
> 3.2.1: "executor_cores": 4, "executor_memory": 4, "driver_cores": 4, 
> "driver_memory": 4. The Spark job doesn't work consistently with the above 
> settings; sometimes the cores and memory need to be increased.






[jira] [Commented] (SPARK-38699) Use error classes in the execution errors of dictionary encoding

2022-08-08 Thread Goutam Ghosh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577170#comment-17577170
 ] 

Goutam Ghosh commented on SPARK-38699:
--

[~maxgekk] can you please review the comments on pull request 
https://github.com/apache/spark/pull/37065
and advise whether I should remove the assertion and use the error classes for this 
change?


> Use error classes in the execution errors of dictionary encoding
> 
>
> Key: SPARK-38699
> URL: https://issues.apache.org/jira/browse/SPARK-38699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * useDictionaryEncodingWhenDictionaryOverflowError
> onto error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryExecutionErrorsSuite.






[jira] [Resolved] (SPARK-39863) Upgrade Hadoop to 3.3.4

2022-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-39863.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37281
[https://github.com/apache/spark/pull/37281]

> Upgrade Hadoop to 3.3.4
> ---
>
> Key: SPARK-39863
> URL: https://issues.apache.org/jira/browse/SPARK-39863
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.4.0
>
>
> This JIRA tracks the progress of upgrading Hadoop dependency to 3.3.4






[jira] [Assigned] (SPARK-39863) Upgrade Hadoop to 3.3.4

2022-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-39863:
-

Assignee: Chao Sun

> Upgrade Hadoop to 3.3.4
> ---
>
> Key: SPARK-39863
> URL: https://issues.apache.org/jira/browse/SPARK-39863
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> This JIRA tracks the progress of upgrading Hadoop dependency to 3.3.4






[jira] [Updated] (SPARK-40014) Support cast of decimals to ANSI intervals

2022-08-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40014:
-
Description: Support casts of decimal to ANSI intervals, and preserve the 
fractional parts of seconds in the casts.  (was: Support casts of ANSI 
intervals to decimal, and preserve the fractional parts of seconds in the 
casts.)

> Support cast of decimals to ANSI intervals
> --
>
> Key: SPARK-40014
> URL: https://issues.apache.org/jira/browse/SPARK-40014
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Support casts of decimal to ANSI intervals, and preserve the fractional parts 
> of seconds in the casts.
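A hedged sketch of the behavior this would enable once implemented (the exact accepted syntax and output formatting may differ):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Once decimal -> ANSI interval casts are supported, the fractional part of
# the seconds (the .5 here) should be preserved rather than truncated.
spark.sql("SELECT CAST(1.5 AS INTERVAL SECOND) AS i").show(truncate=False)
{code}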






[jira] [Created] (SPARK-40014) Support cast of decimals to ANSI intervals

2022-08-08 Thread Max Gekk (Jira)
Max Gekk created SPARK-40014:


 Summary: Support cast of decimals to ANSI intervals
 Key: SPARK-40014
 URL: https://issues.apache.org/jira/browse/SPARK-40014
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk
 Fix For: 3.4.0


Support casts of ANSI intervals to decimal, and preserve the fractional parts 
of seconds in the casts.






[jira] [Assigned] (SPARK-40014) Support cast of decimals to ANSI intervals

2022-08-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40014:


Assignee: (was: Max Gekk)

> Support cast of decimals to ANSI intervals
> --
>
> Key: SPARK-40014
> URL: https://issues.apache.org/jira/browse/SPARK-40014
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Support casts of decimal to ANSI intervals, and preserve the fractional parts 
> of seconds in the casts.






[jira] [Updated] (SPARK-39470) Support cast of ANSI intervals to decimals

2022-08-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39470:
-
Parent: SPARK-27790
Issue Type: Sub-task  (was: New Feature)

> Support cast of ANSI intervals to decimals
> --
>
> Key: SPARK-39470
> URL: https://issues.apache.org/jira/browse/SPARK-39470
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Support casts of ANSI intervals to decimal, and preserve the fractional parts 
> of seconds in the casts.






[jira] [Updated] (SPARK-39451) Support casting intervals to integrals in ANSI mode

2022-08-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-39451:
-
Parent: SPARK-27790
Issue Type: Sub-task  (was: New Feature)

> Support casting intervals to integrals in ANSI mode
> ---
>
> Key: SPARK-39451
> URL: https://issues.apache.org/jira/browse/SPARK-39451
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2022-06-12 at 13.04.44.png
>
>
> To conform the SQL standard, support casting of interval types to *INT, see 
> the attached screenshot.
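A rough illustration of the cast in question, assuming ANSI mode is enabled (the attached screenshot itself is not reproduced here):

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.ansi.enabled", "true")
    .getOrCreate()
)

# Casting a day-time interval to an integral type yields the number of whole
# units, e.g. 10 here; values outside the target range should error under ANSI.
spark.sql("SELECT CAST(INTERVAL '10' SECOND AS INT) AS n").show()
{code}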






[jira] [Updated] (SPARK-40008) Support casting integrals to intervals in ANSI mode

2022-08-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40008:
-
Parent: SPARK-27790
Issue Type: Sub-task  (was: New Feature)

> Support casting integrals to intervals in ANSI mode
> ---
>
> Key: SPARK-40008
> URL: https://issues.apache.org/jira/browse/SPARK-40008
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2022-06-12 at 13.04.44.png
>
>
> To conform the SQL standard, support casting of interval types to *INT, see 
> the attached screenshot.






[jira] [Resolved] (SPARK-40008) Support casting integrals to intervals in ANSI mode

2022-08-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40008.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37442
[https://github.com/apache/spark/pull/37442]

> Support casting integrals to intervals in ANSI mode
> ---
>
> Key: SPARK-40008
> URL: https://issues.apache.org/jira/browse/SPARK-40008
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2022-06-12 at 13.04.44.png
>
>
> To conform the SQL standard, support casting of interval types to *INT, see 
> the attached screenshot.






[jira] [Resolved] (SPARK-40013) DS V2 expressions should have the default implementation of toString

2022-08-08 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng resolved SPARK-40013.

Resolution: Won't Fix

> DS V2 expressions should have the default implementation of toString
> 
>
> Key: SPARK-40013
> URL: https://issues.apache.org/jira/browse/SPARK-40013
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, V2 expressions are missing a default toString, which leads to unexpected 
> results.
> We should add a default implementation in the base class Expression using 
> ToStringSQLBuilder.






[jira] [Assigned] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40012:


Assignee: Apache Spark

> Make pyspark.sql.dataframe examples self-contained
> --
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: William Zijie Zhang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577127#comment-17577127
 ] 

Apache Spark commented on SPARK-40012:
--

User 'Transurgeon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37444

> Make pyspark.sql.dataframe examples self-contained
> --
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: William Zijie Zhang
>Priority: Major
>







[jira] [Assigned] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40012:


Assignee: (was: Apache Spark)

> Make pyspark.sql.dataframe examples self-contained
> --
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: William Zijie Zhang
>Priority: Major
>







[jira] [Created] (SPARK-40013) DS V2 expressions should have the default implementation of toString

2022-08-08 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-40013:
--

 Summary: DS V2 expressions should have the default implementation 
of toString
 Key: SPARK-40013
 URL: https://issues.apache.org/jira/browse/SPARK-40013
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


Currently, V2 expressions are missing a default toString, which leads to unexpected 
results.
We should add a default implementation in the base class Expression using 
ToStringSQLBuilder.






[jira] [Updated] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained

2022-08-08 Thread William Zijie Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Zijie Zhang updated SPARK-40012:

Priority: Major  (was: Minor)

> Make pyspark.sql.dataframe examples self-contained
> --
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: William Zijie Zhang
>Priority: Major
>







[jira] [Updated] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained

2022-08-08 Thread William Zijie Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Zijie Zhang updated SPARK-40012:

Component/s: PySpark

> Make pyspark.sql.dataframe examples self-contained
> --
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: William Zijie Zhang
>Priority: Minor
>







[jira] [Updated] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained

2022-08-08 Thread William Zijie Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Zijie Zhang updated SPARK-40012:

Affects Version/s: 3.4.0
   (was: 3.3.0)

> Make pyspark.sql.dataframe examples self-contained
> --
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: William Zijie Zhang
>Priority: Minor
>







[jira] [Created] (SPARK-40012) Make pyspark.sql.group examples self-contained

2022-08-08 Thread William Zijie Zhang (Jira)
William Zijie Zhang created SPARK-40012:
---

 Summary: Make pyspark.sql.group examples self-contained
 Key: SPARK-40012
 URL: https://issues.apache.org/jira/browse/SPARK-40012
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.3.0
Reporter: William Zijie Zhang
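The ticket body is empty; presumably "self-contained" means each docstring example creates its own input instead of relying on state from earlier doctests, roughly along these lines (hypothetical docstring, not the actual PySpark source):

{code:python}
def distinct(self):
    """Returns a new :class:`DataFrame` containing the distinct rows.

    Examples
    --------
    >>> df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "v"])
    >>> df.distinct().count()
    2
    """
{code}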









[jira] [Updated] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained

2022-08-08 Thread William Zijie Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Zijie Zhang updated SPARK-40012:

Summary: Make pyspark.sql.dataframe examples self-contained  (was: Make 
pyspark.sql.group examples self-contained)

> Make pyspark.sql.dataframe examples self-contained
> --
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: William Zijie Zhang
>Priority: Minor
>







[jira] [Resolved] (SPARK-39819) DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39819.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37320
[https://github.com/apache/spark/pull/37320]

> DS V2 aggregate push down can work with Top N or Paging (Sort with group 
> expressions)
> -
>
> Key: SPARK-39819
> URL: https://issues.apache.org/jira/browse/SPARK-39819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, DS V2 aggregate push-down cannot work with Top N (order by ... 
> limit ...) or Paging (order by ... limit ... offset ...).
> If it can work with Top N or Paging, performance will be better.
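For illustration, a hedged sketch of the kind of query this targets; the catalog and table names are made up, and the point is that the ORDER BY + LIMIT on top of a pushed-down aggregate could also be evaluated by the external (DS V2 / JDBC) source:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical JDBC (DS V2) table registered in the catalog as h2.test.employee.
spark.sql("""
    SELECT dept, SUM(salary) AS total
    FROM h2.test.employee
    GROUP BY dept
    ORDER BY total DESC
    LIMIT 3
""").explain()
# With the change, the scan should report a pushed Top N (sort + limit) in
# addition to the pushed aggregate, instead of sorting/limiting on the Spark side.
{code}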






[jira] [Assigned] (SPARK-39819) DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39819:
---

Assignee: jiaan.geng

> DS V2 aggregate push down can work with Top N or Paging (Sort with group 
> expressions)
> -
>
> Key: SPARK-39819
> URL: https://issues.apache.org/jira/browse/SPARK-39819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Currently, DS V2 aggregate push-down cannot work with Top N (order by ... 
> limit ...) or Paging (order by ... limit ... offset ...).
> If it can work with Top N or Paging, performance will be better.






[jira] [Updated] (SPARK-40010) Make pyspark.sql.window examples self-contained

2022-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40010:
-
Summary: Make pyspark.sql.window examples self-contained  (was: Make 
pyspark.sql.windown examples self-contained)

> Make pyspark.sql.window examples self-contained
> ---
>
> Key: SPARK-40010
> URL: https://issues.apache.org/jira/browse/SPARK-40010
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Priority: Major
>







[jira] [Created] (SPARK-40011) Pandas API on Spark requires Pandas

2022-08-08 Thread Daniel Oakley (Jira)
Daniel Oakley created SPARK-40011:
-

 Summary: Pandas API on Spark requires Pandas
 Key: SPARK-40011
 URL: https://issues.apache.org/jira/browse/SPARK-40011
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.3.0
Reporter: Daniel Oakley
 Fix For: 3.3.1


Pandas API on Spark includes code like:

> import pandas as pd
> from pandas.api.types import is_hashable, is_list_like  # type: ignore[attr-defined]

This breaks if you don't have pandas installed on your Spark cluster.

The Pandas API was supposed to be an API, not a pandas integration, so why does 
it require pandas to be installed?

In many places, Spark jobs may be run on various Spark clusters with no 
assurance that particular Python packages are installed at the root level.

Can this dependency be removed, or can the required version of Pandas be bundled 
with the Spark distribution? The same applies to numpy and other dependencies.

If not, the docs should clearly state that it is not merely a Spark API that mirrors 
the Pandas API, but something quite different.
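A minimal repro sketch of the complaint above, assuming a Python environment without pandas installed:

{code:python}
# On an environment without pandas, merely importing the pandas-on-Spark
# module fails, before any data is touched.
import pyspark.pandas as ps  # raises ImportError because pandas cannot be imported

psdf = ps.DataFrame({"a": [1, 2, 3]})  # never reached without pandas
{code}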

 

 






[jira] [Resolved] (SPARK-40007) Add Mode to PySpark

2022-08-08 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40007.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37438
[https://github.com/apache/spark/pull/37438]

> Add Mode to PySpark
> ---
>
> Key: SPARK-40007
> URL: https://issues.apache.org/jira/browse/SPARK-40007
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>
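The ticket has no description; judging from the title and the linked PR, this presumably adds a mode aggregate function to pyspark.sql.functions. A hypothetical usage sketch:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("b",)], ["col"])

# mode() returns the most frequent value in the group ("b" here).
df.select(F.mode("col")).show()
{code}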







[jira] [Assigned] (SPARK-40007) Add Mode to PySpark

2022-08-08 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40007:
-

Assignee: Ruifeng Zheng

> Add Mode to PySpark
> ---
>
> Key: SPARK-40007
> URL: https://issues.apache.org/jira/browse/SPARK-40007
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>







[jira] [Created] (SPARK-40010) Make pyspark.sql.windown examples self-contained

2022-08-08 Thread Qian Sun (Jira)
Qian Sun created SPARK-40010:


 Summary: Make pyspark.sql.windown examples self-contained
 Key: SPARK-40010
 URL: https://issues.apache.org/jira/browse/SPARK-40010
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Qian Sun









[jira] [Assigned] (SPARK-40006) Make pyspark.sql.group examples self-contained

2022-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40006:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Make pyspark.sql.group examples self-contained
> --
>
> Key: SPARK-40006
> URL: https://issues.apache.org/jira/browse/SPARK-40006
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-40006) Make pyspark.sql.group examples self-contained

2022-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40006.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37437
[https://github.com/apache/spark/pull/37437]

> Make pyspark.sql.group examples self-contained
> --
>
> Key: SPARK-40006
> URL: https://issues.apache.org/jira/browse/SPARK-40006
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-40006) Make pyspark.sql.group examples self-contained

2022-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40006:


Assignee: Apache Spark

> Make pyspark.sql.group examples self-contained
> --
>
> Key: SPARK-40006
> URL: https://issues.apache.org/jira/browse/SPARK-40006
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Resolved] (SPARK-40002) Limit improperly pushed down through window using ntile function

2022-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40002.
--
Fix Version/s: 3.3.1
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37443
[https://github.com/apache/spark/pull/37443]

> Limit improperly pushed down through window using ntile function
> 
>
> Key: SPARK-40002
> URL: https://issues.apache.org/jira/browse/SPARK-40002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.3.1, 3.2.3, 3.4.0
>
>
> Limit is pushed down through a window using the ntile function, which causes 
> results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions 
> of Spark (e.g., 3.1.3).
> Assume this data:
> {noformat}
> create table t1 stored as parquet as
> select *
> from range(101);
> {noformat}
> Also assume this query:
> {noformat}
> select id, ntile(10) over (order by id) as nt
> from t1
> limit 10;
> {noformat}
> Spark 3.2.2, Spark 3.3.0, and master produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |2  |
> |2  |3  |
> |3  |4  |
> |4  |5  |
> |5  |6  |
> |6  |7  |
> |7  |8  |
> |8  |9  |
> |9  |10 |
> +---+---+
> {noformat}
> However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |1  |
> |2  |1  |
> |3  |1  |
> |4  |1  |
> |5  |1  |
> |6  |1  |
> |7  |1  |
> |8  |1  |
> |9  |1  |
> +---+---+
> {noformat}






[jira] [Commented] (SPARK-39994) How to write (save) PySpark dataframe containing vector column?

2022-08-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577096#comment-17577096
 ] 

Hyukjin Kwon commented on SPARK-39994:
--

It has to be included in Hadoop instead of Spark.

> How to write (save) PySpark dataframe containing vector column?
> ---
>
> Key: SPARK-39994
> URL: https://issues.apache.org/jira/browse/SPARK-39994
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Muhammad Kaleem Ullah
>Priority: Major
> Attachments: df.PNG, error.PNG
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I'm trying to save the PySpark dataframe after transforming it using an ML 
> Pipeline, but when I save it, a weird error is triggered every time. Here 
> are the columns of this dataframe:
> |-- label: integer (nullable = true)
> |-- dest_index: double (nullable = false)
> |-- dest_fact: vector (nullable = true)
> |-- carrier_index: double (nullable = false)
> |-- carrier_fact: vector (nullable = true)
> |-- features: vector (nullable = true)
> And the following error occurs when trying to save this dataframe that 
> contains vector data:
> {code:java}
> // training.write.parquet("training_files.parquet", mode = "overwrite") {code}
> {noformat}
> Py4JJavaError: An error occurred while calling o440.parquet. : 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:278)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> ...
> {noformat}
>  
> I tried different available {{winutils}} builds for Hadoop from [this 
> GitHub repository|https://github.com/cdarlint/winutils] but without much 
> luck. Please help me in this regard. How can I save this dataframe so that I 
> can read it in any other Jupyter notebook file? Feel free to ask any 
> questions. Thanks






[jira] [Assigned] (SPARK-40002) Limit improperly pushed down through window using ntile function

2022-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40002:


Assignee: Bruce Robbins

> Limit improperly pushed down through window using ntile function
> 
>
> Key: SPARK-40002
> URL: https://issues.apache.org/jira/browse/SPARK-40002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Limit is pushed down through a window using the ntile function, which causes 
> results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions 
> of Spark (e.g., 3.1.3).
> Assume this data:
> {noformat}
> create table t1 stored as parquet as
> select *
> from range(101);
> {noformat}
> Also assume this query:
> {noformat}
> select id, ntile(10) over (order by id) as nt
> from t1
> limit 10;
> {noformat}
> Spark 3.2.2, Spark 3.3.0, and master produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |2  |
> |2  |3  |
> |3  |4  |
> |4  |5  |
> |5  |6  |
> |6  |7  |
> |7  |8  |
> |8  |9  |
> |9  |10 |
> +---+---+
> {noformat}
> However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |1  |
> |2  |1  |
> |3  |1  |
> |4  |1  |
> |5  |1  |
> |6  |1  |
> |7  |1  |
> |8  |1  |
> |9  |1  |
> +---+---+
> {noformat}






[jira] [Comment Edited] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries

2022-08-08 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577057#comment-17577057
 ] 

Haejoon Lee edited comment on SPARK-39995 at 8/9/22 12:37 AM:
--

Let me take a look


was (Author: itholic):
Let ma take a look

> PySpark installation doesn't support Scala 2.13 binaries
> 
>
> Key: SPARK-39995
> URL: https://issues.apache.org/jira/browse/SPARK-39995
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Oleksandr Shevchenko
>Priority: Major
>
> [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary 
> [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi]
>  for Scala 2.13.
> Currently, the setup 
> [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py]
>  allows setting versions of Spark, Hadoop (PYSPARK_HADOOP_VERSION), and mirror 
> (PYSPARK_RELEASE_MIRROR) to download needed Spark binaries, but it's always 
> Scala 2.12 compatible binaries. There isn't any parameter to download 
> "spark-3.3.0-bin-hadoop3-scala2.13.tgz".
> It's possible to download Spark manually and set the needed SPARK_HOME, but 
> it's hard to use with pip or Poetry.
> Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI 
> but not possible with package managers like Poetry.






[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries

2022-08-08 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577057#comment-17577057
 ] 

Haejoon Lee commented on SPARK-39995:
-

Let ma take a look

> PySpark installation doesn't support Scala 2.13 binaries
> 
>
> Key: SPARK-39995
> URL: https://issues.apache.org/jira/browse/SPARK-39995
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Oleksandr Shevchenko
>Priority: Major
>
> [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary 
> [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi]
>  for Scala 2.13.
> Currently, the setup 
> [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py]
>  allows setting versions of Spark, Hadoop (PYSPARK_HADOOP_VERSION), and mirror 
> (PYSPARK_RELEASE_MIRROR) to download needed Spark binaries, but it's always 
> Scala 2.12 compatible binaries. There isn't any parameter to download 
> "spark-3.3.0-bin-hadoop3-scala2.13.tgz".
> It's possible to download Spark manually and set the needed SPARK_HOME, but 
> it's hard to use with pip or Poetry.
> Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI 
> but not possible with package managers like Poetry.






[jira] [Assigned] (SPARK-40003) Add median to PySpark

2022-08-08 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40003:
-

Assignee: Ruifeng Zheng

> Add median to PySpark
> -
>
> Key: SPARK-40003
> URL: https://issues.apache.org/jira/browse/SPARK-40003
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>
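As with SPARK-40007 above, the description is empty; presumably this exposes a median aggregate in pyspark.sql.functions. A hypothetical sketch:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (10.0,)], ["v"])

# median() returns the median of the values in the group (2.0 here).
df.select(F.median("v")).show()
{code}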







[jira] [Resolved] (SPARK-40003) Add median to PySpark

2022-08-08 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40003.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37434
[https://github.com/apache/spark/pull/37434]

> Add median to PySpark
> -
>
> Key: SPARK-40003
> URL: https://issues.apache.org/jira/browse/SPARK-40003
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-40002) Limit improperly pushed down through window using ntile function

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576973#comment-17576973
 ] 

Apache Spark commented on SPARK-40002:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/37443

> Limit improperly pushed down through window using ntile function
> 
>
> Key: SPARK-40002
> URL: https://issues.apache.org/jira/browse/SPARK-40002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Limit is pushed down through a window using the ntile function, which causes 
> results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions 
> of Spark (e.g., 3.1.3).
> Assume this data:
> {noformat}
> create table t1 stored as parquet as
> select *
> from range(101);
> {noformat}
> Also assume this query:
> {noformat}
> select id, ntile(10) over (order by id) as nt
> from t1
> limit 10;
> {noformat}
> Spark 3.2.2, Spark 3.3.0, and master produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |2  |
> |2  |3  |
> |3  |4  |
> |4  |5  |
> |5  |6  |
> |6  |7  |
> |7  |8  |
> |8  |9  |
> |9  |10 |
> +---+---+
> {noformat}
> However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |1  |
> |2  |1  |
> |3  |1  |
> |4  |1  |
> |5  |1  |
> |6  |1  |
> |7  |1  |
> |8  |1  |
> |9  |1  |
> +---+---+
> {noformat}






[jira] [Assigned] (SPARK-40002) Limit improperly pushed down through window using ntile function

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40002:


Assignee: Apache Spark

> Limit improperly pushed down through window using ntile function
> 
>
> Key: SPARK-40002
> URL: https://issues.apache.org/jira/browse/SPARK-40002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> Limit is pushed down through a window using the ntile function, which causes 
> results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions 
> of Spark (e.g., 3.1.3).
> Assume this data:
> {noformat}
> create table t1 stored as parquet as
> select *
> from range(101);
> {noformat}
> Also assume this query:
> {noformat}
> select id, ntile(10) over (order by id) as nt
> from t1
> limit 10;
> {noformat}
> Spark 3.2.2, Spark 3.3.0, and master produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |2  |
> |2  |3  |
> |3  |4  |
> |4  |5  |
> |5  |6  |
> |6  |7  |
> |7  |8  |
> |8  |9  |
> |9  |10 |
> +---+---+
> {noformat}
> However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |1  |
> |2  |1  |
> |3  |1  |
> |4  |1  |
> |5  |1  |
> |6  |1  |
> |7  |1  |
> |8  |1  |
> |9  |1  |
> +---+---+
> {noformat}






[jira] [Assigned] (SPARK-40002) Limit improperly pushed down through window using ntile function

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40002:


Assignee: (was: Apache Spark)

> Limit improperly pushed down through window using ntile function
> 
>
> Key: SPARK-40002
> URL: https://issues.apache.org/jira/browse/SPARK-40002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Limit is pushed down through a window using the ntile function, which causes 
> results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions 
> of Spark (e.g., 3.1.3).
> Assume this data:
> {noformat}
> create table t1 stored as parquet as
> select *
> from range(101);
> {noformat}
> Also assume this query:
> {noformat}
> select id, ntile(10) over (order by id) as nt
> from t1
> limit 10;
> {noformat}
> Spark 3.2.2, Spark 3.3.0, and master produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |2  |
> |2  |3  |
> |3  |4  |
> |4  |5  |
> |5  |6  |
> |6  |7  |
> |7  |8  |
> |8  |9  |
> |9  |10 |
> +---+---+
> {noformat}
> However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |1  |
> |2  |1  |
> |3  |1  |
> |4  |1  |
> |5  |1  |
> |6  |1  |
> |7  |1  |
> |8  |1  |
> |9  |1  |
> +---+---+
> {noformat}






[jira] [Updated] (SPARK-40002) Limit improperly pushed down through window using ntile function

2022-08-08 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-40002:
--
Labels: correctness  (was: )

> Limit improperly pushed down through window using ntile function
> 
>
> Key: SPARK-40002
> URL: https://issues.apache.org/jira/browse/SPARK-40002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Limit is pushed down through a window using the ntile function, which causes 
> results that differ from Hive 2.3.9, and Prestodb 0.268, and older versions 
> of Spark (e.g., 3.1.3).
> Assume this data:
> {noformat}
> create table t1 stored as parquet as
> select *
> from range(101);
> {noformat}
> Also assume this query:
> {noformat}
> select id, ntile(10) over (order by id) as nt
> from t1
> limit 10;
> {noformat}
> Spark 3.2.2, Spark 3.3.0, and master produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |2  |
> |2  |3  |
> |3  |4  |
> |4  |5  |
> |5  |6  |
> |6  |7  |
> |7  |8  |
> |8  |9  |
> |9  |10 |
> +---+---+
> {noformat}
> However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |1  |
> |2  |1  |
> |3  |1  |
> |4  |1  |
> |5  |1  |
> |6  |1  |
> |7  |1  |
> |8  |1  |
> |9  |1  |
> +---+---+
> {noformat}






[jira] [Updated] (SPARK-40002) Limit improperly pushed down through window using ntile function

2022-08-08 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-40002:
--
Summary: Limit improperly pushed down through window using ntile function  
(was: Limit pushed down through window using ntile function)

> Limit improperly pushed down through window using ntile function
> 
>
> Key: SPARK-40002
> URL: https://issues.apache.org/jira/browse/SPARK-40002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bruce Robbins
>Priority: Major
>
> Limit is pushed down through a window using the ntile function, which causes 
> results that differ from Hive 2.3.9, Prestodb 0.268, and older versions 
> of Spark (e.g., 3.1.3).
> Assume this data:
> {noformat}
> create table t1 stored as parquet as
> select *
> from range(101);
> {noformat}
> Also assume this query:
> {noformat}
> select id, ntile(10) over (order by id) as nt
> from t1
> limit 10;
> {noformat}
> Spark 3.2.2, Spark 3.3.0, and master produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |2  |
> |2  |3  |
> |3  |4  |
> |4  |5  |
> |5  |6  |
> |6  |7  |
> |7  |8  |
> |8  |9  |
> |9  |10 |
> +---+---+
> {noformat}
> However, Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268 produce the following:
> {noformat}
> +---+---+
> |id |nt |
> +---+---+
> |0  |1  |
> |1  |1  |
> |2  |1  |
> |3  |1  |
> |4  |1  |
> |5  |1  |
> |6  |1  |
> |7  |1  |
> |8  |1  |
> |9  |1  |
> +---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40004) Redundant `LevelDB.get` in `RemoteBlockPushResolver`

2022-08-08 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-40004.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37435
[https://github.com/apache/spark/pull/37435]

> Redundant `LevelDB.get` in `RemoteBlockPushResolver`
> 
>
> Key: SPARK-40004
> URL: https://issues.apache.org/jira/browse/SPARK-40004
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> {code:java}
> void removeAppAttemptPathInfoFromDB(String appId, int attemptId) {
>   AppAttemptId appAttemptId = new AppAttemptId(appId, attemptId);
>   if (db != null) {
> try {
>   byte[] key = getDbAppAttemptPathsKey(appAttemptId);
>   if (db.get(key) != null) {
> db.delete(key);
>   }
> } catch (Exception e) {
>   logger.error("Failed to remove the application attempt {} local path in 
> DB",
>   appAttemptId, e);
> }
>   }
> }
>  {code}
> No need to check `db.get(key) != null` before deleting; LevelDB handles a 
> delete of a missing key on its own.
>  
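[Editorial sketch, not taken from the ticket or the merged patch] The simplification described above amounts to dropping the preliminary get and deleting unconditionally; all names follow the snippet quoted above:

{code:java}
// Sketch only: LevelDB's delete is effectively a no-op for a missing key,
// so the db.get(key) pre-check can be removed.
void removeAppAttemptPathInfoFromDB(String appId, int attemptId) {
  AppAttemptId appAttemptId = new AppAttemptId(appId, attemptId);
  if (db != null) {
    try {
      db.delete(getDbAppAttemptPathsKey(appAttemptId));
    } catch (Exception e) {
      logger.error("Failed to remove the application attempt {} local path in DB",
        appAttemptId, e);
    }
  }
}
{code}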



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40004) Redundant `LevelDB.get` in `RemoteBlockPushResolver`

2022-08-08 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-40004:
---

Assignee: Yang Jie

> Redundant `LevelDB.get` in `RemoteBlockPushResolver`
> 
>
> Key: SPARK-40004
> URL: https://issues.apache.org/jira/browse/SPARK-40004
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> {code:java}
> void removeAppAttemptPathInfoFromDB(String appId, int attemptId) {
>   AppAttemptId appAttemptId = new AppAttemptId(appId, attemptId);
>   if (db != null) {
> try {
>   byte[] key = getDbAppAttemptPathsKey(appAttemptId);
>   if (db.get(key) != null) {
> db.delete(key);
>   }
> } catch (Exception e) {
>   logger.error("Failed to remove the application attempt {} local path in 
> DB",
>   appAttemptId, e);
> }
>   }
> }
>  {code}
> No need to check `db.get(key) != null` before deleting; LevelDB handles a 
> delete of a missing key on its own.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39993) Spark on Kubernetes doesn't filter data by date

2022-08-08 Thread Hanna Liashchuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanna Liashchuk updated SPARK-39993:

Description: 
I'm creating a Dataset with a date-typed column and saving it into S3. When I 
read it back and try to use the where() clause, I've noticed it doesn't return 
data even though it's there.

Below is the code snippet I'm running

 
{code:java}
from pyspark.sql.types import Row
from pyspark.sql.functions import *
ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", 
col("date").cast("date"))
ds.where("date = '2022-01-01'").show()
ds.write.mode("overwrite").parquet("s3a://bucket/test")
df = spark.read.format("parquet").load("s3a://bucket/test")
df.where("date = '2022-01-01'").show()
{code}
The first show() returns data, while the second one does not.

I've noticed that it's related to the Kubernetes master, as the same code 
snippet works fine with master "local".

UPD: if the column is used as a partition and has the type "date", or is de 
facto a date but is stored as type "string", there is no filtering problem.

 

 

  was:
I'm creating a Dataset with a date-typed column and saving it into S3. When I 
read it back and try to use the where() clause, I've noticed it doesn't return 
data even though it's there.

Below is the code snippet I'm running

 
{code:java}
from pyspark.sql.types import Row
from pyspark.sql.functions import *
ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", 
col("date").cast("date"))
ds.where("date = '2022-01-01'").show()
ds.write.mode("overwrite").parquet("s3a://bucket/test")
df = spark.read.format("parquet").load("s3a://bucket/test")
df.where("date = '2022-01-01'").show()
{code}
The first show() returns data, while the second one does not.

I've noticed that it's related to the Kubernetes master, as the same code 
snippet works fine with master "local".

UPD: if the column is used as a partition and has the type "date", there is no 
filtering problem. 

 

 


> Spark on Kubernetes doesn't filter data by date
> ---
>
> Key: SPARK-39993
> URL: https://issues.apache.org/jira/browse/SPARK-39993
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.2
> Environment: Kubernetes v1.23.6
> Spark 3.2.2
> Java 1.8.0_312
> Python 3.9.13
> Aws dependencies:
> aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar
>Reporter: Hanna Liashchuk
>Priority: Major
>  Labels: kubernetes
>
> I'm creating a Dataset with a date-typed column and saving it into S3. When I 
> read it back and try to use the where() clause, I've noticed it doesn't return 
> data even though it's there.
> Below is the code snippet I'm running
>  
> {code:java}
> from pyspark.sql.types import Row
> from pyspark.sql.functions import *
> ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", 
> col("date").cast("date"))
> ds.where("date = '2022-01-01'").show()
> ds.write.mode("overwrite").parquet("s3a://bucket/test")
> df = spark.read.format("parquet").load("s3a://bucket/test")
> df.where("date = '2022-01-01'").show()
> {code}
> The first show() returns data, while the second one does not.
> I've noticed that it's related to the Kubernetes master, as the same code 
> snippet works fine with master "local".
> UPD: if the column is used as a partition and has the type "date", or is de 
> facto a date but is stored as type "string", there is no filtering problem.
>  
>  
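[Editorial suggestion, not from the report] One way to narrow this down is to compare the physical plans and retry with Parquet filter pushdown disabled; both the explain() API and the spark.sql.parquet.filterPushdown setting are standard Spark features:

{code:python}
# Sketch only: diagnostics, not a fix.
df = spark.read.format("parquet").load("s3a://bucket/test")
df.where("date = '2022-01-01'").explain(True)   # inspect PushedFilters in the Parquet scan

# Retry with Parquet filter pushdown disabled to see whether pushdown is implicated.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.read.format("parquet").load("s3a://bucket/test").where("date = '2022-01-01'").show()
{code}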



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40009) Add doc string to DataFrame union and unionAll

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576933#comment-17576933
 ] 

Apache Spark commented on SPARK-40009:
--

User 'khalidmammadov' has created a pull request for this issue:
https://github.com/apache/spark/pull/37441

> Add doc string to DataFrame union and unionAll
> --
>
> Key: SPARK-40009
> URL: https://issues.apache.org/jira/browse/SPARK-40009
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Khalid Mammadov
>Priority: Minor
>
> Provide examples for the DataFrame union and unionAll functions in PySpark, 
> and document their parameters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40009) Add doc string to DataFrame union and unionAll

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40009:


Assignee: Apache Spark

> Add doc string to DataFrame union and unionAll
> --
>
> Key: SPARK-40009
> URL: https://issues.apache.org/jira/browse/SPARK-40009
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Khalid Mammadov
>Assignee: Apache Spark
>Priority: Minor
>
> Provide examples for the DataFrame union and unionAll functions in PySpark, 
> and document their parameters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40009) Add doc string to DataFrame union and unionAll

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40009:


Assignee: (was: Apache Spark)

> Add doc string to DataFrame union and unionAll
> --
>
> Key: SPARK-40009
> URL: https://issues.apache.org/jira/browse/SPARK-40009
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Khalid Mammadov
>Priority: Minor
>
> Provide examples for the DataFrame union and unionAll functions in PySpark, 
> and document their parameters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40009) Add doc string to DataFrame union and unionAll

2022-08-08 Thread Khalid Mammadov (Jira)
Khalid Mammadov created SPARK-40009:
---

 Summary: Add doc string to DataFrame union and unionAll
 Key: SPARK-40009
 URL: https://issues.apache.org/jira/browse/SPARK-40009
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.4.0
Reporter: Khalid Mammadov


Provide examples for the DataFrame union and unionAll functions in PySpark, 
and document their parameters.
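[Editorial sketch] The kind of example the docstrings could include might look like this (wording is hypothetical; union and unionAll are existing DataFrame methods, and unionAll is an alias of union):

{code:python}
>>> df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v"])
>>> df2 = spark.createDataFrame([(2, "b"), (3, "c")], ["id", "v"])
>>> df1.union(df2).count()             # resolves columns by position, keeps duplicates
4
>>> df1.unionAll(df2).count()          # alias of union
4
>>> df1.union(df2).distinct().count()  # de-duplicate explicitly
3
{code}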



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39993) Spark on Kubernetes doesn't filter data by date

2022-08-08 Thread Hanna Liashchuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanna Liashchuk updated SPARK-39993:

Description: 
I'm creating a Dataset with a date-typed column and saving it into S3. When I 
read it back and try to use the where() clause, I've noticed it doesn't return 
data even though it's there.

Below is the code snippet I'm running

 
{code:java}
from pyspark.sql.types import Row
from pyspark.sql.functions import *
ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", 
col("date").cast("date"))
ds.where("date = '2022-01-01'").show()
ds.write.mode("overwrite").parquet("s3a://bucket/test")
df = spark.read.format("parquet").load("s3a://bucket/test")
df.where("date = '2022-01-01'").show()
{code}
The first show() returns data, while the second one does not.

I've noticed that it's related to the Kubernetes master, as the same code 
snippet works fine with master "local".

UPD: if the column is used as a partition and has the type "date", there is no 
filtering problem. 

 

 

  was:
I'm creating a Dataset with a date-typed column and saving it into S3. When I 
read it back and try to use the where() clause, I've noticed it doesn't return 
data even though it's there.

Below is the code snippet I'm running



 
{code:java}
from pyspark.sql.types import Row
from pyspark.sql.functions import *
ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", 
col("date").cast("date"))
ds.where("date = '2022-01-01'").show()
ds.write.mode("overwrite").parquet("s3a://bucket/test")
df = spark.read.format("parquet").load("s3a://bucket/test")
df.where("date = '2022-01-01'").show()
{code}
The first show() returns data, while the second one does not.

I've noticed that it's related to the Kubernetes master, as the same code 
snippet works fine with master "local".

 

 


> Spark on Kubernetes doesn't filter data by date
> ---
>
> Key: SPARK-39993
> URL: https://issues.apache.org/jira/browse/SPARK-39993
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.2
> Environment: Kubernetes v1.23.6
> Spark 3.2.2
> Java 1.8.0_312
> Python 3.9.13
> Aws dependencies:
> aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar
>Reporter: Hanna Liashchuk
>Priority: Major
>  Labels: kubernetes
>
> I'm creating a Dataset with a date-typed column and saving it into S3. When I 
> read it back and try to use the where() clause, I've noticed it doesn't return 
> data even though it's there.
> Below is the code snippet I'm running
>  
> {code:java}
> from pyspark.sql.types import Row
> from pyspark.sql.functions import *
> ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", 
> col("date").cast("date"))
> ds.where("date = '2022-01-01'").show()
> ds.write.mode("overwrite").parquet("s3a://bucket/test")
> df = spark.read.format("parquet").load("s3a://bucket/test")
> df.where("date = '2022-01-01'").show()
> {code}
> The first show() returns data, while the second one does not.
> I've noticed that it's related to the Kubernetes master, as the same code 
> snippet works fine with master "local".
> UPD: if the column is used as a partition and has the type "date", there is 
> no filtering problem. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39965) Skip PVC cleanup when driver doesn't own PVCs

2022-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-39965.
---
Fix Version/s: 3.3.1
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37433
[https://github.com/apache/spark/pull/37433]

> Skip PVC cleanup when driver doesn't own PVCs
> -
>
> Key: SPARK-39965
> URL: https://issues.apache.org/jira/browse/SPARK-39965
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: pralabhkumar
>Priority: Trivial
> Fix For: 3.3.1, 3.2.3, 3.4.0
>
>
> Since Spark 3.2, as part of [https://github.com/apache/spark/pull/32288], 
> functionality was added to delete PVCs if the Spark driver died. 
> [https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144]
>  
> However, there are cases where Spark on K8s doesn't use PVCs and uses host 
> paths for storage. 
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]
>  
> Now, in those cases:
>  * it requests deletion of PVCs (which is not required);
>  * it also tries to delete them when the driver doesn't own the PVs (or 
> spark.kubernetes.driver.ownPersistentVolumeClaim is false);
>  * moreover, in clusters where the Spark user doesn't have access to list or 
> delete PVCs, it throws an exception:
>  
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> GET at: 
> [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1].
>  Message: Forbidden!Configured service account doesn't have access. Service 
> account may have been revoked. persistentvolumeclaims is forbidden: User 
> "system:serviceaccount:dpi-dev:spark" cannot list resource 
> "persistentvolumeclaims" in API group "" in the namespace "<>".
>  
> *Solution*
> Ideally there should be a configuration such as 
> spark.kubernetes.driver.pvc.deleteOnTermination, or 
> spark.kubernetes.driver.ownPersistentVolumeClaim should be checked 
> before calling the API to delete PVCs. If the user has not set up PVs, or the 
> driver doesn't own them, there is no need to call the API and delete PVCs.
>  
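[Editorial sketch, not the merged change] The guard described under *Solution* would amount to checking the ownership setting before issuing the delete; the config name comes from the ticket, while the accessor and surrounding code are assumptions:

{code:scala}
// Sketch only: skip PVC cleanup unless the driver actually owns PVCs.
val ownsPvcs = conf
  .get("spark.kubernetes.driver.ownPersistentVolumeClaim", "true")
  .toBoolean
if (ownsPvcs) {
  kubernetesClient
    .persistentVolumeClaims()
    .withLabel(SPARK_APP_ID_LABEL, applicationId())
    .delete()
}
{code}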



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40008) Support casting integrals to intervals in ANSI mode

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40008:


Assignee: Max Gekk  (was: Apache Spark)

> Support casting integrals to intervals in ANSI mode
> ---
>
> Key: SPARK-40008
> URL: https://issues.apache.org/jira/browse/SPARK-40008
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Attachments: Screenshot 2022-06-12 at 13.04.44.png
>
>
> To conform to the SQL standard, support casting of interval types to *INT, see 
> the attached screenshot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40008) Support casting integrals to intervals in ANSI mode

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576886#comment-17576886
 ] 

Apache Spark commented on SPARK-40008:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37442

> Support casting integrals to intervals in ANSI mode
> ---
>
> Key: SPARK-40008
> URL: https://issues.apache.org/jira/browse/SPARK-40008
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Attachments: Screenshot 2022-06-12 at 13.04.44.png
>
>
> To conform to the SQL standard, support casting of interval types to *INT, see 
> the attached screenshot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40008) Support casting integrals to intervals in ANSI mode

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40008:


Assignee: Apache Spark  (was: Max Gekk)

> Support casting integrals to intervals in ANSI mode
> ---
>
> Key: SPARK-40008
> URL: https://issues.apache.org/jira/browse/SPARK-40008
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screenshot 2022-06-12 at 13.04.44.png
>
>
> To conform to the SQL standard, support casting of interval types to *INT, see 
> the attached screenshot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40008) Support casting integrals to intervals in ANSI mode

2022-08-08 Thread Max Gekk (Jira)
Max Gekk created SPARK-40008:


 Summary: Support casting integrals to intervals in ANSI mode
 Key: SPARK-40008
 URL: https://issues.apache.org/jira/browse/SPARK-40008
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk
 Fix For: 3.4.0
 Attachments: Screenshot 2022-06-12 at 13.04.44.png

To conform to the SQL standard, support casting of interval types to *INT, see the 
attached screenshot.
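[Editorial sketch; the feature is only being proposed here, so the statements and their outputs are hypothetical] Per the title, the request is to allow casts from integral values to ANSI interval types, e.g.:

{noformat}
SET spark.sql.ansi.enabled=true;
SELECT CAST(5 AS INTERVAL YEAR);    -- hypothetically: INTERVAL '5' YEAR
SELECT CAST(90 AS INTERVAL SECOND); -- hypothetically: INTERVAL '90' SECOND
{noformat}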




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40008) Support casting integrals to intervals in ANSI mode

2022-08-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40008:
-
Fix Version/s: (was: 3.4.0)

> Support casting integrals to intervals in ANSI mode
> ---
>
> Key: SPARK-40008
> URL: https://issues.apache.org/jira/browse/SPARK-40008
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Attachments: Screenshot 2022-06-12 at 13.04.44.png
>
>
> To conform to the SQL standard, support casting of interval types to *INT, see 
> the attached screenshot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39828) Catalog.listTables() should respect currentCatalog

2022-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39828:
---

Assignee: Wenchen Fan

> Catalog.listTables() should respect currentCatalog
> --
>
> Key: SPARK-39828
> URL: https://issues.apache.org/jira/browse/SPARK-39828
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39912) Refine CatalogImpl

2022-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39912.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37287
[https://github.com/apache/spark/pull/37287]

> Refine CatalogImpl
> --
>
> Key: SPARK-39912
> URL: https://issues.apache.org/jira/browse/SPARK-39912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39828) Catalog.listTables() should respect currentCatalog

2022-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39828.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37287
[https://github.com/apache/spark/pull/37287]

> Catalog.listTables() should respect currentCatalog
> --
>
> Key: SPARK-39828
> URL: https://issues.apache.org/jira/browse/SPARK-39828
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39912) Refine CatalogImpl

2022-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39912:
---

Assignee: Wenchen Fan

> Refine CatalogImpl
> --
>
> Key: SPARK-39912
> URL: https://issues.apache.org/jira/browse/SPARK-39912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35304) [k8s] Though finishing a job, the driver pod is running infinitely

2022-08-08 Thread Emilie Lin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576866#comment-17576866
 ] 

Emilie Lin commented on SPARK-35304:


Hi [~ocworld], do you have any updates on this issue?

> [k8s] Though finishing a job, the driver pod is running infinitely
> --
>
> Key: SPARK-35304
> URL: https://issues.apache.org/jira/browse/SPARK-35304
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2, 3.1.1
>Reporter: Keunhyun Oh
>Priority: Major
>
> Even though the job has finished, the driver pod keeps running indefinitely.
> Executors are terminated; however, the driver status is not changed to 
> succeeded.
> This does not happen with Spark 2 on K8s.
> It only appears on Spark 3.
>  
> My JVM thread dump is as follows:
> {code:java}
> 2021-05-04 15:11:37
> Full thread dump OpenJDK 64-Bit Server VM (25.252-b09 mixed mode):
> "Attach Listener" #182 daemon prio=9 os_prio=0 tid=0x7f02bc001000 
> nid=0x106 waiting on condition [0x]
>java.lang.Thread.State: RUNNABLE
>Locked ownable synchronizers:
>   - None
> "DestroyJavaVM" #179 prio=5 os_prio=0 tid=0x7f0fe0017000 nid=0x35 waiting 
> on condition [0x]
>java.lang.Thread.State: RUNNABLE
>Locked ownable synchronizers:
>   - None
> "s3a-transfer-unbounded-pool2-t1" #172 daemon prio=5 os_prio=0 
> tid=0x7f025d98d000 nid=0xe5 waiting on condition [0x7f01f86f3000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f0353681b38> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>Locked ownable synchronizers:
>   - None
> "java-sdk-progress-listener-callback-thread" #169 daemon prio=5 os_prio=0 
> tid=0x7f002000f000 nid=0xe2 waiting on condition [0x7f004f7f6000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f0bdb1ba7c0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>Locked ownable synchronizers:
>   - None
> "pool-26-thread-1" #72 prio=5 os_prio=0 tid=0x7f025c829000 nid=0x80 
> waiting on condition [0x7f01ba931000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f0bfdeaa8f0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>Locked ownable synchronizers:
>   - None
> "java-sdk-http-connection-reaper" #56 daemon prio=5 os_prio=0 
> tid=0x7f025d818000 nid=0x6e waiting on condition [0x7f01fb9fe000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> com.amazonaws.http.IdleConnectionReaper.run(IdleConnectionReaper.java:188)
>Locked ownable synchron

[jira] [Commented] (SPARK-39994) How to write (save) PySpark dataframe containing vector column?

2022-08-08 Thread Muhammad Kaleem Ullah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576853#comment-17576853
 ] 

Muhammad Kaleem Ullah commented on SPARK-39994:
---

I would like to request that this be included in the PySpark package itself, so 
that PySpark ships as a full-fledged package with all functionality (including 
the ability to write such dataframes) and we don't have to spend days on it. 
It's a humble request.

Thanks

> How to write (save) PySpark dataframe containing vector column?
> ---
>
> Key: SPARK-39994
> URL: https://issues.apache.org/jira/browse/SPARK-39994
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Muhammad Kaleem Ullah
>Priority: Major
> Attachments: df.PNG, error.PNG
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I'm trying to save the PySpark dataframe after transforming it using an ML 
> Pipeline. But when I save it, a weird error is triggered every time. Here 
> are the columns of this dataframe:
> |-- label: integer (nullable = true)
> |-- dest_index: double (nullable = false)
> |-- dest_fact: vector (nullable = true)
> |-- carrier_index: double (nullable = false)
> |-- carrier_fact: vector (nullable = true)
> |-- features: vector (nullable = true)
> And the following error occurs when trying to save this dataframe that 
> contains vector data:
> {code:java}
> // training.write.parquet("training_files.parquet", mode = "overwrite") {code}
> {noformat}
> Py4JJavaError: An error occurred while calling o440.parquet. : 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:278)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> ...
> {noformat}
>  
> I tried different available {{winutils}} builds for Hadoop from [this 
> GitHub repository|https://github.com/cdarlint/winutils] but without much 
> luck. Please help me in this regard. How can I save this dataframe so that I 
> can read it in any other jupyter notebook file? Feel free to ask any 
> questions. Thanks
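[Editorial suggestion, not from the report] A quick way to check whether the vector columns are actually implicated is to write the frame once without them; the column names follow the schema listed above:

{code:python}
# Sketch only: if this write also fails, the problem is likely environmental
# (e.g. Hadoop/winutils on Windows) rather than the vector columns themselves.
training.drop("dest_fact", "carrier_fact", "features") \
    .write.mode("overwrite").parquet("training_no_vectors.parquet")
{code}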



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39994) How to write (save) PySpark dataframe containing vector column?

2022-08-08 Thread Muhammad Kaleem Ullah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576851#comment-17576851
 ] 

Muhammad Kaleem Ullah commented on SPARK-39994:
---

Hi [~hyukjin.kwon], here is the full stack trace:
 
---
Py4JJavaError                             Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_4448\2574092106.py in ()
----> 1 training_df.write.format("parquet").mode("overwrite").save("training_data")

~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\sql\readwriter.py in save(self, path, format, mode, partitionBy, **options)
    966             self._jwrite.save()
    967         else:
--> 968             self._jwrite.save(path)
    969 
    970     @since(1.4)

~\AppData\Local\Programs\Python\Python310\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1319 
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)
   1323 

~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
    188     def deco(*a: Any, **kw: Any) -> Any:
    189         try:
--> 190             return f(*a, **kw)
    191         except Py4JJavaError as e:
    192             converted = convert_exception(e.java_exception)

~\AppData\Local\Programs\Python\Python310\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325         if answer[1] == REFERENCE_TYPE:
--> 326             raise Py4JJavaError(
    327                 "An error occurred while calling {0}{1}{2}.\n".
    328                 format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o357.save.
: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:278)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
at 
org.apache.spark.sql.execution.Q

[jira] [Commented] (SPARK-39965) Skip PVC cleanup when driver doesn't own PVCs

2022-08-08 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576793#comment-17576793
 ] 

pralabhkumar commented on SPARK-39965:
--

[~dongjoon] Thanks for taking this. This is really helpful.

> Skip PVC cleanup when driver doesn't own PVCs
> -
>
> Key: SPARK-39965
> URL: https://issues.apache.org/jira/browse/SPARK-39965
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: pralabhkumar
>Priority: Trivial
>
> Since Spark 3.2, as part of [https://github.com/apache/spark/pull/32288], 
> functionality was added to delete PVCs if the Spark driver died. 
> [https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144]
>  
> However, there are cases where Spark on K8s doesn't use PVCs and uses host 
> paths for storage. 
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]
>  
> Now, in those cases:
>  * it requests deletion of PVCs (which is not required);
>  * it also tries to delete them when the driver doesn't own the PVs (or 
> spark.kubernetes.driver.ownPersistentVolumeClaim is false);
>  * moreover, in clusters where the Spark user doesn't have access to list or 
> delete PVCs, it throws an exception:
>  
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> GET at: 
> [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1].
>  Message: Forbidden!Configured service account doesn't have access. Service 
> account may have been revoked. persistentvolumeclaims is forbidden: User 
> "system:serviceaccount:dpi-dev:spark" cannot list resource 
> "persistentvolumeclaims" in API group "" in the namespace "<>".
>  
> *Solution*
> Ideally there should be a configuration such as 
> spark.kubernetes.driver.pvc.deleteOnTermination, or 
> spark.kubernetes.driver.ownPersistentVolumeClaim should be checked 
> before calling the API to delete PVCs. If the user has not set up PVs, or the 
> driver doesn't own them, there is no need to call the API and delete PVCs.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param

2022-08-08 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-39976:
--
Labels: correctness  (was: )

> NULL check in ArrayIntersect adds extraneous null from first param
> --
>
> Key: SPARK-39976
> URL: https://issues.apache.org/jira/browse/SPARK-39976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Navin Kumar
>Priority: Major
>  Labels: correctness
>
> This is very likely a regression from SPARK-36829.
> When using {{array_intersect(a, b)}}, if the first parameter contains a 
> {{NULL}} value and the second one does not, an extraneous {{NULL}} is present 
> in the output. This also leads to {{array_intersect(a, b) != 
> array_intersect(b, a)}} which is incorrect as set intersection should be 
> commutative.
> Example using PySpark:
> {code:python}
> >>> a = [1, 2, 3]
> >>> b = [3, None, 5]
> >>> data = [(a, b)]
> >>> df = spark.sparkContext.parallelize(data).toDF(["a","b"])
> >>> df.show()
> +-++
> |a|   b|
> +-++
> |[1, 2, 3]|[3, null, 5]|
> +-++
> >>> df.selectExpr("array_intersect(a,b)").show()
> +-+
> |array_intersect(a, b)|
> +-+
> |  [3]|
> +-+
> >>> df.selectExpr("array_intersect(b,a)").show()
> +-+
> |array_intersect(b, a)|
> +-+
> |[3, null]|
> +-+
> {code}
> Note that in the first case, {{a}} does not contain a {{NULL}}, and the final 
> output is correct: {{[3]}}. In the second case, since {{b}} does contain 
> {{NULL}} and is now the first parameter, the extraneous {{NULL}} appears in 
> the output.
> The same behavior occurs in Scala when writing to Parquet:
> {code:scala}
> scala> val a = Array[java.lang.Integer](1, 2, null, 4)
> a: Array[Integer] = Array(1, 2, null, 4)
> scala> val b = Array[java.lang.Integer](4, 5, 6, 7)
> b: Array[Integer] = Array(4, 5, 6, 7)
> scala> val df = Seq((a, b)).toDF("a","b")
> df: org.apache.spark.sql.DataFrame = [a: array, b: array]
> scala> df.write.parquet("/tmp/simple.parquet")
> scala> val df = spark.read.parquet("/tmp/simple.parquet")
> df: org.apache.spark.sql.DataFrame = [a: array, b: array]
> scala> df.show()
> +---++
> |  a|   b|
> +---++
> |[1, 2, null, 4]|[4, 5, 6, 7]|
> +---++
> scala> df.selectExpr("array_intersect(a,b)").show()
> +-+
> |array_intersect(a, b)|
> +-+
> |[null, 4]|
> +-+
> scala> df.selectExpr("array_intersect(b,a)").show()
> +-+
> |array_intersect(b, a)|
> +-+
> |  [4]|
> +-+
> {code}
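[Editorial note] Since set intersection is commutative, both orderings should return the same elements; a plain-Python sanity check of the expectation (not the buggy Spark output):

{code:python}
>>> set([3, None, 5]) & set([1, 2, 3])
{3}
{code}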



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576761#comment-17576761
 ] 

Apache Spark commented on SPARK-36663:
--

User 'mcdull-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37440

> When the existing field name is a number, an error will be reported when 
> reading the orc file
> -
>
> Key: SPARK-36663
> URL: https://issues.apache.org/jira/browse/SPARK-36663
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: mcdull_zhang
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2021-09-03-20-56-28-846.png
>
>
> You can use the following methods to reproduce the problem:
> {quote}val path = "file:///tmp/test_orc"
> spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)
> spark.read.orc(path)
> {quote}
> The error message is like this:
> {quote}org.apache.spark.sql.catalyst.parser.ParseException:
>  mismatched input '100' expecting {'ADD', 'AFTER'
> == SQL ==
>  struct<100:bigint>
>  ---^^^
> {quote}
> The error is actually issued by this line of code:
> {quote}CatalystSqlParser.parseDataType("100:bigint")
> {quote}
>  
> The specific background is that Spark calls the above code in the process of 
> converting the schema of the ORC file into the Catalyst schema:
> {quote}// code in OrcUtils
>  private def toCatalystSchema(schema: TypeDescription): StructType = {
>    CharVarcharUtils.replaceCharVarcharWithStringInSchema(
>      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
>  }{quote}
> There are two solutions I currently think of:
>  # Modify the syntax analysis of SparkSQL to identify this kind of schema
>  # The TypeDescription.toString method should add the quote symbol to the 
> numeric column name, because the following syntax is supported:
> {quote}CatalystSqlParser.parseDataType("`100`:bigint")
> {quote}
> But currently TypeDescription does not support changing the UNQUOTED_NAMES 
> variable; should we first submit a PR to the ORC project to support 
> configuring this variable?
> !image-2021-09-03-20-56-28-846.png!
>  
> What do Spark members think about this issue?
>  
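[Editorial sketch, not the merged fix] Solution 2 could also be approximated on the Spark side by quoting field names before handing the schema string to the parser, since the parser accepts backquoted numeric names; TypeDescription#getFieldNames and #getChildren are existing ORC APIs, everything else here is an assumption:

{code:scala}
// Sketch only: quote each top-level field name so numeric names parse.
// Nested structs with numeric names would still need the same treatment.
import scala.collection.JavaConverters._

private def toCatalystSchema(schema: TypeDescription): StructType = {
  val quoted = schema.getFieldNames.asScala.zip(schema.getChildren.asScala)
    .map { case (name, child) => s"`$name`:${child.toString}" }
    .mkString("struct<", ",", ">")
  CharVarcharUtils.replaceCharVarcharWithStringInSchema(
    CatalystSqlParser.parseDataType(quoted).asInstanceOf[StructType])
}
{code}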



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576759#comment-17576759
 ] 

Apache Spark commented on SPARK-36663:
--

User 'mcdull-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37440

> When the existing field name is a number, an error will be reported when 
> reading the orc file
> -
>
> Key: SPARK-36663
> URL: https://issues.apache.org/jira/browse/SPARK-36663
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: mcdull_zhang
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2021-09-03-20-56-28-846.png
>
>
> You can use the following methods to reproduce the problem:
> {quote}val path = "file:///tmp/test_orc"
> spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)
> spark.read.orc(path)
> {quote}
> The error message is like this:
> {quote}org.apache.spark.sql.catalyst.parser.ParseException:
>  mismatched input '100' expecting {'ADD', 'AFTER'
> == SQL ==
>  struct<100:bigint>
>  ---^^^
> {quote}
> The error is actually issued by this line of code:
> {quote}CatalystSqlParser.parseDataType("100:bigint")
> {quote}
>  
> The specific background is that Spark calls the above code in the process of 
> converting the schema of the ORC file into the Catalyst schema:
> {quote}// code in OrcUtils
>  private def toCatalystSchema(schema: TypeDescription): StructType = {
>    CharVarcharUtils.replaceCharVarcharWithStringInSchema(
>      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
>  }{quote}
> There are two solutions I currently think of:
>  # Modify the syntax analysis of SparkSQL to identify this kind of schema
>  # The TypeDescription.toString method should add the quote symbol to the 
> numeric column name, because the following syntax is supported:
> {quote}CatalystSqlParser.parseDataType("`100`:bigint")
> {quote}
> But currently TypeDescription does not support changing the UNQUOTED_NAMES 
> variable; should we first submit a PR to the ORC project to support 
> configuring this variable?
> !image-2021-09-03-20-56-28-846.png!
>  
> What do Spark members think about this issue?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39896:


Assignee: (was: Apache Spark)

> The structural integrity of the plan is broken after 
> UnwrapCastInBinaryComparison
> -
>
> Key: SPARK-39896
> URL: https://issues.apache.org/jira/browse/SPARK-39896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> sql("create table t1(a decimal(3, 0)) using parquet")
> sql("insert into t1 values(100), (10), (1)")
> sql("select * from t1 where a in(10, 10, 0, 1.00)").show
> {code}
> {noformat}
> After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
> java.lang.RuntimeException: After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39896:


Assignee: Apache Spark

> The structural integrity of the plan is broken after 
> UnwrapCastInBinaryComparison
> -
>
> Key: SPARK-39896
> URL: https://issues.apache.org/jira/browse/SPARK-39896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {code:scala}
> sql("create table t1(a decimal(3, 0)) using parquet")
> sql("insert into t1 values(100), (10), (1)")
> sql("select * from t1 where a in(10, 10, 0, 1.00)").show
> {code}
> {noformat}
> After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
> java.lang.RuntimeException: After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576754#comment-17576754
 ] 

Apache Spark commented on SPARK-39896:
--

User 'cfmcgrady' has created a pull request for this issue:
https://github.com/apache/spark/pull/37439

> The structural integrity of the plan is broken after 
> UnwrapCastInBinaryComparison
> -
>
> Key: SPARK-39896
> URL: https://issues.apache.org/jira/browse/SPARK-39896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> sql("create table t1(a decimal(3, 0)) using parquet")
> sql("insert into t1 values(100), (10), (1)")
> sql("select * from t1 where a in(10, 10, 0, 1.00)").show
> {code}
> {noformat}
> After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
> java.lang.RuntimeException: After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40007) Add Mode to PySpark

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576747#comment-17576747
 ] 

Apache Spark commented on SPARK-40007:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37438

> Add Mode to PySpark
> ---
>
> Key: SPARK-40007
> URL: https://issues.apache.org/jira/browse/SPARK-40007
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40007) Add Mode to PySpark

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40007:


Assignee: (was: Apache Spark)

> Add Mode to PySpark
> ---
>
> Key: SPARK-40007
> URL: https://issues.apache.org/jira/browse/SPARK-40007
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40007) Add Mode to PySpark

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40007:


Assignee: Apache Spark

> Add Mode to PySpark
> ---
>
> Key: SPARK-40007
> URL: https://issues.apache.org/jira/browse/SPARK-40007
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40007) Add Mode to PySpark

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576746#comment-17576746
 ] 

Apache Spark commented on SPARK-40007:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37438

> Add Mode to PySpark
> ---
>
> Key: SPARK-40007
> URL: https://issues.apache.org/jira/browse/SPARK-40007
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40007) Add Mode to PySpark

2022-08-08 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40007:
-

 Summary: Add Mode to PySpark
 Key: SPARK-40007
 URL: https://issues.apache.org/jira/browse/SPARK-40007
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40006) Make pyspark.sql.group examples self-contained

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40006:


Assignee: (was: Apache Spark)

> Make pyspark.sql.group examples self-contained
> --
>
> Key: SPARK-40006
> URL: https://issues.apache.org/jira/browse/SPARK-40006
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40006) Make pyspark.sql.group examples self-contained

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576744#comment-17576744
 ] 

Apache Spark commented on SPARK-40006:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37437

> Make pyspark.sql.group examples self-contained
> --
>
> Key: SPARK-40006
> URL: https://issues.apache.org/jira/browse/SPARK-40006
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40006) Make pyspark.sql.group examples self-contained

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40006:


Assignee: Apache Spark

> Make pyspark.sql.group examples self-contained
> --
>
> Key: SPARK-40006
> URL: https://issues.apache.org/jira/browse/SPARK-40006
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40005) Self-contained examples with parameter descriptions in PySpark documentation

2022-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40005:
-
Target Version/s: 3.4.0

> Self-contained examples with parameter descriptions in PySpark documentation
> 
>
> Key: SPARK-40005
> URL: https://issues.apache.org/jira/browse/SPARK-40005
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims to improve PySpark documentation in:
> - {{pyspark}}
> - {{pyspark.ml}}
> - {{pyspark.sql}}
> - {{pyspark.sql.streaming}}
> We should:
> - Make the examples self-contained, e.g., 
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
> - Document {{Parameters}} 
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot.
>  There are many APIs that are missing parameter descriptions in PySpark, e.g., 
> [DataFrame.union|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.union.html#pyspark.sql.DataFrame.union]
> If a file is large, e.g., dataframe.py, we should split the work into separate 
> subtasks and improve the documentation.
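As a rough illustration of what "self-contained" means here (names and data below are illustrative, not from the ticket), an example should create its own session and input so it can be copy-pasted and run as-is:

{code:python}
# A minimal sketch of a self-contained docstring example: it builds the
# SparkSession and the input DataFrame it needs, so nothing outside the
# snippet is assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()
{code}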



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries

2022-08-08 Thread Oleksandr Shevchenko (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576731#comment-17576731
 ] 

Oleksandr Shevchenko commented on SPARK-39995:
--

Thanks [~hyukjin.kwon] for your reply. 
What do you think about supporting package managers like 
[Poetry|https://python-poetry.org/]? 
Would it be possible to add a parameter, or to include the Scala version in the 
package name, so that the Scala 2.13 build of Spark can be installed? Package 
managers don't support using env vars for this kind of configuration.

> PySpark installation doesn't support Scala 2.13 binaries
> 
>
> Key: SPARK-39995
> URL: https://issues.apache.org/jira/browse/SPARK-39995
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Oleksandr Shevchenko
>Priority: Major
>
> [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary 
> [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi]
>  for Scala 2.13.
> Currently, the setup 
> [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py]
>  allows setting the versions of Spark, Hadoop (PYSPARK_HADOOP_VERSION), and the 
> mirror (PYSPARK_RELEASE_MIRROR) used to download the Spark binaries, but the 
> binaries are always Scala 2.12-compatible. There isn't any parameter to download 
> "spark-3.3.0-bin-hadoop3-scala2.13.tgz".
> It's possible to download Spark manually and set the needed SPARK_HOME, but 
> it's hard to use with pip or Poetry.
> Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and the 
> CLI, but they cannot be used with package managers like Poetry.
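For context, a hedged sketch of how the env-var approach described above is driven today; the variable names come from the issue text, while the concrete values and the subprocess invocation are assumptions, and there is currently no equivalent variable for selecting Scala 2.13 binaries:

{code:python}
# Hedged sketch: driving `pip install pyspark` with the documented env vars.
# PYSPARK_HADOOP_VERSION and PYSPARK_RELEASE_MIRROR are from the issue text;
# the concrete values and the subprocess call are illustrative only.
import os
import subprocess

env = dict(
    os.environ,
    PYSPARK_HADOOP_VERSION="3",
    PYSPARK_RELEASE_MIRROR="https://downloads.apache.org/spark",
)
subprocess.run(["pip", "install", "pyspark==3.3.0"], env=env, check=True)
{code}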



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39976:


Assignee: Apache Spark

> NULL check in ArrayIntersect adds extraneous null from first param
> --
>
> Key: SPARK-39976
> URL: https://issues.apache.org/jira/browse/SPARK-39976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Navin Kumar
>Assignee: Apache Spark
>Priority: Major
>
> This is very likely a regression from SPARK-36829.
> When using {{array_intersect(a, b)}}, if the first parameter contains a 
> {{NULL}} value and the second one does not, an extraneous {{NULL}} is present 
> in the output. This also leads to {{array_intersect(a, b) != 
> array_intersect(b, a)}} which is incorrect as set intersection should be 
> commutative.
> Example using PySpark:
> {code:python}
> >>> a = [1, 2, 3]
> >>> b = [3, None, 5]
> >>> df = spark.sparkContext.parallelize([(a, b)]).toDF(["a","b"])
> >>> df.show()
> +---------+------------+
> |        a|           b|
> +---------+------------+
> |[1, 2, 3]|[3, null, 5]|
> +---------+------------+
> >>> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |                  [3]|
> +---------------------+
> >>> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |            [3, null]|
> +---------------------+
> {code}
> Note that in the first case, {{a}} does not contain a {{NULL}}, and the final 
> output is correct: {{[3]}}. In the second case, {{b}}, which does contain 
> {{NULL}}, is now the first parameter, and the extraneous {{null}} shows up in 
> the output.
> The same behavior occurs in Scala when writing to Parquet:
> {code:scala}
> scala> val a = Array[java.lang.Integer](1, 2, null, 4)
> a: Array[Integer] = Array(1, 2, null, 4)
> scala> val b = Array[java.lang.Integer](4, 5, 6, 7)
> b: Array[Integer] = Array(4, 5, 6, 7)
> scala> val df = Seq((a, b)).toDF("a","b")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.write.parquet("/tmp/simple.parquet")
> scala> val df = spark.read.parquet("/tmp/simple.parquet")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.show()
> +---------------+------------+
> |              a|           b|
> +---------------+------------+
> |[1, 2, null, 4]|[4, 5, 6, 7]|
> +---------------+------------+
> scala> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |            [null, 4]|
> +---------------------+
> scala> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |                  [4]|
> +---------------------+
> {code}
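Until a fix lands, a hedged workaround sketch (reusing the df with columns "a" and "b" from the example above) is to strip nulls from both arrays before intersecting, so the result no longer depends on argument order; note this deliberately drops a null that a correct intersection would keep when both sides contain one:

{code:python}
# Hedged workaround sketch, not the actual fix: remove nulls from both array
# columns before calling array_intersect so the result is order-independent.
from pyspark.sql import functions as F

a_no_null = F.filter("a", lambda x: x.isNotNull())
b_no_null = F.filter("b", lambda x: x.isNotNull())
df.select(
    F.array_intersect(a_no_null, b_no_null).alias("a_intersect_b"),
    F.array_intersect(b_no_null, a_no_null).alias("b_intersect_a"),
).show()
{code}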



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576730#comment-17576730
 ] 

Apache Spark commented on SPARK-39976:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/37436

> NULL check in ArrayIntersect adds extraneous null from first param
> --
>
> Key: SPARK-39976
> URL: https://issues.apache.org/jira/browse/SPARK-39976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Navin Kumar
>Priority: Major
>
> This is very likely a regression from SPARK-36829.
> When using {{array_intersect(a, b)}}, if the first parameter contains a 
> {{NULL}} value and the second one does not, an extraneous {{NULL}} is present 
> in the output. This also leads to {{array_intersect(a, b) != 
> array_intersect(b, a)}} which is incorrect as set intersection should be 
> commutative.
> Example using PySpark:
> {code:python}
> >>> a = [1, 2, 3]
> >>> b = [3, None, 5]
> >>> df = spark.sparkContext.parallelize([(a, b)]).toDF(["a","b"])
> >>> df.show()
> +---------+------------+
> |        a|           b|
> +---------+------------+
> |[1, 2, 3]|[3, null, 5]|
> +---------+------------+
> >>> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |                  [3]|
> +---------------------+
> >>> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |            [3, null]|
> +---------------------+
> {code}
> Note that in the first case, {{a}} does not contain a {{NULL}}, and the final 
> output is correct: {{[3]}}. In the second case, {{b}}, which does contain 
> {{NULL}}, is now the first parameter, and the extraneous {{null}} shows up in 
> the output.
> The same behavior occurs in Scala when writing to Parquet:
> {code:scala}
> scala> val a = Array[java.lang.Integer](1, 2, null, 4)
> a: Array[Integer] = Array(1, 2, null, 4)
> scala> val b = Array[java.lang.Integer](4, 5, 6, 7)
> b: Array[Integer] = Array(4, 5, 6, 7)
> scala> val df = Seq((a, b)).toDF("a","b")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.write.parquet("/tmp/simple.parquet")
> scala> val df = spark.read.parquet("/tmp/simple.parquet")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.show()
> +---------------+------------+
> |              a|           b|
> +---------------+------------+
> |[1, 2, null, 4]|[4, 5, 6, 7]|
> +---------------+------------+
> scala> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |            [null, 4]|
> +---------------------+
> scala> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |                  [4]|
> +---------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39976:


Assignee: (was: Apache Spark)

> NULL check in ArrayIntersect adds extraneous null from first param
> --
>
> Key: SPARK-39976
> URL: https://issues.apache.org/jira/browse/SPARK-39976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Navin Kumar
>Priority: Major
>
> This is very likely a regression from SPARK-36829.
> When using {{array_intersect(a, b)}}, if the first parameter contains a 
> {{NULL}} value and the second one does not, an extraneous {{NULL}} is present 
> in the output. This also leads to {{array_intersect(a, b) != 
> array_intersect(b, a)}} which is incorrect as set intersection should be 
> commutative.
> Example using PySpark:
> {code:python}
> >>> a = [1, 2, 3]
> >>> b = [3, None, 5]
> >>> df = spark.sparkContext.parallelize([(a, b)]).toDF(["a","b"])
> >>> df.show()
> +---------+------------+
> |        a|           b|
> +---------+------------+
> |[1, 2, 3]|[3, null, 5]|
> +---------+------------+
> >>> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |                  [3]|
> +---------------------+
> >>> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |            [3, null]|
> +---------------------+
> {code}
> Note that in the first case, {{a}} does not contain a {{NULL}}, and the final 
> output is correct: {{[3]}}. In the second case, {{b}}, which does contain 
> {{NULL}}, is now the first parameter, and the extraneous {{null}} shows up in 
> the output.
> The same behavior occurs in Scala when writing to Parquet:
> {code:scala}
> scala> val a = Array[java.lang.Integer](1, 2, null, 4)
> a: Array[Integer] = Array(1, 2, null, 4)
> scala> val b = Array[java.lang.Integer](4, 5, 6, 7)
> b: Array[Integer] = Array(4, 5, 6, 7)
> scala> val df = Seq((a, b)).toDF("a","b")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.write.parquet("/tmp/simple.parquet")
> scala> val df = spark.read.parquet("/tmp/simple.parquet")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.show()
> +---------------+------------+
> |              a|           b|
> +---------------+------------+
> |[1, 2, null, 4]|[4, 5, 6, 7]|
> +---------------+------------+
> scala> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |            [null, 4]|
> +---------------------+
> scala> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |                  [4]|
> +---------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param

2022-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576729#comment-17576729
 ] 

Apache Spark commented on SPARK-39976:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/37436

> NULL check in ArrayIntersect adds extraneous null from first param
> --
>
> Key: SPARK-39976
> URL: https://issues.apache.org/jira/browse/SPARK-39976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Navin Kumar
>Priority: Major
>
> This is very likely a regression from SPARK-36829.
> When using {{array_intersect(a, b)}}, if the first parameter contains a 
> {{NULL}} value and the second one does not, an extraneous {{NULL}} is present 
> in the output. This also leads to {{array_intersect(a, b) != 
> array_intersect(b, a)}} which is incorrect as set intersection should be 
> commutative.
> Example using PySpark:
> {code:python}
> >>> a = [1, 2, 3]
> >>> b = [3, None, 5]
> >>> df = spark.sparkContext.parallelize([(a, b)]).toDF(["a","b"])
> >>> df.show()
> +---------+------------+
> |        a|           b|
> +---------+------------+
> |[1, 2, 3]|[3, null, 5]|
> +---------+------------+
> >>> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |                  [3]|
> +---------------------+
> >>> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |            [3, null]|
> +---------------------+
> {code}
> Note that in the first case, {{a}} does not contain a {{NULL}}, and the final 
> output is correct: {{[3]}}. In the second case, {{b}}, which does contain 
> {{NULL}}, is now the first parameter, and the extraneous {{null}} shows up in 
> the output.
> The same behavior occurs in Scala when writing to Parquet:
> {code:scala}
> scala> val a = Array[java.lang.Integer](1, 2, null, 4)
> a: Array[Integer] = Array(1, 2, null, 4)
> scala> val b = Array[java.lang.Integer](4, 5, 6, 7)
> b: Array[Integer] = Array(4, 5, 6, 7)
> scala> val df = Seq((a, b)).toDF("a","b")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.write.parquet("/tmp/simple.parquet")
> scala> val df = spark.read.parquet("/tmp/simple.parquet")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.show()
> +---------------+------------+
> |              a|           b|
> +---------------+------------+
> |[1, 2, null, 4]|[4, 5, 6, 7]|
> +---------------+------------+
> scala> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |            [null, 4]|
> +---------------------+
> scala> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |                  [4]|
> +---------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40006) Make pyspark.sql.group examples self-contained

2022-08-08 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40006:


 Summary: Make pyspark.sql.group examples self-contained
 Key: SPARK-40006
 URL: https://issues.apache.org/jira/browse/SPARK-40006
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40005) Self-contained examples with parameter descriptions in PySpark documentation

2022-08-08 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40005:


 Summary: Self-contained examples with parameter descriptions in 
PySpark documentation
 Key: SPARK-40005
 URL: https://issues.apache.org/jira/browse/SPARK-40005
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


This JIRA aims to improve PySpark documentation in:
- {{pyspark}}
- {{pyspark.ml}}
- {{pyspark.sql}}
- {{pyspark.sql.streaming}}

We should:
- Make the examples self-contained, e.g., 
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
- Document {{Parameters}} 
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot.
 There are many APIs that are missing parameter descriptions in PySpark, e.g., 
[DataFrame.union|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.union.html#pyspark.sql.DataFrame.union]

If a file is large, e.g., dataframe.py, we should split the work into separate 
subtasks and improve the documentation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39973) Avoid noisy warnings logs when spark.scheduler.listenerbus.metrics.maxListenerClassesTimed = 0

2022-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39973.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37432
[https://github.com/apache/spark/pull/37432]

> Avoid noisy warnings logs when 
> spark.scheduler.listenerbus.metrics.maxListenerClassesTimed = 0
> --
>
> Key: SPARK-39973
> URL: https://issues.apache.org/jira/browse/SPARK-39973
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>
> If {{spark.scheduler.listenerbus.metrics.maxListenerClassesTimed}} has been 
> set to {{0}} to disable listener timers then listener registration will 
> trigger noisy warnings like
> {code:java}
> LiveListenerBusMetrics: Not measuring processing time for listener class 
> org.apache.spark.sql.util.ExecutionListenerBus because a maximum of 0 
> listener classes are already timed.{code}
> We should change the code to not print this warning when 
> maxListenerClassesTimed = 0.
> I don't plan to work on this myself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39973) Avoid noisy warnings logs when spark.scheduler.listenerbus.metrics.maxListenerClassesTimed = 0

2022-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39973:


Assignee: Hyukjin Kwon

> Avoid noisy warnings logs when 
> spark.scheduler.listenerbus.metrics.maxListenerClassesTimed = 0
> --
>
> Key: SPARK-39973
> URL: https://issues.apache.org/jira/browse/SPARK-39973
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> If {{spark.scheduler.listenerbus.metrics.maxListenerClassesTimed}} has been 
> set to {{0}} to disable listener timers then listener registration will 
> trigger noisy warnings like
> {code:java}
> LiveListenerBusMetrics: Not measuring processing time for listener class 
> org.apache.spark.sql.util.ExecutionListenerBus because a maximum of 0 
> listener classes are already timed.{code}
> We should change the code to not print this warning when 
> maxListenerClassesTimed = 0.
> I don't plan to work on this myself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39753) Broadcast joins should pushdown join constraints as Filter to the larger relation

2022-08-08 Thread Nick Dimiduk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576643#comment-17576643
 ] 

Nick Dimiduk commented on SPARK-39753:
--

Linking to the original issue.

> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> -
>
> Key: SPARK-39753
> URL: https://issues.apache.org/jira/browse/SPARK-39753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Victor Delépine
>Priority: Major
>
> SPARK-19609 was bulk-closed a while ago, but not fixed. I've decided to 
> re-open it here for more visibility, since I believe this bug has a major 
> impact and that fixing it could drastically improve the performance of many 
> pipelines.
> Allow me to paste the initial description again here:
> _For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} 
> clause. An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks._
> _This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results._
>  
> Essentially, when doing a Broadcast join, the smaller side can be used to 
> filter down the bigger side before performing the join. As of today, the join 
> will read all partitions of the bigger side, without pruning partitions.
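As a hedged sketch of what the requested optimization amounts to when done by hand (table and column names below are assumptions, not from this ticket):

{code:python}
# Hedged manual version of the proposed pushdown: collect the join keys of the
# small (broadcastable) side and apply them as an IN filter on the large side
# before joining, so the source can prune partitions. Names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
small = spark.table("dim_small")   # small enough to materialize on the driver
large = spark.table("fact_large")  # large, partitioned relation

keys = [r["a"] for r in small.select("a").distinct().collect()]
pruned = large.where(F.col("a").isin(keys))
result = pruned.join(F.broadcast(small), on="a", how="inner")
{code}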



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40004) Redundant `LevelDB.get` in `RemoteBlockPushResolver`

2022-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40004:


Assignee: Apache Spark

> Redundant `LevelDB.get` in `RemoteBlockPushResolver`
> 
>
> Key: SPARK-40004
> URL: https://issues.apache.org/jira/browse/SPARK-40004
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> {code:java}
> void removeAppAttemptPathInfoFromDB(String appId, int attemptId) {
>   AppAttemptId appAttemptId = new AppAttemptId(appId, attemptId);
>   if (db != null) {
>     try {
>       byte[] key = getDbAppAttemptPathsKey(appAttemptId);
>       if (db.get(key) != null) {
>         db.delete(key);
>       }
>     } catch (Exception e) {
>       logger.error("Failed to remove the application attempt {} local path in DB",
>         appAttemptId, e);
>     }
>   }
> }
>  {code}
> No need to check `db.get(key) != null` before the delete; LevelDB handles 
> this case itself.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


