[jira] [Resolved] (SPARK-40044) Incorrect target interval type in cast overflow errors
[ https://issues.apache.org/jira/browse/SPARK-40044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40044. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37470 [https://github.com/apache/spark/pull/37470] > Incorrect target interval type in cast overflow errors > -- > > Key: SPARK-40044 > URL: https://issues.apache.org/jira/browse/SPARK-40044 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > The example below shows the issue: > {code:sql} > spark-sql> select CAST(9223372036854775807L AS INTERVAL YEAR TO MONTH); > org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value > 9223372036854775807L of the type "BIGINT" cannot be cast to "INTERVAL MONTH" > due to an overflow. > {code} > The target type "INTERVAL MONTH" is incorrect. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40044) Incorrect target interval type in cast overflow errors
[ https://issues.apache.org/jira/browse/SPARK-40044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40044: Assignee: Apache Spark (was: Max Gekk) > Incorrect target interval type in cast overflow errors > -- > > Key: SPARK-40044 > URL: https://issues.apache.org/jira/browse/SPARK-40044 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > The example below shows the issue: > {code:sql} > spark-sql> select CAST(9223372036854775807L AS INTERVAL YEAR TO MONTH); > org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value > 9223372036854775807L of the type "BIGINT" cannot be cast to "INTERVAL MONTH" > due to an overflow. > {code} > The target type "INTERVAL MONTH" is incorrect. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40044) Incorrect target interval type in cast overflow errors
[ https://issues.apache.org/jira/browse/SPARK-40044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40044: Assignee: Max Gekk (was: Apache Spark) > Incorrect target interval type in cast overflow errors > -- > > Key: SPARK-40044 > URL: https://issues.apache.org/jira/browse/SPARK-40044 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below shows the issue: > {code:sql} > spark-sql> select CAST(9223372036854775807L AS INTERVAL YEAR TO MONTH); > org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value > 9223372036854775807L of the type "BIGINT" cannot be cast to "INTERVAL MONTH" > due to an overflow. > {code} > The target type "INTERVAL MONTH" is incorrect. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40044) Incorrect target interval type in cast overflow errors
[ https://issues.apache.org/jira/browse/SPARK-40044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578261#comment-17578261 ] Apache Spark commented on SPARK-40044: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37470 > Incorrect target interval type in cast overflow errors > -- > > Key: SPARK-40044 > URL: https://issues.apache.org/jira/browse/SPARK-40044 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below shows the issue: > {code:sql} > spark-sql> select CAST(9223372036854775807L AS INTERVAL YEAR TO MONTH); > org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value > 9223372036854775807L of the type "BIGINT" cannot be cast to "INTERVAL MONTH" > due to an overflow. > {code} > The target type "INTERVAL MONTH" is incorrect. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40044) Incorrect target interval type in cast overflow errors
Max Gekk created SPARK-40044: Summary: Incorrect target interval type in cast overflow errors Key: SPARK-40044 URL: https://issues.apache.org/jira/browse/SPARK-40044 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk The example below shows the issue: {code:sql} spark-sql> select CAST(9223372036854775807L AS INTERVAL YEAR TO MONTH); org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value 9223372036854775807L of the type "BIGINT" cannot be cast to "INTERVAL MONTH" due to an overflow. {code} The target type "INTERVAL MONTH" is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40041) Add Document Parameters for pyspark.sql.window
[ https://issues.apache.org/jira/browse/SPARK-40041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40041: Assignee: Qian Sun > Add Document Parameters for pyspark.sql.window > -- > > Key: SPARK-40041 > URL: https://issues.apache.org/jira/browse/SPARK-40041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Assignee: Qian Sun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40041) Add Document Parameters for pyspark.sql.window
[ https://issues.apache.org/jira/browse/SPARK-40041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40041. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37476 [https://github.com/apache/spark/pull/37476] > Add Document Parameters for pyspark.sql.window > -- > > Key: SPARK-40041 > URL: https://issues.apache.org/jira/browse/SPARK-40041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Assignee: Qian Sun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40020) centralize the code of qualifying identifiers in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-40020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40020: --- Assignee: Wenchen Fan > centralize the code of qualifying identifiers in SessionCatalog > --- > > Key: SPARK-40020 > URL: https://issues.apache.org/jira/browse/SPARK-40020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40020) centralize the code of qualifying identifiers in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-40020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40020. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37415 [https://github.com/apache/spark/pull/37415] > centralize the code of qualifying identifiers in SessionCatalog > --- > > Key: SPARK-40020 > URL: https://issues.apache.org/jira/browse/SPARK-40020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40026) spark-shell can not instantiate object whose class is defined in REPL dynamically
[ https://issues.apache.org/jira/browse/SPARK-40026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40026: - Priority: Major (was: Blocker) > spark-shell can not instantiate object whose class is defined in REPL > dynamically > - > > Key: SPARK-40026 > URL: https://issues.apache.org/jira/browse/SPARK-40026 > Project: Spark > Issue Type: Question > Components: Spark Shell >Affects Versions: 2.4.8, 3.0.3 > Environment: Spark2.3.x ~ Spark3.0.x > Scala2.11.x ~ Scala2.13.x > Java 8 >Reporter: Kernel Force >Priority: Major > > spark-shell throws {{NoSuchMethodException}} if I define a class in REPL and > then call {{newInstance}} via reflection. > {code:scala} > Spark context available as 'sc' (master = yarn, app id = > application_1656488084960_0162). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.3 > /_/ > > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_141) > Type in expressions to have them evaluated. > Type :help for more information. > scala> class Demo { > | def demo(s: String): Unit = println(s) > | } > defined class Demo > scala> classOf[Demo].newInstance().demo("OK") > java.lang.InstantiationException: Demo > at java.lang.Class.newInstance(Class.java:427) > ... 47 elided > Caused by: java.lang.NoSuchMethodException: Demo.() > at java.lang.Class.getConstructor0(Class.java:3082) > at java.lang.Class.newInstance(Class.java:412) > ... 47 more > {code} > But the same code works fine in native scala REPL: > {code:scala} > Welcome to Scala 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131). > Type in expressions for evaluation. Or try :help. > scala> class Demo { > | def demo(s: String): Unit = println(s) > | } > defined class Demo > scala> classOf[Demo].newInstance().demo("OK") > OK > {code} > What's the difference between spark-shell REPL and native scala REPL? > I guess the Demo class might be treated as inner class in spark-shell REPL. > But ... how to solve the problem? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40026) spark-shell can not instantiate object whose class is defined in REPL dynamically
[ https://issues.apache.org/jira/browse/SPARK-40026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40026: - Issue Type: Bug (was: Question) > spark-shell can not instantiate object whose class is defined in REPL > dynamically > - > > Key: SPARK-40026 > URL: https://issues.apache.org/jira/browse/SPARK-40026 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.4.8, 3.0.3 > Environment: Spark2.3.x ~ Spark3.0.x > Scala2.11.x ~ Scala2.13.x > Java 8 >Reporter: Kernel Force >Priority: Major > > spark-shell throws {{NoSuchMethodException}} if I define a class in REPL and > then call {{newInstance}} via reflection. > {code:scala} > Spark context available as 'sc' (master = yarn, app id = > application_1656488084960_0162). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.3 > /_/ > > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_141) > Type in expressions to have them evaluated. > Type :help for more information. > scala> class Demo { > | def demo(s: String): Unit = println(s) > | } > defined class Demo > scala> classOf[Demo].newInstance().demo("OK") > java.lang.InstantiationException: Demo > at java.lang.Class.newInstance(Class.java:427) > ... 47 elided > Caused by: java.lang.NoSuchMethodException: Demo.() > at java.lang.Class.getConstructor0(Class.java:3082) > at java.lang.Class.newInstance(Class.java:412) > ... 47 more > {code} > But the same code works fine in native scala REPL: > {code:scala} > Welcome to Scala 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131). > Type in expressions for evaluation. Or try :help. > scala> class Demo { > | def demo(s: String): Unit = println(s) > | } > defined class Demo > scala> classOf[Demo].newInstance().demo("OK") > OK > {code} > What's the difference between spark-shell REPL and native scala REPL? > I guess the Demo class might be treated as inner class in spark-shell REPL. > But ... how to solve the problem? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40043) Document DataStreamWriter.toTable and DataStreamReader.table
[ https://issues.apache.org/jira/browse/SPARK-40043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578234#comment-17578234 ] Apache Spark commented on SPARK-40043: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37477 > Document DataStreamWriter.toTable and DataStreamReader.table > > > Key: SPARK-40043 > URL: https://issues.apache.org/jira/browse/SPARK-40043 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-33836 added > DataStreamWriter.toTable and DataStreamReader.table but they are mistakenly > not documented. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40043) Document DataStreamWriter.toTable and DataStreamReader.table
[ https://issues.apache.org/jira/browse/SPARK-40043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40043: Assignee: Apache Spark > Document DataStreamWriter.toTable and DataStreamReader.table > > > Key: SPARK-40043 > URL: https://issues.apache.org/jira/browse/SPARK-40043 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-33836 added > DataStreamWriter.toTable and DataStreamReader.table but they are mistakenly > not documented. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40043) Document DataStreamWriter.toTable and DataStreamReader.table
[ https://issues.apache.org/jira/browse/SPARK-40043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40043: Assignee: (was: Apache Spark) > Document DataStreamWriter.toTable and DataStreamReader.table > > > Key: SPARK-40043 > URL: https://issues.apache.org/jira/browse/SPARK-40043 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-33836 added > DataStreamWriter.toTable and DataStreamReader.table but they are mistakenly > not documented. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40043) Document DataStreamWriter.toTable and DataStreamReader.table
Hyukjin Kwon created SPARK-40043: Summary: Document DataStreamWriter.toTable and DataStreamReader.table Key: SPARK-40043 URL: https://issues.apache.org/jira/browse/SPARK-40043 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 3.2.2, 3.3.0, 3.1.3, 3.4.0 Reporter: Hyukjin Kwon https://issues.apache.org/jira/browse/SPARK-33836 added DataStreamWriter.toTable and DataStreamReader.table but they are mistakenly not documented. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
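For context, a minimal PySpark sketch of the two APIs this ticket asks to document; the table names and checkpoint path below are illustrative, not taken from the ticket.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataStreamReader.table: read a streaming DataFrame from an existing table
# (the table name "source_events" is only an example).
events = spark.readStream.table("source_events")

# DataStreamWriter.toTable: start the query and write its output to a table.
query = (
    events.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/source_events")  # example path
    .toTable("sink_events")
)
# query is a StreamingQuery; call awaitTermination() or stop() as needed.
{code}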
[jira] [Commented] (SPARK-40041) Add Document Parameters for pyspark.sql.window
[ https://issues.apache.org/jira/browse/SPARK-40041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578230#comment-17578230 ] Apache Spark commented on SPARK-40041: -- User 'dcoliversun' has created a pull request for this issue: https://github.com/apache/spark/pull/37476 > Add Document Parameters for pyspark.sql.window > -- > > Key: SPARK-40041 > URL: https://issues.apache.org/jira/browse/SPARK-40041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40041) Add Document Parameters for pyspark.sql.window
[ https://issues.apache.org/jira/browse/SPARK-40041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40041: Assignee: Apache Spark > Add Document Parameters for pyspark.sql.window > -- > > Key: SPARK-40041 > URL: https://issues.apache.org/jira/browse/SPARK-40041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40041) Add Document Parameters for pyspark.sql.window
[ https://issues.apache.org/jira/browse/SPARK-40041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40041: Assignee: (was: Apache Spark) > Add Document Parameters for pyspark.sql.window > -- > > Key: SPARK-40041 > URL: https://issues.apache.org/jira/browse/SPARK-40041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40042) Make pyspark.sql.streaming.query examples self-contained
Qian Sun created SPARK-40042: Summary: Make pyspark.sql.streaming.query examples self-contained Key: SPARK-40042 URL: https://issues.apache.org/jira/browse/SPARK-40042 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Qian Sun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39895) pyspark drop doesn't accept *cols
[ https://issues.apache.org/jira/browse/SPARK-39895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39895: Assignee: Santosh Pingale > pyspark drop doesn't accept *cols > -- > > Key: SPARK-39895 > URL: https://issues.apache.org/jira/browse/SPARK-39895 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.3, 3.3.0, 3.2.2 >Reporter: Santosh Pingale >Assignee: Santosh Pingale >Priority: Minor > Fix For: 3.4.0 > > > Pyspark dataframe drop has following signature: > {color:#4c9aff}{{def drop(self, *cols: "ColumnOrName") -> > "DataFrame":}}{color} > However when we try to pass multiple Column types to drop function it raises > TypeError > {{each col in the param list should be a string}} > *Minimal reproducible example:* > {color:#4c9aff}values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), > ("id_1", 3, 3), ("id_2", 4, 3)]{color} > {color:#4c9aff}df = spark.createDataFrame(values, "id string, point int, > count int"){color} > |– id: string (nullable = true)| > |– point: integer (nullable = true)| > |– count: integer (nullable = true)| > {color:#4c9aff}{{df.drop(df.point, df.count)}}{color} > {quote}{color:#505f79}/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py > in drop(self, *cols){color} > {color:#505f79}2537 for col in cols:{color} > {color:#505f79}2538 if not isinstance(col, str):{color} > {color:#505f79}-> 2539 raise TypeError("each col in the param list should be > a string"){color} > {color:#505f79}2540 jdf = self._jdf.drop(self._jseq(cols)){color} > {color:#505f79}2541{color} > {color:#505f79}TypeError: each col in the param list should be a string{color} > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39895) pyspark drop doesn't accept *cols
[ https://issues.apache.org/jira/browse/SPARK-39895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39895. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37335 [https://github.com/apache/spark/pull/37335] > pyspark drop doesn't accept *cols > -- > > Key: SPARK-39895 > URL: https://issues.apache.org/jira/browse/SPARK-39895 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.3, 3.3.0, 3.2.2 >Reporter: Santosh Pingale >Priority: Minor > Fix For: 3.4.0 > > > Pyspark dataframe drop has following signature: > {color:#4c9aff}{{def drop(self, *cols: "ColumnOrName") -> > "DataFrame":}}{color} > However when we try to pass multiple Column types to drop function it raises > TypeError > {{each col in the param list should be a string}} > *Minimal reproducible example:* > {color:#4c9aff}values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), > ("id_1", 3, 3), ("id_2", 4, 3)]{color} > {color:#4c9aff}df = spark.createDataFrame(values, "id string, point int, > count int"){color} > |– id: string (nullable = true)| > |– point: integer (nullable = true)| > |– count: integer (nullable = true)| > {color:#4c9aff}{{df.drop(df.point, df.count)}}{color} > {quote}{color:#505f79}/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py > in drop(self, *cols){color} > {color:#505f79}2537 for col in cols:{color} > {color:#505f79}2538 if not isinstance(col, str):{color} > {color:#505f79}-> 2539 raise TypeError("each col in the param list should be > a string"){color} > {color:#505f79}2540 jdf = self._jdf.drop(self._jseq(cols)){color} > {color:#505f79}2541{color} > {color:#505f79}TypeError: each col in the param list should be a string{color} > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
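As a usage sketch of what the fix enables (hedged: the Column-argument form is assumed to be accepted only from the 3.4.0 fix version onwards):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]
df = spark.createDataFrame(values, "id string, point int, count int")

# String column names have always been accepted:
df.drop("point", "count").printSchema()

# Column arguments raised "each col in the param list should be a string"
# before the fix; with SPARK-39895 resolved they should be accepted too.
# df["count"] is used instead of df.count because count is also a DataFrame method.
df.drop(df.point, df["count"]).printSchema()
{code}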
[jira] [Created] (SPARK-40041) Add Document Parameters for pyspark.sql.window
Qian Sun created SPARK-40041: Summary: Add Document Parameters for pyspark.sql.window Key: SPARK-40041 URL: https://issues.apache.org/jira/browse/SPARK-40041 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Qian Sun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
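Since the ticket has no description, a short sketch of the pyspark.sql.window API whose parameters it proposes to document; the data and window spec are illustrative only.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# Window.partitionBy / orderBy / rowsBetween are the entry points whose
# parameters (cols, start, end) the ticket wants documented.
w = (
    Window.partitionBy("key")
    .orderBy(F.col("value"))
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
df.withColumn("running_sum", F.sum("value").over(w)).show()
{code}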
[jira] [Commented] (SPARK-39708) ALS Model Loading
[ https://issues.apache.org/jira/browse/SPARK-39708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578220#comment-17578220 ] Ruifeng Zheng commented on SPARK-39708: --- if you are using `pyspark.mllib.recommendation`, try to load it with `MatrixFactorizationModel.load(sc, path)` https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.mllib.recommendation.MatrixFactorizationModel.html#pyspark.mllib.recommendation.MatrixFactorizationModel > ALS Model Loading > - > > Key: SPARK-39708 > URL: https://issues.apache.org/jira/browse/SPARK-39708 > Project: Spark > Issue Type: Question > Components: PySpark, Spark Submit >Affects Versions: 3.2.0 >Reporter: zehra >Priority: Critical > Labels: model, pyspark > > I have an ALS model and saved it with these codes: > {code:java} > als_path = "saved_models/best" > best_model.save(sc, path= als_path){code} > However, when I try to load this model, it gives this error message: > > {code:java} > ---> 10 model2 = ALS.load(als_path) > > File /usr/local/spark/python/pyspark/ml/util.py:332, in > MLReadable.load(cls, path) > 329 @classmethod > 330 def load(cls, path): > 331 """Reads an ML instance from the input path, a shortcut of > `read().load(path)`.""" > --> 332 return cls.read().load(path) > > File /usr/local/spark/python/pyspark/ml/util.py:282, in > JavaMLReader.load(self, path) > 280 if not isinstance(path, str): > 281 raise TypeError("path should be a string, got type %s" % > type(path)) > --> 282 java_obj = self._jread.load(path) > 283 if not hasattr(self._clazz, "_from_java"): > 284 raise NotImplementedError("This Java ML type cannot be loaded > into Python currently: %r" > 285 % self._clazz) > > File > /usr/local/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py:1321, > in JavaMember.__call__(self, *args) > 1315 command = proto.CALL_COMMAND_NAME +\ > 1316 self.command_header +\ > 1317 args_command +\ > 1318 proto.END_COMMAND_PART > 1320 answer = self.gateway_client.send_command(command) > -> 1321 return_value = get_return_value( > 1322 answer, self.gateway_client, self.target_id, self.name) > 1324 for temp_arg in temp_args: > 1325 temp_arg._detach() > > File /usr/local/spark/python/pyspark/sql/utils.py:111, in > capture_sql_exception..deco(*a, **kw) > 109 def deco(*a, **kw): > 110 try: > --> 111 return f(*a, **kw) > 112 except py4j.protocol.Py4JJavaError as e: > 113 converted = convert_exception(e.java_exception) > > File > /usr/local/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py:326, in > get_return_value(answer, gateway_client, target_id, name) > 324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client) > 325 if answer[1] == REFERENCE_TYPE: > --> 326 raise Py4JJavaError( > 327 "An error occurred while calling {0}{1}{2}.\n". > 328 format(target_id, ".", name), value) > 329 else: > 330 raise Py4JError( > 331 "An error occurred while calling {0}{1}{2}. > Trace:\n{3}\n". > 332 format(target_id, ".", name, value)) > > Py4JJavaError: An error occurred while calling o372.load. 
> : org.json4s.MappingException: Did not find value which can be converted > into java.lang.String > at org.json4s.reflect.package$.fail(package.scala:53) > at org.json4s.Extraction$.$anonfun$convert$2(Extraction.scala:881) > at scala.Option.getOrElse(Option.scala:189) > at org.json4s.Extraction$.convert(Extraction.scala:881) > at org.json4s.Extraction$.$anonfun$extract$10(Extraction.scala:456) > at > org.json4s.Extraction$.$anonfun$customOrElse$1(Extraction.scala:780) > > {code} > > I both tried to use `ALS.load` or `ALSModel.load` as shown in the Apache > spark documentation: > [https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html#:~:text=als_path%20%3D%20temp_path%20%2B%20%22/als%22%0A%3E%3E%3E][1] > > [1]: > https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html#:~:text=als_path%20%3D%20temp_path%20%2B%20%22/als%22%0A%3E%3E%3E -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39708) ALS Model Loading
[ https://issues.apache.org/jira/browse/SPARK-39708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578219#comment-17578219 ] Ruifeng Zheng commented on SPARK-39708: --- ??File /usr/local/spark/python/pyspark/ml/util.py:332, in MLReadable.load(cls, path)?? it seems that you were useing the `pyspark.ml`, which is DataFrame based; ??best_model.save(sc, path= als_path)?? then the `pyspark.ml.recommendation.ALSModel`'s save method, do not have `sc` as a parameter https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.ml.recommendation.ALSModel.html#pyspark.ml.recommendation.ALSModel.save > ALS Model Loading > - > > Key: SPARK-39708 > URL: https://issues.apache.org/jira/browse/SPARK-39708 > Project: Spark > Issue Type: Question > Components: PySpark, Spark Submit >Affects Versions: 3.2.0 >Reporter: zehra >Priority: Critical > Labels: model, pyspark > > I have an ALS model and saved it with these codes: > {code:java} > als_path = "saved_models/best" > best_model.save(sc, path= als_path){code} > However, when I try to load this model, it gives this error message: > > {code:java} > ---> 10 model2 = ALS.load(als_path) > > File /usr/local/spark/python/pyspark/ml/util.py:332, in > MLReadable.load(cls, path) > 329 @classmethod > 330 def load(cls, path): > 331 """Reads an ML instance from the input path, a shortcut of > `read().load(path)`.""" > --> 332 return cls.read().load(path) > > File /usr/local/spark/python/pyspark/ml/util.py:282, in > JavaMLReader.load(self, path) > 280 if not isinstance(path, str): > 281 raise TypeError("path should be a string, got type %s" % > type(path)) > --> 282 java_obj = self._jread.load(path) > 283 if not hasattr(self._clazz, "_from_java"): > 284 raise NotImplementedError("This Java ML type cannot be loaded > into Python currently: %r" > 285 % self._clazz) > > File > /usr/local/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py:1321, > in JavaMember.__call__(self, *args) > 1315 command = proto.CALL_COMMAND_NAME +\ > 1316 self.command_header +\ > 1317 args_command +\ > 1318 proto.END_COMMAND_PART > 1320 answer = self.gateway_client.send_command(command) > -> 1321 return_value = get_return_value( > 1322 answer, self.gateway_client, self.target_id, self.name) > 1324 for temp_arg in temp_args: > 1325 temp_arg._detach() > > File /usr/local/spark/python/pyspark/sql/utils.py:111, in > capture_sql_exception..deco(*a, **kw) > 109 def deco(*a, **kw): > 110 try: > --> 111 return f(*a, **kw) > 112 except py4j.protocol.Py4JJavaError as e: > 113 converted = convert_exception(e.java_exception) > > File > /usr/local/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py:326, in > get_return_value(answer, gateway_client, target_id, name) > 324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client) > 325 if answer[1] == REFERENCE_TYPE: > --> 326 raise Py4JJavaError( > 327 "An error occurred while calling {0}{1}{2}.\n". > 328 format(target_id, ".", name), value) > 329 else: > 330 raise Py4JError( > 331 "An error occurred while calling {0}{1}{2}. > Trace:\n{3}\n". > 332 format(target_id, ".", name, value)) > > Py4JJavaError: An error occurred while calling o372.load. 
> : org.json4s.MappingException: Did not find value which can be converted > into java.lang.String > at org.json4s.reflect.package$.fail(package.scala:53) > at org.json4s.Extraction$.$anonfun$convert$2(Extraction.scala:881) > at scala.Option.getOrElse(Option.scala:189) > at org.json4s.Extraction$.convert(Extraction.scala:881) > at org.json4s.Extraction$.$anonfun$extract$10(Extraction.scala:456) > at > org.json4s.Extraction$.$anonfun$customOrElse$1(Extraction.scala:780) > > {code} > > I both tried to use `ALS.load` or `ALSModel.load` as shown in the Apache > spark documentation: > [https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html#:~:text=als_path%20%3D%20temp_path%20%2B%20%22/als%22%0A%3E%3E%3E][1] > > [1]: > https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html#:~:text=als_path%20%3D%20temp_path%20%2B%20%22/als%22%0A%3E%3E%3E -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail:
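To make the distinction drawn in the two comments above concrete, a hedged sketch of the matching save/load pairs; the paths are illustrative, and `ratings_df`/`sc` are assumed to already exist.

{code:python}
# DataFrame-based API (pyspark.ml): save/load take only a path, never a SparkContext.
from pyspark.ml.recommendation import ALS, ALSModel

als = ALS(userCol="user", itemCol="item", ratingCol="rating")
model = als.fit(ratings_df)                       # ratings_df: an assumed ratings DataFrame
model.save("saved_models/best_ml")                # note: no `sc` argument here
reloaded = ALSModel.load("saved_models/best_ml")  # load the fitted model with ALSModel, not ALS

# RDD-based API (pyspark.mllib): load takes the SparkContext plus a path,
# as pointed out in the comment above.
from pyspark.mllib.recommendation import MatrixFactorizationModel

mf_model = MatrixFactorizationModel.load(sc, "saved_models/best_mllib")  # sc assumed in scope
{code}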
[jira] [Updated] (SPARK-40021) `Cancel workflow` can not cancel some jobs
[ https://issues.apache.org/jira/browse/SPARK-40021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-40021: -- Attachment: Screen Shot 2022-08-11 at 10.07.24.png > `Cancel workflow` can not cancel some jobs > -- > > Key: SPARK-40021 > URL: https://issues.apache.org/jira/browse/SPARK-40021 > Project: Spark > Issue Type: Test > Components: Project Infra, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > Attachments: Screen Shot 2022-08-09 at 22.49.13.png, Screen Shot > 2022-08-11 at 10.07.24.png > > > I have been observing this behavior for a long time: > sometime I want to cancel the workflow to release resource for other > workflows, so I click the {{*{color:red}Cancel workflow{color}*}} button, > most jobs in current workflow will be canceled in several minutes, but some > jobs can never be killed, including: > * Run / Build modules: pyspark-sql, pyspark-mllib, pyspark-resource > * Run / Build modules: pyspark-core, pyspark-streaming, pyspark-ml > * Run / Build modules: pyspark-pandas > * Run / Build modules: pyspark-pandas-slow > * Run / Linters, licenses, dependencies and documentation generation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40021) `Cancel workflow` can not cancel some jobs
[ https://issues.apache.org/jira/browse/SPARK-40021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578212#comment-17578212 ] Ruifeng Zheng edited comment on SPARK-40021 at 8/11/22 2:08 AM: ??Did you try to kill those Python jobs individually??? sorry but I don't find a place to cancel a job individually. I only tried the `cancel workflow` and `cancel run`, which should cancel the whole workflow was (Author: podongfeng): Did you try to kill those Python jobs individually? > `Cancel workflow` can not cancel some jobs > -- > > Key: SPARK-40021 > URL: https://issues.apache.org/jira/browse/SPARK-40021 > Project: Spark > Issue Type: Test > Components: Project Infra, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > Attachments: Screen Shot 2022-08-09 at 22.49.13.png, Screen Shot > 2022-08-11 at 10.07.24.png > > > I have been observing this behavior for a long time: > sometime I want to cancel the workflow to release resource for other > workflows, so I click the {{*{color:red}Cancel workflow{color}*}} button, > most jobs in current workflow will be canceled in several minutes, but some > jobs can never be killed, including: > * Run / Build modules: pyspark-sql, pyspark-mllib, pyspark-resource > * Run / Build modules: pyspark-core, pyspark-streaming, pyspark-ml > * Run / Build modules: pyspark-pandas > * Run / Build modules: pyspark-pandas-slow > * Run / Linters, licenses, dependencies and documentation generation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40040) Push local limit to both sides if join condition is empty
[ https://issues.apache.org/jira/browse/SPARK-40040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40040: Assignee: Apache Spark > Push local limit to both sides if join condition is empty > - > > Key: SPARK-40040 > URL: https://issues.apache.org/jira/browse/SPARK-40040 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40040) Push local limit to both sides if join condition is empty
[ https://issues.apache.org/jira/browse/SPARK-40040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40040: Assignee: (was: Apache Spark) > Push local limit to both sides if join condition is empty > - > > Key: SPARK-40040 > URL: https://issues.apache.org/jira/browse/SPARK-40040 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40021) `Cancel workflow` can not cancel some jobs
[ https://issues.apache.org/jira/browse/SPARK-40021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578212#comment-17578212 ] Ruifeng Zheng commented on SPARK-40021: --- Did you try to kill those Python jobs individually? > `Cancel workflow` can not cancel some jobs > -- > > Key: SPARK-40021 > URL: https://issues.apache.org/jira/browse/SPARK-40021 > Project: Spark > Issue Type: Test > Components: Project Infra, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > Attachments: Screen Shot 2022-08-09 at 22.49.13.png > > > I have been observing this behavior for a long time: > sometime I want to cancel the workflow to release resource for other > workflows, so I click the {{*{color:red}Cancel workflow{color}*}} button, > most jobs in current workflow will be canceled in several minutes, but some > jobs can never be killed, including: > * Run / Build modules: pyspark-sql, pyspark-mllib, pyspark-resource > * Run / Build modules: pyspark-core, pyspark-streaming, pyspark-ml > * Run / Build modules: pyspark-pandas > * Run / Build modules: pyspark-pandas-slow > * Run / Linters, licenses, dependencies and documentation generation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40040) Push local limit to both sides if join condition is empty
[ https://issues.apache.org/jira/browse/SPARK-40040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578211#comment-17578211 ] Apache Spark commented on SPARK-40040: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/37475 > Push local limit to both sides if join condition is empty > - > > Key: SPARK-40040 > URL: https://issues.apache.org/jira/browse/SPARK-40040 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40040) Push local limit to both sides if join condition is empty
[ https://issues.apache.org/jira/browse/SPARK-40040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40040: Summary: Push local limit to both sides if join condition is empty (was: Push local limit through outer join if join condition is empty) > Push local limit to both sides if join condition is empty > - > > Key: SPARK-40040 > URL: https://issues.apache.org/jira/browse/SPARK-40040 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40040) Push local limit through outer join if join condition is empty
Yuming Wang created SPARK-40040: --- Summary: Push local limit through outer join if join condition is empty Key: SPARK-40040 URL: https://issues.apache.org/jira/browse/SPARK-40040 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
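The ticket carries no description, so as an illustration only: the title suggests plans like the sketch below, where a join has no condition (a cross product) under a LIMIT, and a local limit could safely be applied to both join inputs first. This is a reading of the title, not the actual optimizer rule.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.range(1_000_000).toDF("l")
right = spark.range(1_000_000).toDF("r")

# Join with an empty condition (a cross product) followed by a limit.
# Taking any 5 rows from each side already yields at least 5 combined rows,
# so a LocalLimit(5) on both inputs would not change the result of LIMIT 5.
result = left.crossJoin(right).limit(5)
result.explain()
{code}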
[jira] [Commented] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578195#comment-17578195 ] Jungtaek Lim commented on SPARK-40039: -- Very interesting one to see! (Disclaimer: Abortable was something I worked with Steve.) Have you gone through some benchmarks to figure out this works with small to big files? One thing I wonder is whether multipart upload performs well with tiny file. We have lots of tiny files in checkpoint and all files could be pretty tiny for stateless query. > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578193#comment-17578193 ] Apache Spark commented on SPARK-40039: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/37474 > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40039: Assignee: Attila Zsolt Piros (was: Apache Spark) > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40039: Assignee: Apache Spark (was: Attila Zsolt Piros) > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Apache Spark >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-40039: --- Summary: Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface (was: Introducing checkpoint file manager based on Hadoop's Abortable interface) > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40039) Introducing checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578189#comment-17578189 ] Attila Zsolt Piros commented on SPARK-40039: I am working on this. > Introducing checkpoint file manager based on Hadoop's Abortable interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40039) Introducing checkpoint file manager based on Hadoop's Abortable interface
Attila Zsolt Piros created SPARK-40039: -- Summary: Introducing checkpoint file manager based on Hadoop's Abortable interface Key: SPARK-40039 URL: https://issues.apache.org/jira/browse/SPARK-40039 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Attila Zsolt Piros Assignee: Attila Zsolt Piros Currently on S3 the checkpoint file manager (called FileContextBasedCheckpointFileManager) is based on rename. So when a file is opened for an atomic stream, a temporary file is used instead, and when the stream is committed the file is renamed. But on S3 a rename is a file copy, so it has serious performance implications. On Hadoop 3 there is a new interface called *Abortable*, and *S3AFileSystem* has this capability, implemented on top of S3's multipart upload. So when the file is committed a POST is sent ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) and when it is aborted a DELETE is sent ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39983) Should not cache unserialized broadcast relations on the driver
[ https://issues.apache.org/jira/browse/SPARK-39983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-39983. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37413 [https://github.com/apache/spark/pull/37413] > Should not cache unserialized broadcast relations on the driver > --- > > Key: SPARK-39983 > URL: https://issues.apache.org/jira/browse/SPARK-39983 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Alex Balikov >Assignee: Alex Balikov >Priority: Minor > Fix For: 3.4.0 > > > In TorrentBroadcast.writeBlocks we store the unserialized broadcast object in > addition to the serialized version of it - > {code:java} > private def writeBlocks(value: T): Int = { > import StorageLevel._ > // Store a copy of the broadcast variable in the driver so that tasks run > on the driver > // do not create a duplicate copy of the broadcast variable's value. > val blockManager = SparkEnv.get.blockManager > if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, > tellMaster = false)) { > throw new SparkException(s"Failed to store $broadcastId in > BlockManager") > } > {code} > In case of broadcast relations, these objects can be fairly large (60MB in > one observed case) and are not strictly necessary on the driver. > Add the option to not keep the unserialized versions of the objects. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39983) Should not cache unserialized broadcast relations on the driver
[ https://issues.apache.org/jira/browse/SPARK-39983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-39983: -- Assignee: Alex Balikov > Should not cache unserialized broadcast relations on the driver > --- > > Key: SPARK-39983 > URL: https://issues.apache.org/jira/browse/SPARK-39983 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Alex Balikov >Assignee: Alex Balikov >Priority: Minor > > In TorrentBroadcast.writeBlocks we store the unserialized broadcast object in > addition to the serialized version of it - > {code:java} > private def writeBlocks(value: T): Int = { > import StorageLevel._ > // Store a copy of the broadcast variable in the driver so that tasks run > on the driver > // do not create a duplicate copy of the broadcast variable's value. > val blockManager = SparkEnv.get.blockManager > if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, > tellMaster = false)) { > throw new SparkException(s"Failed to store $broadcastId in > BlockManager") > } > {code} > In case of broadcast relations, these objects can be fairly large (60MB in > one observed case) and are not strictly necessary on the driver. > Add the option to not keep the unserialized versions of the objects. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578168#comment-17578168 ] Apache Spark commented on SPARK-40037: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/37473 > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40038) spark.sql.files.maxPartitionBytes does not observe on-disk compression
[ https://issues.apache.org/jira/browse/SPARK-40038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578169#comment-17578169 ] RJ Marcus commented on SPARK-40038: --- the screenshot with spillage didn't have the stage fully completed when I made the capture, which is why the total data input / output may not look correct (should be 230GB) > spark.sql.files.maxPartitionBytes does not observe on-disk compression > -- > > Key: SPARK-40038 > URL: https://issues.apache.org/jira/browse/SPARK-40038 > Project: Spark > Issue Type: Question > Components: Input/Output, Optimizer, PySpark, SQL >Affects Versions: 3.2.0 > Environment: files: > - ORC with snappy compression > - 232 GB files on disk > - 1800 files on disk (pretty sure no individual file is over 200MB) > - 9 partitions on disk > cluster: > - EMR 6.6.0 (spark 3.2.0) > - cluster: 288 vCPU (executors), 1.1TB memory (executors) > OS info: > LSB Version: > :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch > Distributor ID: Amazon > Description: Amazon Linux release 2 (Karoo) > Release: 2 > Codename: Karoo >Reporter: RJ Marcus >Priority: Major > Attachments: Screenshot from 2022-08-10 16-50-37.png, Screenshot from > 2022-08-10 16-59-56.png > > > Why does `spark.sql.files.maxPartitionBytes` estimate the number of > partitions based on {_}file size on disk instead of the uncompressed file > size{_}? > For example I have a dataset that is 213GB on disk. When I read this in to my > application I get 2050 partitions based on the default value of 128MB for > maxPartitionBytes. My application is a simple broadcast index join that adds > 1 column to the dataframe and writes it out. There is no shuffle. > Initially the size of input /output records seem ok, but I still get a large > amount of memory "spill" on the executors. I believe this is due to the data > being highly compressed and each partition becoming too big when it is > deserialized to work on in memory. > !image-2022-08-10-16-59-05-233.png! > (If I try to do a repartition immediately after reading I still see the first > stage spilling memory to disk, so that is not the right solution or what I'm > interested in.) > Instead, I attempt to lower maxPartitionBytes by the (average) compression > ratio of my files (about 7x, so let's round up to 8). So I set > maxPartitionBytes=16MB. At this point I see that spark is reading in from > the file in 12-28 MB chunks. Now it makes 14316 partitions on the initial > file read and completes with no spillage. > !image-2022-08-10-16-59-59-778.png! > > Is there something I'm missing here? Is this just intended behavior? How can > I tune my partition size correctly for my application when I do not know how > much the data will be compressed ahead of time? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578167#comment-17578167 ] Apache Spark commented on SPARK-40037: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/37473 > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40038) spark.sql.files.maxPartitionBytes does not observe on-disk compression
[ https://issues.apache.org/jira/browse/SPARK-40038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Marcus updated SPARK-40038: -- Attachment: Screenshot from 2022-08-10 16-59-56.png Screenshot from 2022-08-10 16-50-37.png > spark.sql.files.maxPartitionBytes does not observe on-disk compression > -- > > Key: SPARK-40038 > URL: https://issues.apache.org/jira/browse/SPARK-40038 > Project: Spark > Issue Type: Question > Components: Input/Output, Optimizer, PySpark, SQL >Affects Versions: 3.2.0 > Environment: files: > - ORC with snappy compression > - 232 GB files on disk > - 1800 files on disk (pretty sure no individual file is over 200MB) > - 9 partitions on disk > cluster: > - EMR 6.6.0 (spark 3.2.0) > - cluster: 288 vCPU (executors), 1.1TB memory (executors) > OS info: > LSB Version: > :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch > Distributor ID: Amazon > Description: Amazon Linux release 2 (Karoo) > Release: 2 > Codename: Karoo >Reporter: RJ Marcus >Priority: Major > Attachments: Screenshot from 2022-08-10 16-50-37.png, Screenshot from > 2022-08-10 16-59-56.png > > > Why does `spark.sql.files.maxPartitionBytes` estimate the number of > partitions based on {_}file size on disk instead of the uncompressed file > size{_}? > For example I have a dataset that is 213GB on disk. When I read this in to my > application I get 2050 partitions based on the default value of 128MB for > maxPartitionBytes. My application is a simple broadcast index join that adds > 1 column to the dataframe and writes it out. There is no shuffle. > Initially the size of input /output records seem ok, but I still get a large > amount of memory "spill" on the executors. I believe this is due to the data > being highly compressed and each partition becoming too big when it is > deserialized to work on in memory. > !image-2022-08-10-16-59-05-233.png! > (If I try to do a repartition immediately after reading I still see the first > stage spilling memory to disk, so that is not the right solution or what I'm > interested in.) > Instead, I attempt to lower maxPartitionBytes by the (average) compression > ratio of my files (about 7x, so let's round up to 8). So I set > maxPartitionBytes=16MB. At this point I see that spark is reading in from > the file in 12-28 MB chunks. Now it makes 14316 partitions on the initial > file read and completes with no spillage. > !image-2022-08-10-16-59-59-778.png! > > Is there something I'm missing here? Is this just intended behavior? How can > I tune my partition size correctly for my application when I do not know how > much the data will be compressed ahead of time? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40037: Assignee: Apache Spark > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Major > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40037: Assignee: (was: Apache Spark) > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40038) spark.sql.files.maxPartitionBytes does not observe on-disk compression
RJ Marcus created SPARK-40038: - Summary: spark.sql.files.maxPartitionBytes does not observe on-disk compression Key: SPARK-40038 URL: https://issues.apache.org/jira/browse/SPARK-40038 Project: Spark Issue Type: Question Components: Input/Output, Optimizer, PySpark, SQL Affects Versions: 3.2.0 Environment: files: - ORC with snappy compression - 232 GB files on disk - 1800 files on disk (pretty sure no individual file is over 200MB) - 9 partitions on disk cluster: - EMR 6.6.0 (spark 3.2.0) - cluster: 288 vCPU (executors), 1.1TB memory (executors) OS info: LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch Distributor ID: Amazon Description: Amazon Linux release 2 (Karoo) Release: 2 Codename: Karoo Reporter: RJ Marcus Why does `spark.sql.files.maxPartitionBytes` estimate the number of partitions based on {_}file size on disk instead of the uncompressed file size{_}? For example I have a dataset that is 213GB on disk. When I read this in to my application I get 2050 partitions based on the default value of 128MB for maxPartitionBytes. My application is a simple broadcast index join that adds 1 column to the dataframe and writes it out. There is no shuffle. Initially the size of input /output records seem ok, but I still get a large amount of memory "spill" on the executors. I believe this is due to the data being highly compressed and each partition becoming too big when it is deserialized to work on in memory. !image-2022-08-10-16-59-05-233.png! (If I try to do a repartition immediately after reading I still see the first stage spilling memory to disk, so that is not the right solution or what I'm interested in.) Instead, I attempt to lower maxPartitionBytes by the (average) compression ratio of my files (about 7x, so let's round up to 8). So I set maxPartitionBytes=16MB. At this point I see that spark is reading in from the file in 12-28 MB chunks. Now it makes 14316 partitions on the initial file read and completes with no spillage. !image-2022-08-10-16-59-59-778.png! Is there something I'm missing here? Is this just intended behavior? How can I tune my partition size correctly for my application when I do not know how much the data will be compressed ahead of time? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
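For readers following SPARK-40038, the workaround the reporter describes is simply shrinking `spark.sql.files.maxPartitionBytes` by an estimated on-disk compression ratio so that a partition still fits in memory once decompressed. The sketch below shows only that workaround; the 8x ratio, input path, and session setup are made-up values for illustration, not data from the ticket.

{code:scala}
// Hedged sketch of the tuning described above: divide the default 128 MB split target
// by an *estimated* compression ratio. Ratio and path are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object MaxPartitionBytesTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("maxPartitionBytes-tuning-sketch")
      .getOrCreate()

    val defaultTargetBytes = 128L * 1024 * 1024  // Spark's default maxPartitionBytes
    val estimatedCompressionRatio = 8L           // guessed for snappy-compressed ORC
    val tunedTargetBytes = defaultTargetBytes / estimatedCompressionRatio

    // Must be set before the read that should honor it.
    spark.conf.set("spark.sql.files.maxPartitionBytes", tunedTargetBytes.toString)

    val df = spark.read.orc("s3://some-bucket/some-orc-dataset/")  // hypothetical input
    println(s"partitions after read: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
{code}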
[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-40025: -- Description: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency was: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Please reference full blog post for this project: [https://www.databricks.com/blog/2022/06/28/project-lightspeed-faster-and-simpler-stream-processing-with-apache-spark.html] Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency > Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark > -- > > Key: SPARK-40025 > URL: https://issues.apache.org/jira/browse/SPARK-40025 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.2.2 >Reporter: Boyang Jerry Peng >Priority: Major > > Project Lightspeed is an umbrella project aimed at improving a couple of key > aspects of Spark Streaming: > * Improving the latency and ensuring it is predictable > * Enhancing functionality for processing data with new operators and APIs > > Umbrella Jira to track all tickets under Project Lightspeed > SPARK-39585 - Multiple Stateful Operators in Structured Streaming > SPARK-39586 - Advanced Windowing in Structured Streaming > SPARK-39587 - Schema Evolution for Stateful Pipelines > SPARK-39589 - Asynchronous I/O support > SPARK-39590 - Python API for Arbitrary Stateful Processing > SPARK-39591 - Offset Management Improvements > SPARK-39592 - Asynchronous State Checkpointing > SPARK-39593 - Configurable State Checkpointing Frequency -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-40025: -- Description: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Please reference full blog post for this project: [https://www.databricks.com/blog/2022/06/28/project-lightspeed-faster-and-simpler-stream-processing-with-apache-spark.html] Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency was: Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency > Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark > -- > > Key: SPARK-40025 > URL: https://issues.apache.org/jira/browse/SPARK-40025 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.2.2 >Reporter: Boyang Jerry Peng >Priority: Major > > Project Lightspeed is an umbrella project aimed at improving a couple of key > aspects of Spark Streaming: > * Improving the latency and ensuring it is predictable > * Enhancing functionality for processing data with new operators and APIs > > Please reference full blog post for this project: > [https://www.databricks.com/blog/2022/06/28/project-lightspeed-faster-and-simpler-stream-processing-with-apache-spark.html] > > > Umbrella Jira to track all tickets under Project Lightspeed > SPARK-39585 - Multiple Stateful Operators in Structured Streaming > SPARK-39586 - Advanced Windowing in Structured Streaming > SPARK-39587 - Schema Evolution for Stateful Pipelines > SPARK-39589 - Asynchronous I/O support > SPARK-39590 - Python API for Arbitrary Stateful Processing > SPARK-39591 - Offset Management Improvements > SPARK-39592 - Asynchronous State Checkpointing > SPARK-39593 - Configurable State Checkpointing Frequency -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-40025: -- Summary: Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark (was: Project Lightspeed (Spark Streaming Improvements)) > Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark > -- > > Key: SPARK-40025 > URL: https://issues.apache.org/jira/browse/SPARK-40025 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.2.2 >Reporter: Boyang Jerry Peng >Priority: Major > > Umbrella Jira to track all tickets under Project Lightspeed > SPARK-39585 - Multiple Stateful Operators in Structured Streaming > SPARK-39586 - Advanced Windowing in Structured Streaming > SPARK-39587 - Schema Evolution for Stateful Pipelines > SPARK-39589 - Asynchronous I/O support > SPARK-39590 - Python API for Arbitrary Stateful Processing > SPARK-39591 - Offset Management Improvements > SPARK-39592 - Asynchronous State Checkpointing > SPARK-39593 - Configurable State Checkpointing Frequency -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39743. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37416 [https://github.com/apache/spark/pull/37416] > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Assignee: zhiming she >Priority: Minor > Fix For: 3.4.0 > > > While writing zstd-compressed parquet files, the setting > `spark.io.compression.zstd.level` does not have any effect on > the zstd compression level. > All files seem to be written with the default zstd compression level, and the > config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level > for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-39743: - Assignee: zhiming she > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Assignee: zhiming she >Priority: Minor > > While writing zstd-compressed parquet files, the setting > `spark.io.compression.zstd.level` does not have any effect on > the zstd compression level. > All files seem to be written with the default zstd compression level, and the > config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level > for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
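A note on SPARK-39743 for anyone hitting the same symptom: `spark.io.compression.zstd.level` only tunes Spark's internal I/O compression (shuffle, broadcast, event logs), not the Parquet writer. The sketch below assumes parquet-mr's Hadoop setting `parquet.compression.codec.zstd.level` is honored by the zstd Parquet codec in the deployed Parquet version; that key and the output path are assumptions, and this is not the code from the linked pull request.

{code:scala}
// Hedged sketch: pick zstd as the Parquet codec and ask parquet-mr for a higher level
// through "parquet.compression.codec.zstd.level" (assumed, not verified against the fix).
import org.apache.spark.sql.SparkSession

object ZstdParquetLevelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("zstd-level-sketch").getOrCreate()

    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
    spark.sparkContext.hadoopConfiguration.setInt("parquet.compression.codec.zstd.level", 9)

    val df = spark.range(0, 1000000L).selectExpr("id", "id % 100 as bucket")
    df.write.mode("overwrite").parquet("/tmp/zstd-level-sketch")  // hypothetical output path

    spark.stop()
  }
}
{code}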
[jira] [Commented] (SPARK-39887) Expression transform error
[ https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578094#comment-17578094 ] Apache Spark commented on SPARK-39887: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/37472 > Expression transform error > -- > > Key: SPARK-39887 > URL: https://issues.apache.org/jira/browse/SPARK-39887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.2.2 >Reporter: zhuml >Priority: Major > > {code:java} > spark.sql( > """ > |select to_date(a) a, to_date(b) b from > |(select a, a as b from > |(select to_date(a) a from > | values ('2020-02-01') as t1(a) > | group by to_date(a)) t3 > |union all > |select a, b from > |(select to_date(a) a, to_date(b) b from > |values ('2020-01-01','2020-01-02') as t1(a, b) > | group by to_date(a), to_date(b)) t4) t5 > |group by to_date(a), to_date(b) > |""".stripMargin).show(){code} > result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01) > expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38503) Add warn for getAdditionalPreKubernetesResources in executor side
[ https://issues.apache.org/jira/browse/SPARK-38503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38503. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35786 [https://github.com/apache/spark/pull/35786] > Add warn for getAdditionalPreKubernetesResources in executor side > - > > Key: SPARK-38503 > URL: https://issues.apache.org/jira/browse/SPARK-38503 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38503) Add warn for getAdditionalPreKubernetesResources in executor side
[ https://issues.apache.org/jira/browse/SPARK-38503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38503: - Assignee: Yikun Jiang > Add warn for getAdditionalPreKubernetesResources in executor side > - > > Key: SPARK-38503 > URL: https://issues.apache.org/jira/browse/SPARK-38503 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-40037: Description: [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] [Info at SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] [Info at SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] [releases log|https://github.com/google/tink/releases/tag/v1.7.0] was: [https://www.cve.org/CVERecord?id=CVE-2022-25647 | ] [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-40037: Description: [https://www.cve.org/CVERecord?id=CVE-2022-25647 | ] [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] was: [https://www.cve.org/CVERecord?id=CVE-2022-25647 | CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [https://www.cve.org/CVERecord?id=CVE-2022-25647 | ] > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at > SNYK] > [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at > SNYK] > [https://github.com/google/tink/releases/tag/v1.7.0|releases log] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-40037: Description: [https://www.cve.org/CVERecord?id=CVE-2022-25647 | CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] was: [https://www.cve.org/CVERecord?id=CVE-2022-25647| CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [https://www.cve.org/CVERecord?id=CVE-2022-25647 | CVE-2022-25647] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at > SNYK] > [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at > SNYK] > [https://github.com/google/tink/releases/tag/v1.7.0|releases log] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-40037: Description: [https://www.cve.org/CVERecord?id=CVE-2022-25647| CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] was: [https://www.cve.org/CVERecord?id=CVE-2022-25647|CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [https://www.cve.org/CVERecord?id=CVE-2022-25647| CVE-2022-25647] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at > SNYK] > [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at > SNYK] > [https://github.com/google/tink/releases/tag/v1.7.0|releases log] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
Bjørn Jørgensen created SPARK-40037: --- Summary: Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 Key: SPARK-40037 URL: https://issues.apache.org/jira/browse/SPARK-40037 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 3.4.0 Reporter: Bjørn Jørgensen [https://www.cve.org/CVERecord?id=CVE-2022-25647|CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40036) LevelDB/RocksDBIterator.next should return false after iterator or db close
[ https://issues.apache.org/jira/browse/SPARK-40036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40036: Assignee: (was: Apache Spark) > LevelDB/RocksDBIterator.next should return false after iterator or db close > --- > > Key: SPARK-40036 > URL: https://issues.apache.org/jira/browse/SPARK-40036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > @Test > public void testHasNextAndNextAfterIteratorClose() throws Exception { > String prefix = "test_db_iter_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one records for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close iter > iter.close(); > // iter.hasNext should be false after iter close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after iter close > assertThrows(NoSuchElementException.class, iter::next); > db.close(); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } > @Test > public void testHasNextAndNextAfterDBClose() throws Exception { > String prefix = "test_db_db_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one record for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close db > db.close(); > // iter.hasNext should be false after db close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after db close > assertThrows(NoSuchElementException.class, iter::next); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } {code} > > For the above two cases, when iterator/db is closed, `hasNext` will return > true, and `next` will return the value not obtained before close. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40036) LevelDB/RocksDBIterator.next should return false after iterator or db close
[ https://issues.apache.org/jira/browse/SPARK-40036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40036: Assignee: Apache Spark > LevelDB/RocksDBIterator.next should return false after iterator or db close > --- > > Key: SPARK-40036 > URL: https://issues.apache.org/jira/browse/SPARK-40036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > {code:java} > @Test > public void testHasNextAndNextAfterIteratorClose() throws Exception { > String prefix = "test_db_iter_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one records for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close iter > iter.close(); > // iter.hasNext should be false after iter close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after iter close > assertThrows(NoSuchElementException.class, iter::next); > db.close(); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } > @Test > public void testHasNextAndNextAfterDBClose() throws Exception { > String prefix = "test_db_db_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one record for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close db > db.close(); > // iter.hasNext should be false after db close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after db close > assertThrows(NoSuchElementException.class, iter::next); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } {code} > > For the above two cases, when iterator/db is closed, `hasNext` will return > true, and `next` will return the value not obtained before close. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40036) LevelDB/RocksDBIterator.next should return false after iterator or db close
[ https://issues.apache.org/jira/browse/SPARK-40036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578066#comment-17578066 ] Apache Spark commented on SPARK-40036: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37471 > LevelDB/RocksDBIterator.next should return false after iterator or db close > --- > > Key: SPARK-40036 > URL: https://issues.apache.org/jira/browse/SPARK-40036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > @Test > public void testHasNextAndNextAfterIteratorClose() throws Exception { > String prefix = "test_db_iter_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one records for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close iter > iter.close(); > // iter.hasNext should be false after iter close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after iter close > assertThrows(NoSuchElementException.class, iter::next); > db.close(); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } > @Test > public void testHasNextAndNextAfterDBClose() throws Exception { > String prefix = "test_db_db_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one record for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close db > db.close(); > // iter.hasNext should be false after db close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after db close > assertThrows(NoSuchElementException.class, iter::next); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } {code} > > For the above two cases, when iterator/db is closed, `hasNext` will return > true, and `next` will return the value not obtained before close. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40036) LevelDB/RocksDBIterator.next should return false after iterator or db close
Yang Jie created SPARK-40036: Summary: LevelDB/RocksDBIterator.next should return false after iterator or db close Key: SPARK-40036 URL: https://issues.apache.org/jira/browse/SPARK-40036 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Yang Jie {code:java} @Test public void testHasNextAndNextAfterIteratorClose() throws Exception { String prefix = "test_db_iter_close."; String suffix = ".ldb"; File path = File.createTempFile(prefix, suffix); path.delete(); LevelDB db = new LevelDB(path); // Write one records for test db.write(createCustomType1(0)); KVStoreIterator iter = db.view(CustomType1.class).closeableIterator(); // iter should be true assertTrue(iter.hasNext()); // close iter iter.close(); // iter.hasNext should be false after iter close assertFalse(iter.hasNext()); // iter.next should throw NoSuchElementException after iter close assertThrows(NoSuchElementException.class, iter::next); db.close(); assertTrue(path.exists()); FileUtils.deleteQuietly(path); assertFalse(path.exists()); } @Test public void testHasNextAndNextAfterDBClose() throws Exception { String prefix = "test_db_db_close."; String suffix = ".ldb"; File path = File.createTempFile(prefix, suffix); path.delete(); LevelDB db = new LevelDB(path); // Write one record for test db.write(createCustomType1(0)); KVStoreIterator iter = db.view(CustomType1.class).closeableIterator(); // iter should be true assertTrue(iter.hasNext()); // close db db.close(); // iter.hasNext should be false after db close assertFalse(iter.hasNext()); // iter.next should throw NoSuchElementException after db close assertThrows(NoSuchElementException.class, iter::next); assertTrue(path.exists()); FileUtils.deleteQuietly(path); assertFalse(path.exists()); } {code} For the above two cases, when iterator/db is closed, `hasNext` will return true, and `next` will return the value not obtained before close. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38899) DS V2 supports push down datetime functions
[ https://issues.apache.org/jira/browse/SPARK-38899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578052#comment-17578052 ] Apache Spark commented on SPARK-38899: -- User 'chenzhx' has created a pull request for this issue: https://github.com/apache/spark/pull/37469 > DS V2 supports push down datetime functions > --- > > Key: SPARK-38899 > URL: https://issues.apache.org/jira/browse/SPARK-38899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Zhixiong Chen >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38899) DS V2 supports push down datetime functions
[ https://issues.apache.org/jira/browse/SPARK-38899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578051#comment-17578051 ] Apache Spark commented on SPARK-38899: -- User 'chenzhx' has created a pull request for this issue: https://github.com/apache/spark/pull/37469 > DS V2 supports push down datetime functions > --- > > Key: SPARK-38899 > URL: https://issues.apache.org/jira/browse/SPARK-38899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Zhixiong Chen >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40034) PathOutputCommitters to work with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-40034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40034: Assignee: (was: Apache Spark) > PathOutputCommitters to work with dynamic partition overwrite > - > > Key: SPARK-40034 > URL: https://issues.apache.org/jira/browse/SPARK-40034 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Steve Loughran >Priority: Minor > > sibling of MAPREDUCE-7403: allow PathOutputCommitter implementation to > declare that they support the semantics required by spark dynamic > partitioning: > * rename to work as expected > * working dir to be on same fs as final dir > They will do this through implementing StreamCapabilities and adding a new > probe, "mapreduce.job.committer.dynamic.partitioning" ; the spark side > changes are to > * postpone rejection of dynamic partition overwrite until the output > committer is created > * allow it if the committer implements StreamCapabilities and returns true > for {{hasCapability("mapreduce.job.committer.dynamic.partitioning"))) > this isn't going to be supported by the s3a committers, they don't meet the > requirements. The manifest committer of MAPREDUCE-7341 running against abfs > and gcs does work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40034) PathOutputCommitters to work with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-40034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578014#comment-17578014 ] Apache Spark commented on SPARK-40034: -- User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/37468 > PathOutputCommitters to work with dynamic partition overwrite > - > > Key: SPARK-40034 > URL: https://issues.apache.org/jira/browse/SPARK-40034 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Steve Loughran >Priority: Minor > > sibling of MAPREDUCE-7403: allow PathOutputCommitter implementation to > declare that they support the semantics required by spark dynamic > partitioning: > * rename to work as expected > * working dir to be on same fs as final dir > They will do this through implementing StreamCapabilities and adding a new > probe, "mapreduce.job.committer.dynamic.partitioning" ; the spark side > changes are to > * postpone rejection of dynamic partition overwrite until the output > committer is created > * allow it if the committer implements StreamCapabilities and returns true > for {{hasCapability("mapreduce.job.committer.dynamic.partitioning"))) > this isn't going to be supported by the s3a committers, they don't meet the > requirements. The manifest committer of MAPREDUCE-7341 running against abfs > and gcs does work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40034) PathOutputCommitters to work with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-40034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40034: Assignee: Apache Spark > PathOutputCommitters to work with dynamic partition overwrite > - > > Key: SPARK-40034 > URL: https://issues.apache.org/jira/browse/SPARK-40034 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Minor > > sibling of MAPREDUCE-7403: allow PathOutputCommitter implementation to > declare that they support the semantics required by spark dynamic > partitioning: > * rename to work as expected > * working dir to be on same fs as final dir > They will do this through implementing StreamCapabilities and adding a new > probe, "mapreduce.job.committer.dynamic.partitioning" ; the spark side > changes are to > * postpone rejection of dynamic partition overwrite until the output > committer is created > * allow it if the committer implements StreamCapabilities and returns true > for {{hasCapability("mapreduce.job.committer.dynamic.partitioning"))) > this isn't going to be supported by the s3a committers, they don't meet the > requirements. The manifest committer of MAPREDUCE-7341 running against abfs > and gcs does work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
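To make the SPARK-40034 proposal concrete, the sketch below shows the shape of the capability probe the ticket describes: a committer opts in to dynamic partition overwrite by implementing Hadoop's StreamCapabilities and answering true for the new key. The helper and object names are hypothetical, and this is not the code from the linked pull request; only the capability string comes from the ticket text.

{code:scala}
// Illustrative sketch of the probe described in SPARK-40034 / MAPREDUCE-7403.
import org.apache.hadoop.fs.StreamCapabilities
import org.apache.hadoop.mapreduce.OutputCommitter

object DynamicPartitioningSupport {
  val DynamicPartitioningCapability = "mapreduce.job.committer.dynamic.partitioning"

  // Decide, once the committer has been instantiated, whether dynamic partition
  // overwrite may be allowed instead of being rejected up front.
  def committerSupportsDynamicPartitioning(committer: OutputCommitter): Boolean =
    committer match {
      case c: StreamCapabilities => c.hasCapability(DynamicPartitioningCapability)
      case _ => false
    }
}
{code}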
[jira] [Assigned] (SPARK-40035) Avoid apply filter twice when listing files
[ https://issues.apache.org/jira/browse/SPARK-40035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40035: Assignee: (was: Apache Spark) > Avoid apply filter twice when listing files > --- > > Key: SPARK-40035 > URL: https://issues.apache.org/jira/browse/SPARK-40035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: EdisonWang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40035) Avoid apply filter twice when listing files
[ https://issues.apache.org/jira/browse/SPARK-40035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578008#comment-17578008 ] Apache Spark commented on SPARK-40035: -- User 'WangGuangxin' has created a pull request for this issue: https://github.com/apache/spark/pull/37467 > Avoid apply filter twice when listing files > --- > > Key: SPARK-40035 > URL: https://issues.apache.org/jira/browse/SPARK-40035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: EdisonWang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40035) Avoid apply filter twice when listing files
[ https://issues.apache.org/jira/browse/SPARK-40035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578009#comment-17578009 ] Apache Spark commented on SPARK-40035: -- User 'WangGuangxin' has created a pull request for this issue: https://github.com/apache/spark/pull/37467 > Avoid apply filter twice when listing files > --- > > Key: SPARK-40035 > URL: https://issues.apache.org/jira/browse/SPARK-40035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: EdisonWang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40035) Avoid apply filter twice when listing files
[ https://issues.apache.org/jira/browse/SPARK-40035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40035: Assignee: Apache Spark > Avoid apply filter twice when listing files > --- > > Key: SPARK-40035 > URL: https://issues.apache.org/jira/browse/SPARK-40035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: EdisonWang >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38910) Clean sparkStaging dir should before unregister()
[ https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38910: -- Fix Version/s: 3.4.0 > Clean sparkStaging dir should before unregister() > - > > Key: SPARK-38910 > URL: https://issues.apache.org/jira/browse/SPARK-38910 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.1, 3.3.0 >Reporter: angerszhu >Priority: Minor > Fix For: 3.4.0 > > > {code:java} > ShutdownHookManager.addShutdownHook(priority) { () => > try { > val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf) > val isLastAttempt = appAttemptId.getAttemptId() >= maxAppAttempts > if (!finished) { > // The default state of ApplicationMaster is failed if it is > invoked by shut down hook. > // This behavior is different compared to 1.x version. > // If user application is exited ahead of time by calling > System.exit(N), here mark > // this application as failed with EXIT_EARLY. For a good > shutdown, user shouldn't call > // System.exit(0) to terminate the application. > finish(finalStatus, > ApplicationMaster.EXIT_EARLY, > "Shutdown hook called before final status was reported.") > } > if (!unregistered) { > // we only want to unregister if we don't want the RM to retry > if (finalStatus == FinalApplicationStatus.SUCCEEDED || > isLastAttempt) { > unregister(finalStatus, finalMsg) > cleanupStagingDir(new > Path(System.getenv("SPARK_YARN_STAGING_DIR"))) > } > } > } catch { > case e: Throwable => > logWarning("Ignoring Exception while stopping ApplicationMaster > from shutdown hook", e) > } > }{code} > unregister may throw an exception, so the staging dir should be cleaned before unregister. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
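A self-contained Scala sketch of the ordering this issue proposes. The helper bodies below are stand-ins for the real ApplicationMaster internals, chosen only to show why cleaning the staging dir first keeps a failing unregister() from skipping the cleanup.

{code:scala}
object ShutdownOrderingSketch {
  // Stand-ins for the real ApplicationMaster methods (assumptions for illustration).
  private def cleanupStagingDir(): Unit = println("sparkStaging dir deleted")
  private def unregister(): Unit = throw new RuntimeException("RM unreachable")

  def main(args: Array[String]): Unit = {
    try {
      // Clean first: even if unregister() throws, the staging dir is already gone.
      cleanupStagingDir()
      unregister()
    } catch {
      case e: Throwable =>
        println(s"Ignoring exception while stopping ApplicationMaster: ${e.getMessage}")
    }
  }
}
{code}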
[jira] [Created] (SPARK-40035) Avoid apply filter twice when listing files
EdisonWang created SPARK-40035: -- Summary: Avoid apply filter twice when listing files Key: SPARK-40035 URL: https://issues.apache.org/jira/browse/SPARK-40035 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: EdisonWang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38910) Clean sparkStaging dir should before unregister()
[ https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-38910. --- Resolution: Fixed > Clean sparkStaging dir should before unregister() > - > > Key: SPARK-38910 > URL: https://issues.apache.org/jira/browse/SPARK-38910 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.1, 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.4.0 > > > {code:java} > ShutdownHookManager.addShutdownHook(priority) { () => > try { > val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf) > val isLastAttempt = appAttemptId.getAttemptId() >= maxAppAttempts > if (!finished) { > // The default state of ApplicationMaster is failed if it is > invoked by shut down hook. > // This behavior is different compared to 1.x version. > // If user application is exited ahead of time by calling > System.exit(N), here mark > // this application as failed with EXIT_EARLY. For a good > shutdown, user shouldn't call > // System.exit(0) to terminate the application. > finish(finalStatus, > ApplicationMaster.EXIT_EARLY, > "Shutdown hook called before final status was reported.") > } > if (!unregistered) { > // we only want to unregister if we don't want the RM to retry > if (finalStatus == FinalApplicationStatus.SUCCEEDED || > isLastAttempt) { > unregister(finalStatus, finalMsg) > cleanupStagingDir(new > Path(System.getenv("SPARK_YARN_STAGING_DIR"))) > } > } > } catch { > case e: Throwable => > logWarning("Ignoring Exception while stopping ApplicationMaster > from shutdown hook", e) > } > }{code} > unregister may throw an exception, so the staging dir should be cleaned before unregister. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38910) Clean sparkStaging dir should before unregister()
[ https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-38910: - Assignee: angerszhu > Clean sparkStaging dir should before unregister() > - > > Key: SPARK-38910 > URL: https://issues.apache.org/jira/browse/SPARK-38910 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.1, 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.4.0 > > > {code:java} > ShutdownHookManager.addShutdownHook(priority) { () => > try { > val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf) > val isLastAttempt = appAttemptId.getAttemptId() >= maxAppAttempts > if (!finished) { > // The default state of ApplicationMaster is failed if it is > invoked by shut down hook. > // This behavior is different compared to 1.x version. > // If user application is exited ahead of time by calling > System.exit(N), here mark > // this application as failed with EXIT_EARLY. For a good > shutdown, user shouldn't call > // System.exit(0) to terminate the application. > finish(finalStatus, > ApplicationMaster.EXIT_EARLY, > "Shutdown hook called before final status was reported.") > } > if (!unregistered) { > // we only want to unregister if we don't want the RM to retry > if (finalStatus == FinalApplicationStatus.SUCCEEDED || > isLastAttempt) { > unregister(finalStatus, finalMsg) > cleanupStagingDir(new > Path(System.getenv("SPARK_YARN_STAGING_DIR"))) > } > } > } catch { > case e: Throwable => > logWarning("Ignoring Exception while stopping ApplicationMaster > from shutdown hook", e) > } > }{code} > unregister may throw an exception, so the staging dir should be cleaned before unregister. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40034) PathOutputCommitters to work with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-40034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-40034: --- Summary: PathOutputCommitters to work with dynamic partition overwrite (was: PathOutputCommitters to work with dynamic partition overwrite -if they support it) > PathOutputCommitters to work with dynamic partition overwrite > - > > Key: SPARK-40034 > URL: https://issues.apache.org/jira/browse/SPARK-40034 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Steve Loughran >Priority: Minor > > Sibling of MAPREDUCE-7403: allow PathOutputCommitter implementations to > declare that they support the semantics required by Spark dynamic > partitioning: > * rename to work as expected > * working dir to be on same fs as final dir > They will do this through implementing StreamCapabilities and adding a new > probe, "mapreduce.job.committer.dynamic.partitioning"; the Spark-side > changes are to > * postpone rejection of dynamic partition overwrite until the output > committer is created > * allow it if the committer implements StreamCapabilities and returns true > for {{hasCapability("mapreduce.job.committer.dynamic.partitioning")}} > This isn't going to be supported by the s3a committers, as they don't meet the > requirements. The manifest committer of MAPREDUCE-7341 running against abfs > and gcs does work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40034) PathOutputCommitters to work with dynamic partition overwrite -if they support it
Steve Loughran created SPARK-40034: -- Summary: PathOutputCommitters to work with dynamic partition overwrite -if they support it Key: SPARK-40034 URL: https://issues.apache.org/jira/browse/SPARK-40034 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 3.4.0 Reporter: Steve Loughran Sibling of MAPREDUCE-7403: allow PathOutputCommitter implementations to declare that they support the semantics required by Spark dynamic partitioning: * rename to work as expected * working dir to be on same fs as final dir They will do this through implementing StreamCapabilities and adding a new probe, "mapreduce.job.committer.dynamic.partitioning"; the Spark-side changes are to * postpone rejection of dynamic partition overwrite until the output committer is created * allow it if the committer implements StreamCapabilities and returns true for {{hasCapability("mapreduce.job.committer.dynamic.partitioning")}} This isn't going to be supported by the s3a committers, as they don't meet the requirements. The manifest committer of MAPREDUCE-7341 running against abfs and gcs does work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39734) Add call_udf to pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-39734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-39734: - Assignee: Ruifeng Zheng (was: Andrew Ray) > Add call_udf to pyspark.sql.functions > - > > Key: SPARK-39734 > URL: https://issues.apache.org/jira/browse/SPARK-39734 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > > Add the call_udf function to PySpark for parity with the Scala/Java function > org.apache.spark.sql.functions#call_udf -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
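For context, a small sketch of how the existing Scala API is used today; the UDF name and data are made up for illustration, and the new PySpark function is expected to mirror this shape rather than being shown here.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.call_udf

object CallUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("call_udf sketch").getOrCreate()
    import spark.implicits._

    // Register a UDF by name, then invoke it by that name via call_udf.
    spark.udf.register("double_it", (x: Int) => x * 2)
    Seq(1, 2, 3).toDF("n").select(call_udf("double_it", $"n").as("doubled")).show()

    spark.stop()
  }
}
{code}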
[jira] [Assigned] (SPARK-36259) Expose localtimestamp in pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-36259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-36259: - Assignee: Ruifeng Zheng > Expose localtimestamp in pyspark.sql.functions > -- > > Key: SPARK-36259 > URL: https://issues.apache.org/jira/browse/SPARK-36259 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > > localtimestamp is available in the scala sql functions, but currently not in > pyspark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39733) Add map_contains_key to pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-39733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-39733: - Assignee: Ruifeng Zheng > Add map_contains_key to pyspark.sql.functions > - > > Key: SPARK-39733 > URL: https://issues.apache.org/jira/browse/SPARK-39733 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Ruifeng Zheng >Priority: Minor > > SPARK-37584 added the function map_contains_key to SQL and Scala/Java > functions. This JIRA is to track its addition to the PySpark function set for > parity. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
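As a reference point, a minimal sketch of the Scala side added by SPARK-37584, which the PySpark addition is meant to match; the literal map used here is purely illustrative.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, map, map_contains_key}

object MapContainsKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("map_contains_key sketch").getOrCreate()

    // Returns true when the map column contains the given key.
    spark.range(1)
      .select(map_contains_key(map(lit("a"), lit(1)), "a").as("has_a"))
      .show()

    spark.stop()
  }
}
{code}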
[jira] [Resolved] (SPARK-37348) PySpark pmod function
[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-37348. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37449 [https://github.com/apache/spark/pull/37449] > PySpark pmod function > - > > Key: SPARK-37348 > URL: https://issues.apache.org/jira/browse/SPARK-37348 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Tim Schwab >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > > Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns > -1. However, the modulus is often desired instead of the remainder. > > There is a [PMOD() function in Spark > SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in > PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. > So at the moment, the two options for getting the modulus are to use > F.expr("pmod(A, B)"), or create a helper function such as: > > {code:java} > def pmod(dividend, divisor): > return F.when(dividend < 0, (dividend % divisor) + > divisor).otherwise(dividend % divisor){code} > > > Neither is optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
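A tiny Scala illustration of the remainder-versus-modulus distinction behind this request; it is plain arithmetic, independent of the Spark APIs involved.

{code:scala}
object PmodSketch {
  // Positive modulus built from the JVM remainder operator.
  def pmod(a: Long, b: Long): Long = ((a % b) + b) % b

  def main(args: Array[String]): Unit = {
    println(-1L % 2L)      // -1: the remainder semantics that % currently exposes
    println(pmod(-1L, 2L)) //  1: the modulus that SQL's PMOD() returns
  }
}
{code}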
[jira] [Assigned] (SPARK-37348) PySpark pmod function
[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-37348: - Assignee: Ruifeng Zheng > PySpark pmod function > - > > Key: SPARK-37348 > URL: https://issues.apache.org/jira/browse/SPARK-37348 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Tim Schwab >Assignee: Ruifeng Zheng >Priority: Minor > > Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns > -1. However, the modulus is often desired instead of the remainder. > > There is a [PMOD() function in Spark > SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in > PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. > So at the moment, the two options for getting the modulus are to use > F.expr("pmod(A, B)"), or create a helper function such as: > > {code:java} > def pmod(dividend, divisor): > return F.when(dividend < 0, (dividend % divisor) + > divisor).otherwise(dividend % divisor){code} > > > Neither is optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39733) Add map_contains_key to pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-39733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-39733. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37449 [https://github.com/apache/spark/pull/37449] > Add map_contains_key to pyspark.sql.functions > - > > Key: SPARK-39733 > URL: https://issues.apache.org/jira/browse/SPARK-39733 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > > SPARK-37584 added the function map_contains_key to SQL and Scala/Java > functions. This JIRA is to track its addition to the PySpark function set for > parity. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36259) Expose localtimestamp in pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-36259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-36259. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37449 [https://github.com/apache/spark/pull/37449] > Expose localtimestamp in pyspark.sql.functions > -- > > Key: SPARK-36259 > URL: https://issues.apache.org/jira/browse/SPARK-36259 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > Fix For: 3.4.0 > > > localtimestamp is available in the scala sql functions, but currently not in > pyspark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39734) Add call_udf to pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-39734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-39734: - Assignee: Andrew Ray > Add call_udf to pyspark.sql.functions > - > > Key: SPARK-39734 > URL: https://issues.apache.org/jira/browse/SPARK-39734 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Andrew Ray >Priority: Minor > > Add the call_udf function to PySpark for parity with the Scala/Java function > org.apache.spark.sql.functions#call_udf -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39734) Add call_udf to pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-39734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-39734. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37449 [https://github.com/apache/spark/pull/37449] > Add call_udf to pyspark.sql.functions > - > > Key: SPARK-39734 > URL: https://issues.apache.org/jira/browse/SPARK-39734 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Andrew Ray >Priority: Minor > Fix For: 3.4.0 > > > Add the call_udf function to PySpark for parity with the Scala/Java function > org.apache.spark.sql.functions#call_udf -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40029) Make pyspark.sql.types examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577946#comment-17577946 ] Apache Spark commented on SPARK-40029: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37465 > Make pyspark.sql.types examples self-contained > -- > > Key: SPARK-40029 > URL: https://issues.apache.org/jira/browse/SPARK-40029 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40029) Make pyspark.sql.types examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577945#comment-17577945 ] Apache Spark commented on SPARK-40029: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37465 > Make pyspark.sql.types examples self-contained > -- > > Key: SPARK-40029 > URL: https://issues.apache.org/jira/browse/SPARK-40029 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40029) Make pyspark.sql.types examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40029: Assignee: Apache Spark > Make pyspark.sql.types examples self-contained > -- > > Key: SPARK-40029 > URL: https://issues.apache.org/jira/browse/SPARK-40029 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40029) Make pyspark.sql.types examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40029: Assignee: (was: Apache Spark) > Make pyspark.sql.types examples self-contained > -- > > Key: SPARK-40029 > URL: https://issues.apache.org/jira/browse/SPARK-40029 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40032) Support Decimal128 type
[ https://issues.apache.org/jira/browse/SPARK-40032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577942#comment-17577942 ] jiaan.geng commented on SPARK-40032: We are editing the design doc. > Support Decimal128 type > --- > > Key: SPARK-40032 > URL: https://issues.apache.org/jira/browse/SPARK-40032 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > Attachments: Performance comparison between decimal128 and spark > decimal benchmark.pdf > > > Spark SQL today supports the DECIMAL data type. The implementation of Decimal > can hold a BigDecimal or a Long. Decimal provides some operators like +, > -, *, / and so on. > Take + as an example; the implementation is shown below. > {code:java} > def + (that: Decimal): Decimal = { > if (decimalVal.eq(null) && that.decimalVal.eq(null) && scale == > that.scale) { > Decimal(longVal + that.longVal, Math.max(precision, that.precision) + > 1, scale) > } else { > Decimal(toBigDecimal.bigDecimal.add(that.toBigDecimal.bigDecimal)) > } > } > {code} > We can see there are two additions and a call to Decimal.apply. The add operator > of BigDecimal will construct a new BigDecimal instance. > The implementation of Decimal.apply will call new to construct a new Decimal > instance with the new BigDecimal instance. > As we know, the Decimal instance will hold the new BigDecimal instance. > If a large table has a Decimal field called 'colA, the execution of > SUM('colA) will involve the creation of a large number of Decimal instances > and BigDecimal instances. These Decimal instances and BigDecimal instances > cause garbage collection to occur frequently. > Decimal128 is a high-performance decimal type, about 8X more efficient than Java > BigDecimal for typical operations. It uses a finite (128 bit) precision and > can handle up to decimal(38, X). It is also "mutable" so you can change the > contents of an existing object. This helps reduce the cost of new() and > garbage collection. > We have generated a benchmark report comparing Spark Decimal, Java > BigDecimal and Decimal128. Please see the attachment. > In this new feature, we will introduce DECIMAL128 to accelerate decimal > calculation. > h3. Milestone 1 – Spark Decimal equivalency (The new Decimal type Decimal128 > meets or exceeds all functions of the existing SQL Decimal): > * Add a new DataType implementation for Decimal128. > * Support Decimal128 in Dataset/UDF. > * Decimal128 literals > * Decimal128 arithmetic (e.g. Decimal128 + Decimal128, Decimal128 - Decimal) > * Decimal or Math functions/operators: POWER, LOG, Round, etc > * Cast to and from Decimal128, cast String/Decimal to Decimal128, cast > Decimal128 to string (pretty printing)/Decimal, with the SQL syntax to > specify the types > * Support sorting Decimal128. > h3. Milestone 2 – Persistence: > * Ability to create tables of type Decimal128 > * Ability to write to common file formats such as Parquet and JSON. > * INSERT, SELECT, UPDATE, MERGE > * Discovery > h3. Milestone 3 – Client support > * JDBC support > * Hive Thrift server > h3. Milestone 4 – PySpark and Spark R integration > * Python UDF can take and return Decimal128 > * DataFrame support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
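To make the allocation argument tangible, here is a toy, self-contained Scala sketch. It is emphatically not the proposed Decimal128 design (which is 128-bit and far more involved); it only contrasts immutable BigDecimal accumulation, which allocates a new object per row, with a mutable fixed-point accumulator that does not.

{code:scala}
import java.math.BigDecimal

object MutableAccumulationSketch {
  // Toy contrast only; NOT the proposed Decimal128 implementation.

  // Immutable accumulation: every add() allocates a fresh BigDecimal.
  def sumImmutable(unscaledCents: Array[Long]): BigDecimal = {
    var acc = BigDecimal.ZERO
    for (v <- unscaledCents) {
      acc = acc.add(BigDecimal.valueOf(v, 2))
    }
    acc
  }

  // Mutable accumulation: the accumulator is updated in place, no per-row objects.
  final class MutableFixedPoint(var unscaled: Long, val scale: Int) {
    def addInPlace(otherUnscaled: Long): Unit = {
      unscaled = Math.addExact(unscaled, otherUnscaled)
    }
  }

  def sumMutable(unscaledCents: Array[Long]): MutableFixedPoint = {
    val acc = new MutableFixedPoint(0L, 2)
    unscaledCents.foreach(v => acc.addInPlace(v))
    acc
  }

  def main(args: Array[String]): Unit = {
    val cents = Array(123L, -45L, 678L) // 1.23, -0.45, 6.78
    println(sumImmutable(cents))                               // 7.56
    println(BigDecimal.valueOf(sumMutable(cents).unscaled, 2)) // 7.56
  }
}
{code}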
[jira] [Updated] (SPARK-40032) Support Decimal128 type
[ https://issues.apache.org/jira/browse/SPARK-40032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-40032: --- Attachment: Performance comparison between decimal128 and spark decimal benchmark.pdf > Support Decimal128 type > --- > > Key: SPARK-40032 > URL: https://issues.apache.org/jira/browse/SPARK-40032 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > Attachments: Performance comparison between decimal128 and spark > decimal benchmark.pdf > > > Spark SQL today supports the DECIMAL data type. The implementation of Decimal > can hold a BigDecimal or a Long. Decimal provides some operators like +, > -, *, / and so on. > Take + as an example; the implementation is shown below. > {code:java} > def + (that: Decimal): Decimal = { > if (decimalVal.eq(null) && that.decimalVal.eq(null) && scale == > that.scale) { > Decimal(longVal + that.longVal, Math.max(precision, that.precision) + > 1, scale) > } else { > Decimal(toBigDecimal.bigDecimal.add(that.toBigDecimal.bigDecimal)) > } > } > {code} > We can see there are two additions and a call to Decimal.apply. The add operator > of BigDecimal will construct a new BigDecimal instance. > The implementation of Decimal.apply will call new to construct a new Decimal > instance with the new BigDecimal instance. > As we know, the Decimal instance will hold the new BigDecimal instance. > If a large table has a Decimal field called 'colA, the execution of > SUM('colA) will involve the creation of a large number of Decimal instances > and BigDecimal instances. These Decimal instances and BigDecimal instances > cause garbage collection to occur frequently. > Decimal128 is a high-performance decimal type, about 8X more efficient than Java > BigDecimal for typical operations. It uses a finite (128 bit) precision and > can handle up to decimal(38, X). It is also "mutable" so you can change the > contents of an existing object. This helps reduce the cost of new() and > garbage collection. > We have generated a benchmark report comparing Spark Decimal, Java > BigDecimal and Decimal128. Please see the attachment. > In this new feature, we will introduce DECIMAL128 to accelerate decimal > calculation. > h3. Milestone 1 – Spark Decimal equivalency (The new Decimal type Decimal128 > meets or exceeds all functions of the existing SQL Decimal): > * Add a new DataType implementation for Decimal128. > * Support Decimal128 in Dataset/UDF. > * Decimal128 literals > * Decimal128 arithmetic (e.g. Decimal128 + Decimal128, Decimal128 - Decimal) > * Decimal or Math functions/operators: POWER, LOG, Round, etc > * Cast to and from Decimal128, cast String/Decimal to Decimal128, cast > Decimal128 to string (pretty printing)/Decimal, with the SQL syntax to > specify the types > * Support sorting Decimal128. > h3. Milestone 2 – Persistence: > * Ability to create tables of type Decimal128 > * Ability to write to common file formats such as Parquet and JSON. > * INSERT, SELECT, UPDATE, MERGE > * Discovery > h3. Milestone 3 – Client support > * JDBC support > * Hive Thrift server > h3. Milestone 4 – PySpark and Spark R integration > * Python UDF can take and return Decimal128 > * DataFrame support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40032) Support Decimal128 type
[ https://issues.apache.org/jira/browse/SPARK-40032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-40032: --- Attachment: (was: Performance comparison between decimal128 and spark decimal benchmark.pdf) > Support Decimal128 type > --- > > Key: SPARK-40032 > URL: https://issues.apache.org/jira/browse/SPARK-40032 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Spark SQL today supports the DECIMAL data type. The implementation of Decimal > can hold a BigDecimal or a Long. Decimal provides some operators like +, > -, *, / and so on. > Take + as an example; the implementation is shown below. > {code:java} > def + (that: Decimal): Decimal = { > if (decimalVal.eq(null) && that.decimalVal.eq(null) && scale == > that.scale) { > Decimal(longVal + that.longVal, Math.max(precision, that.precision) + > 1, scale) > } else { > Decimal(toBigDecimal.bigDecimal.add(that.toBigDecimal.bigDecimal)) > } > } > {code} > We can see there are two additions and a call to Decimal.apply. The add operator > of BigDecimal will construct a new BigDecimal instance. > The implementation of Decimal.apply will call new to construct a new Decimal > instance with the new BigDecimal instance. > As we know, the Decimal instance will hold the new BigDecimal instance. > If a large table has a Decimal field called 'colA, the execution of > SUM('colA) will involve the creation of a large number of Decimal instances > and BigDecimal instances. These Decimal instances and BigDecimal instances > cause garbage collection to occur frequently. > Decimal128 is a high-performance decimal type, about 8X more efficient than Java > BigDecimal for typical operations. It uses a finite (128 bit) precision and > can handle up to decimal(38, X). It is also "mutable" so you can change the > contents of an existing object. This helps reduce the cost of new() and > garbage collection. > We have generated a benchmark report comparing Spark Decimal, Java > BigDecimal and Decimal128. Please see the attachment. > In this new feature, we will introduce DECIMAL128 to accelerate decimal > calculation. > h3. Milestone 1 – Spark Decimal equivalency (The new Decimal type Decimal128 > meets or exceeds all functions of the existing SQL Decimal): > * Add a new DataType implementation for Decimal128. > * Support Decimal128 in Dataset/UDF. > * Decimal128 literals > * Decimal128 arithmetic (e.g. Decimal128 + Decimal128, Decimal128 - Decimal) > * Decimal or Math functions/operators: POWER, LOG, Round, etc > * Cast to and from Decimal128, cast String/Decimal to Decimal128, cast > Decimal128 to string (pretty printing)/Decimal, with the SQL syntax to > specify the types > * Support sorting Decimal128. > h3. Milestone 2 – Persistence: > * Ability to create tables of type Decimal128 > * Ability to write to common file formats such as Parquet and JSON. > * INSERT, SELECT, UPDATE, MERGE > * Discovery > h3. Milestone 3 – Client support > * JDBC support > * Hive Thrift server > h3. Milestone 4 – PySpark and Spark R integration > * Python UDF can take and return Decimal128 > * DataFrame support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org