[jira] [Resolved] (SPARK-40044) Incorrect target interval type in cast overflow errors
[ https://issues.apache.org/jira/browse/SPARK-40044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40044. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37470 [https://github.com/apache/spark/pull/37470] > Incorrect target interval type in cast overflow errors > -- > > Key: SPARK-40044 > URL: https://issues.apache.org/jira/browse/SPARK-40044 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > The example below shows the issue: > {code:sql} > spark-sql> select CAST(9223372036854775807L AS INTERVAL YEAR TO MONTH); > org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value > 9223372036854775807L of the type "BIGINT" cannot be cast to "INTERVAL MONTH" > due to an overflow. > {code} > The target type "INTERVAL MONTH" is incorrect. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40044) Incorrect target interval type in cast overflow errors
[ https://issues.apache.org/jira/browse/SPARK-40044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40044: Assignee: Apache Spark (was: Max Gekk) > Incorrect target interval type in cast overflow errors > -- > > Key: SPARK-40044 > URL: https://issues.apache.org/jira/browse/SPARK-40044 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > The example below shows the issue: > {code:sql} > spark-sql> select CAST(9223372036854775807L AS INTERVAL YEAR TO MONTH); > org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value > 9223372036854775807L of the type "BIGINT" cannot be cast to "INTERVAL MONTH" > due to an overflow. > {code} > The target type "INTERVAL MONTH" is incorrect. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40044) Incorrect target interval type in cast overflow errors
[ https://issues.apache.org/jira/browse/SPARK-40044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40044: Assignee: Max Gekk (was: Apache Spark) > Incorrect target interval type in cast overflow errors > -- > > Key: SPARK-40044 > URL: https://issues.apache.org/jira/browse/SPARK-40044 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below shows the issue: > {code:sql} > spark-sql> select CAST(9223372036854775807L AS INTERVAL YEAR TO MONTH); > org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value > 9223372036854775807L of the type "BIGINT" cannot be cast to "INTERVAL MONTH" > due to an overflow. > {code} > The target type "INTERVAL MONTH" is incorrect. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40044) Incorrect target interval type in cast overflow errors
[ https://issues.apache.org/jira/browse/SPARK-40044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578261#comment-17578261 ] Apache Spark commented on SPARK-40044: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37470 > Incorrect target interval type in cast overflow errors > -- > > Key: SPARK-40044 > URL: https://issues.apache.org/jira/browse/SPARK-40044 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below shows the issue: > {code:sql} > spark-sql> select CAST(9223372036854775807L AS INTERVAL YEAR TO MONTH); > org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value > 9223372036854775807L of the type "BIGINT" cannot be cast to "INTERVAL MONTH" > due to an overflow. > {code} > The target type "INTERVAL MONTH" is incorrect. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40044) Incorrect target interval type in cast overflow errors
Max Gekk created SPARK-40044: Summary: Incorrect target interval type in cast overflow errors Key: SPARK-40044 URL: https://issues.apache.org/jira/browse/SPARK-40044 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk The example below shows the issue: {code:sql} spark-sql> select CAST(9223372036854775807L AS INTERVAL YEAR TO MONTH); org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value 9223372036854775807L of the type "BIGINT" cannot be cast to "INTERVAL MONTH" due to an overflow. {code} The target type "INTERVAL MONTH" is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40041) Add Document Parameters for pyspark.sql.window
[ https://issues.apache.org/jira/browse/SPARK-40041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40041: Assignee: Qian Sun > Add Document Parameters for pyspark.sql.window > -- > > Key: SPARK-40041 > URL: https://issues.apache.org/jira/browse/SPARK-40041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Assignee: Qian Sun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40041) Add Document Parameters for pyspark.sql.window
[ https://issues.apache.org/jira/browse/SPARK-40041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40041. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37476 [https://github.com/apache/spark/pull/37476] > Add Document Parameters for pyspark.sql.window > -- > > Key: SPARK-40041 > URL: https://issues.apache.org/jira/browse/SPARK-40041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Assignee: Qian Sun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40020) centralize the code of qualifying identifiers in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-40020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40020: --- Assignee: Wenchen Fan > centralize the code of qualifying identifiers in SessionCatalog > --- > > Key: SPARK-40020 > URL: https://issues.apache.org/jira/browse/SPARK-40020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40020) centralize the code of qualifying identifiers in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-40020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40020. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37415 [https://github.com/apache/spark/pull/37415] > centralize the code of qualifying identifiers in SessionCatalog > --- > > Key: SPARK-40020 > URL: https://issues.apache.org/jira/browse/SPARK-40020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40026) spark-shell can not instantiate object whose class is defined in REPL dynamically
[ https://issues.apache.org/jira/browse/SPARK-40026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40026: - Priority: Major (was: Blocker) > spark-shell can not instantiate object whose class is defined in REPL > dynamically > - > > Key: SPARK-40026 > URL: https://issues.apache.org/jira/browse/SPARK-40026 > Project: Spark > Issue Type: Question > Components: Spark Shell >Affects Versions: 2.4.8, 3.0.3 > Environment: Spark2.3.x ~ Spark3.0.x > Scala2.11.x ~ Scala2.13.x > Java 8 >Reporter: Kernel Force >Priority: Major > > spark-shell throws {{NoSuchMethodException}} if I define a class in REPL and > then call {{newInstance}} via reflection. > {code:scala} > Spark context available as 'sc' (master = yarn, app id = > application_1656488084960_0162). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.3 > /_/ > > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_141) > Type in expressions to have them evaluated. > Type :help for more information. > scala> class Demo { > | def demo(s: String): Unit = println(s) > | } > defined class Demo > scala> classOf[Demo].newInstance().demo("OK") > java.lang.InstantiationException: Demo > at java.lang.Class.newInstance(Class.java:427) > ... 47 elided > Caused by: java.lang.NoSuchMethodException: Demo.() > at java.lang.Class.getConstructor0(Class.java:3082) > at java.lang.Class.newInstance(Class.java:412) > ... 47 more > {code} > But the same code works fine in native scala REPL: > {code:scala} > Welcome to Scala 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131). > Type in expressions for evaluation. Or try :help. > scala> class Demo { > | def demo(s: String): Unit = println(s) > | } > defined class Demo > scala> classOf[Demo].newInstance().demo("OK") > OK > {code} > What's the difference between spark-shell REPL and native scala REPL? > I guess the Demo class might be treated as inner class in spark-shell REPL. > But ... how to solve the problem? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40026) spark-shell can not instantiate object whose class is defined in REPL dynamically
[ https://issues.apache.org/jira/browse/SPARK-40026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40026: - Issue Type: Bug (was: Question) > spark-shell can not instantiate object whose class is defined in REPL > dynamically > - > > Key: SPARK-40026 > URL: https://issues.apache.org/jira/browse/SPARK-40026 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.4.8, 3.0.3 > Environment: Spark2.3.x ~ Spark3.0.x > Scala2.11.x ~ Scala2.13.x > Java 8 >Reporter: Kernel Force >Priority: Major > > spark-shell throws {{NoSuchMethodException}} if I define a class in REPL and > then call {{newInstance}} via reflection. > {code:scala} > Spark context available as 'sc' (master = yarn, app id = > application_1656488084960_0162). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.3 > /_/ > > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_141) > Type in expressions to have them evaluated. > Type :help for more information. > scala> class Demo { > | def demo(s: String): Unit = println(s) > | } > defined class Demo > scala> classOf[Demo].newInstance().demo("OK") > java.lang.InstantiationException: Demo > at java.lang.Class.newInstance(Class.java:427) > ... 47 elided > Caused by: java.lang.NoSuchMethodException: Demo.() > at java.lang.Class.getConstructor0(Class.java:3082) > at java.lang.Class.newInstance(Class.java:412) > ... 47 more > {code} > But the same code works fine in native scala REPL: > {code:scala} > Welcome to Scala 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131). > Type in expressions for evaluation. Or try :help. > scala> class Demo { > | def demo(s: String): Unit = println(s) > | } > defined class Demo > scala> classOf[Demo].newInstance().demo("OK") > OK > {code} > What's the difference between spark-shell REPL and native scala REPL? > I guess the Demo class might be treated as inner class in spark-shell REPL. > But ... how to solve the problem? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40043) Document DataStreamWriter.toTable and DataStreamReader.table
[ https://issues.apache.org/jira/browse/SPARK-40043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578234#comment-17578234 ] Apache Spark commented on SPARK-40043: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37477 > Document DataStreamWriter.toTable and DataStreamReader.table > > > Key: SPARK-40043 > URL: https://issues.apache.org/jira/browse/SPARK-40043 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-33836 added > DataStreamWriter.toTable and DataStreamReader.table but they are mistakenly > not documented. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40043) Document DataStreamWriter.toTable and DataStreamReader.table
[ https://issues.apache.org/jira/browse/SPARK-40043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40043: Assignee: Apache Spark > Document DataStreamWriter.toTable and DataStreamReader.table > > > Key: SPARK-40043 > URL: https://issues.apache.org/jira/browse/SPARK-40043 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-33836 added > DataStreamWriter.toTable and DataStreamReader.table but they are mistakenly > not documented. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40043) Document DataStreamWriter.toTable and DataStreamReader.table
[ https://issues.apache.org/jira/browse/SPARK-40043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40043: Assignee: (was: Apache Spark) > Document DataStreamWriter.toTable and DataStreamReader.table > > > Key: SPARK-40043 > URL: https://issues.apache.org/jira/browse/SPARK-40043 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-33836 added > DataStreamWriter.toTable and DataStreamReader.table but they are mistakenly > not documented. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40043) Document DataStreamWriter.toTable and DataStreamReader.table
Hyukjin Kwon created SPARK-40043: Summary: Document DataStreamWriter.toTable and DataStreamReader.table Key: SPARK-40043 URL: https://issues.apache.org/jira/browse/SPARK-40043 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 3.2.2, 3.3.0, 3.1.3, 3.4.0 Reporter: Hyukjin Kwon https://issues.apache.org/jira/browse/SPARK-33836 added DataStreamWriter.toTable and DataStreamReader.table but they are mistakenly not documented. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
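For context, a minimal PySpark sketch of the two APIs this ticket asks to document; the table names and checkpoint path below are illustrative, not taken from the ticket.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataStreamReader.table: read a streaming DataFrame from an existing table
# (the table name "source_events" is only an example).
events = spark.readStream.table("source_events")

# DataStreamWriter.toTable: start the query and write its output to a table.
query = (
    events.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/source_events")  # example path
    .toTable("sink_events")
)
# query is a StreamingQuery; call awaitTermination() or stop() as needed.
{code}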
[jira] [Commented] (SPARK-40041) Add Document Parameters for pyspark.sql.window
[ https://issues.apache.org/jira/browse/SPARK-40041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578230#comment-17578230 ] Apache Spark commented on SPARK-40041: -- User 'dcoliversun' has created a pull request for this issue: https://github.com/apache/spark/pull/37476 > Add Document Parameters for pyspark.sql.window > -- > > Key: SPARK-40041 > URL: https://issues.apache.org/jira/browse/SPARK-40041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40041) Add Document Parameters for pyspark.sql.window
[ https://issues.apache.org/jira/browse/SPARK-40041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40041: Assignee: Apache Spark > Add Document Parameters for pyspark.sql.window > -- > > Key: SPARK-40041 > URL: https://issues.apache.org/jira/browse/SPARK-40041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40041) Add Document Parameters for pyspark.sql.window
[ https://issues.apache.org/jira/browse/SPARK-40041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40041: Assignee: (was: Apache Spark) > Add Document Parameters for pyspark.sql.window > -- > > Key: SPARK-40041 > URL: https://issues.apache.org/jira/browse/SPARK-40041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40042) Make pyspark.sql.streaming.query examples self-contained
Qian Sun created SPARK-40042: Summary: Make pyspark.sql.streaming.query examples self-contained Key: SPARK-40042 URL: https://issues.apache.org/jira/browse/SPARK-40042 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Qian Sun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39895) pyspark drop doesn't accept *cols
[ https://issues.apache.org/jira/browse/SPARK-39895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39895: Assignee: Santosh Pingale > pyspark drop doesn't accept *cols > -- > > Key: SPARK-39895 > URL: https://issues.apache.org/jira/browse/SPARK-39895 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.3, 3.3.0, 3.2.2 >Reporter: Santosh Pingale >Assignee: Santosh Pingale >Priority: Minor > Fix For: 3.4.0 > > > Pyspark dataframe drop has following signature: > {color:#4c9aff}{{def drop(self, *cols: "ColumnOrName") -> > "DataFrame":}}{color} > However when we try to pass multiple Column types to drop function it raises > TypeError > {{each col in the param list should be a string}} > *Minimal reproducible example:* > {color:#4c9aff}values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), > ("id_1", 3, 3), ("id_2", 4, 3)]{color} > {color:#4c9aff}df = spark.createDataFrame(values, "id string, point int, > count int"){color} > |– id: string (nullable = true)| > |– point: integer (nullable = true)| > |– count: integer (nullable = true)| > {color:#4c9aff}{{df.drop(df.point, df.count)}}{color} > {quote}{color:#505f79}/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py > in drop(self, *cols){color} > {color:#505f79}2537 for col in cols:{color} > {color:#505f79}2538 if not isinstance(col, str):{color} > {color:#505f79}-> 2539 raise TypeError("each col in the param list should be > a string"){color} > {color:#505f79}2540 jdf = self._jdf.drop(self._jseq(cols)){color} > {color:#505f79}2541{color} > {color:#505f79}TypeError: each col in the param list should be a string{color} > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39895) pyspark drop doesn't accept *cols
[ https://issues.apache.org/jira/browse/SPARK-39895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39895. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37335 [https://github.com/apache/spark/pull/37335] > pyspark drop doesn't accept *cols > -- > > Key: SPARK-39895 > URL: https://issues.apache.org/jira/browse/SPARK-39895 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.3, 3.3.0, 3.2.2 >Reporter: Santosh Pingale >Priority: Minor > Fix For: 3.4.0 > > > Pyspark dataframe drop has following signature: > {color:#4c9aff}{{def drop(self, *cols: "ColumnOrName") -> > "DataFrame":}}{color} > However when we try to pass multiple Column types to drop function it raises > TypeError > {{each col in the param list should be a string}} > *Minimal reproducible example:* > {color:#4c9aff}values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), > ("id_1", 3, 3), ("id_2", 4, 3)]{color} > {color:#4c9aff}df = spark.createDataFrame(values, "id string, point int, > count int"){color} > |– id: string (nullable = true)| > |– point: integer (nullable = true)| > |– count: integer (nullable = true)| > {color:#4c9aff}{{df.drop(df.point, df.count)}}{color} > {quote}{color:#505f79}/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py > in drop(self, *cols){color} > {color:#505f79}2537 for col in cols:{color} > {color:#505f79}2538 if not isinstance(col, str):{color} > {color:#505f79}-> 2539 raise TypeError("each col in the param list should be > a string"){color} > {color:#505f79}2540 jdf = self._jdf.drop(self._jseq(cols)){color} > {color:#505f79}2541{color} > {color:#505f79}TypeError: each col in the param list should be a string{color} > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
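As a usage sketch of what the fix enables (hedged: the Column-argument form is assumed to be accepted only from the 3.4.0 fix version onwards):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]
df = spark.createDataFrame(values, "id string, point int, count int")

# String column names have always been accepted:
df.drop("point", "count").printSchema()

# Column arguments raised "each col in the param list should be a string"
# before the fix; with SPARK-39895 resolved they should be accepted too.
# df["count"] is used instead of df.count because count is also a DataFrame method.
df.drop(df.point, df["count"]).printSchema()
{code}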
[jira] [Created] (SPARK-40041) Add Document Parameters for pyspark.sql.window
Qian Sun created SPARK-40041: Summary: Add Document Parameters for pyspark.sql.window Key: SPARK-40041 URL: https://issues.apache.org/jira/browse/SPARK-40041 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Qian Sun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
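Since the ticket has no description, a short sketch of the pyspark.sql.window API whose parameters it proposes to document; the data and window spec are illustrative only.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# Window.partitionBy / orderBy / rowsBetween are the entry points whose
# parameters (cols, start, end) the ticket wants documented.
w = (
    Window.partitionBy("key")
    .orderBy(F.col("value"))
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
df.withColumn("running_sum", F.sum("value").over(w)).show()
{code}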
[jira] [Commented] (SPARK-39708) ALS Model Loading
[ https://issues.apache.org/jira/browse/SPARK-39708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578220#comment-17578220 ] Ruifeng Zheng commented on SPARK-39708: --- if you are using `pyspark.mllib.recommendation`, try to load it with `MatrixFactorizationModel.load(sc, path)` https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.mllib.recommendation.MatrixFactorizationModel.html#pyspark.mllib.recommendation.MatrixFactorizationModel > ALS Model Loading > - > > Key: SPARK-39708 > URL: https://issues.apache.org/jira/browse/SPARK-39708 > Project: Spark > Issue Type: Question > Components: PySpark, Spark Submit >Affects Versions: 3.2.0 >Reporter: zehra >Priority: Critical > Labels: model, pyspark > > I have an ALS model and saved it with these codes: > {code:java} > als_path = "saved_models/best" > best_model.save(sc, path= als_path){code} > However, when I try to load this model, it gives this error message: > > {code:java} > ---> 10 model2 = ALS.load(als_path) > > File /usr/local/spark/python/pyspark/ml/util.py:332, in > MLReadable.load(cls, path) > 329 @classmethod > 330 def load(cls, path): > 331 """Reads an ML instance from the input path, a shortcut of > `read().load(path)`.""" > --> 332 return cls.read().load(path) > > File /usr/local/spark/python/pyspark/ml/util.py:282, in > JavaMLReader.load(self, path) > 280 if not isinstance(path, str): > 281 raise TypeError("path should be a string, got type %s" % > type(path)) > --> 282 java_obj = self._jread.load(path) > 283 if not hasattr(self._clazz, "_from_java"): > 284 raise NotImplementedError("This Java ML type cannot be loaded > into Python currently: %r" > 285 % self._clazz) > > File > /usr/local/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py:1321, > in JavaMember.__call__(self, *args) > 1315 command = proto.CALL_COMMAND_NAME +\ > 1316 self.command_header +\ > 1317 args_command +\ > 1318 proto.END_COMMAND_PART > 1320 answer = self.gateway_client.send_command(command) > -> 1321 return_value = get_return_value( > 1322 answer, self.gateway_client, self.target_id, self.name) > 1324 for temp_arg in temp_args: > 1325 temp_arg._detach() > > File /usr/local/spark/python/pyspark/sql/utils.py:111, in > capture_sql_exception..deco(*a, **kw) > 109 def deco(*a, **kw): > 110 try: > --> 111 return f(*a, **kw) > 112 except py4j.protocol.Py4JJavaError as e: > 113 converted = convert_exception(e.java_exception) > > File > /usr/local/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py:326, in > get_return_value(answer, gateway_client, target_id, name) > 324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client) > 325 if answer[1] == REFERENCE_TYPE: > --> 326 raise Py4JJavaError( > 327 "An error occurred while calling {0}{1}{2}.\n". > 328 format(target_id, ".", name), value) > 329 else: > 330 raise Py4JError( > 331 "An error occurred while calling {0}{1}{2}. > Trace:\n{3}\n". > 332 format(target_id, ".", name, value)) > > Py4JJavaError: An error occurred while calling o372.load. 
> : org.json4s.MappingException: Did not find value which can be converted > into java.lang.String > at org.json4s.reflect.package$.fail(package.scala:53) > at org.json4s.Extraction$.$anonfun$convert$2(Extraction.scala:881) > at scala.Option.getOrElse(Option.scala:189) > at org.json4s.Extraction$.convert(Extraction.scala:881) > at org.json4s.Extraction$.$anonfun$extract$10(Extraction.scala:456) > at > org.json4s.Extraction$.$anonfun$customOrElse$1(Extraction.scala:780) > > {code} > > I both tried to use `ALS.load` or `ALSModel.load` as shown in the Apache > spark documentation: > [https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html#:~:text=als_path%20%3D%20temp_path%20%2B%20%22/als%22%0A%3E%3E%3E][1] > > [1]: > https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html#:~:text=als_path%20%3D%20temp_path%20%2B%20%22/als%22%0A%3E%3E%3E -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39708) ALS Model Loading
[ https://issues.apache.org/jira/browse/SPARK-39708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578219#comment-17578219 ] Ruifeng Zheng commented on SPARK-39708: --- ??File /usr/local/spark/python/pyspark/ml/util.py:332, in MLReadable.load(cls, path)?? it seems that you were useing the `pyspark.ml`, which is DataFrame based; ??best_model.save(sc, path= als_path)?? then the `pyspark.ml.recommendation.ALSModel`'s save method, do not have `sc` as a parameter https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.ml.recommendation.ALSModel.html#pyspark.ml.recommendation.ALSModel.save > ALS Model Loading > - > > Key: SPARK-39708 > URL: https://issues.apache.org/jira/browse/SPARK-39708 > Project: Spark > Issue Type: Question > Components: PySpark, Spark Submit >Affects Versions: 3.2.0 >Reporter: zehra >Priority: Critical > Labels: model, pyspark > > I have an ALS model and saved it with these codes: > {code:java} > als_path = "saved_models/best" > best_model.save(sc, path= als_path){code} > However, when I try to load this model, it gives this error message: > > {code:java} > ---> 10 model2 = ALS.load(als_path) > > File /usr/local/spark/python/pyspark/ml/util.py:332, in > MLReadable.load(cls, path) > 329 @classmethod > 330 def load(cls, path): > 331 """Reads an ML instance from the input path, a shortcut of > `read().load(path)`.""" > --> 332 return cls.read().load(path) > > File /usr/local/spark/python/pyspark/ml/util.py:282, in > JavaMLReader.load(self, path) > 280 if not isinstance(path, str): > 281 raise TypeError("path should be a string, got type %s" % > type(path)) > --> 282 java_obj = self._jread.load(path) > 283 if not hasattr(self._clazz, "_from_java"): > 284 raise NotImplementedError("This Java ML type cannot be loaded > into Python currently: %r" > 285 % self._clazz) > > File > /usr/local/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py:1321, > in JavaMember.__call__(self, *args) > 1315 command = proto.CALL_COMMAND_NAME +\ > 1316 self.command_header +\ > 1317 args_command +\ > 1318 proto.END_COMMAND_PART > 1320 answer = self.gateway_client.send_command(command) > -> 1321 return_value = get_return_value( > 1322 answer, self.gateway_client, self.target_id, self.name) > 1324 for temp_arg in temp_args: > 1325 temp_arg._detach() > > File /usr/local/spark/python/pyspark/sql/utils.py:111, in > capture_sql_exception..deco(*a, **kw) > 109 def deco(*a, **kw): > 110 try: > --> 111 return f(*a, **kw) > 112 except py4j.protocol.Py4JJavaError as e: > 113 converted = convert_exception(e.java_exception) > > File > /usr/local/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py:326, in > get_return_value(answer, gateway_client, target_id, name) > 324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client) > 325 if answer[1] == REFERENCE_TYPE: > --> 326 raise Py4JJavaError( > 327 "An error occurred while calling {0}{1}{2}.\n". > 328 format(target_id, ".", name), value) > 329 else: > 330 raise Py4JError( > 331 "An error occurred while calling {0}{1}{2}. > Trace:\n{3}\n". > 332 format(target_id, ".", name, value)) > > Py4JJavaError: An error occurred while calling o372.load. 
> : org.json4s.MappingException: Did not find value which can be converted > into java.lang.String > at org.json4s.reflect.package$.fail(package.scala:53) > at org.json4s.Extraction$.$anonfun$convert$2(Extraction.scala:881) > at scala.Option.getOrElse(Option.scala:189) > at org.json4s.Extraction$.convert(Extraction.scala:881) > at org.json4s.Extraction$.$anonfun$extract$10(Extraction.scala:456) > at > org.json4s.Extraction$.$anonfun$customOrElse$1(Extraction.scala:780) > > {code} > > I both tried to use `ALS.load` or `ALSModel.load` as shown in the Apache > spark documentation: > [https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html#:~:text=als_path%20%3D%20temp_path%20%2B%20%22/als%22%0A%3E%3E%3E][1] > > [1]: > https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html#:~:text=als_path%20%3D%20temp_path%20%2B%20%22/als%22%0A%3E%3E%3E -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail:
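To make the distinction drawn in the two comments above concrete, a hedged sketch of the matching save/load pairs; the paths are illustrative, and `ratings_df`/`sc` are assumed to already exist.

{code:python}
# DataFrame-based API (pyspark.ml): save/load take only a path, never a SparkContext.
from pyspark.ml.recommendation import ALS, ALSModel

als = ALS(userCol="user", itemCol="item", ratingCol="rating")
model = als.fit(ratings_df)                       # ratings_df: an assumed ratings DataFrame
model.save("saved_models/best_ml")                # note: no `sc` argument here
reloaded = ALSModel.load("saved_models/best_ml")  # load the fitted model with ALSModel, not ALS

# RDD-based API (pyspark.mllib): load takes the SparkContext plus a path,
# as pointed out in the comment above.
from pyspark.mllib.recommendation import MatrixFactorizationModel

mf_model = MatrixFactorizationModel.load(sc, "saved_models/best_mllib")  # sc assumed in scope
{code}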
[jira] [Updated] (SPARK-40021) `Cancel workflow` can not cancel some jobs
[ https://issues.apache.org/jira/browse/SPARK-40021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-40021: -- Attachment: Screen Shot 2022-08-11 at 10.07.24.png > `Cancel workflow` can not cancel some jobs > -- > > Key: SPARK-40021 > URL: https://issues.apache.org/jira/browse/SPARK-40021 > Project: Spark > Issue Type: Test > Components: Project Infra, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > Attachments: Screen Shot 2022-08-09 at 22.49.13.png, Screen Shot > 2022-08-11 at 10.07.24.png > > > I have been observing this behavior for a long time: > sometime I want to cancel the workflow to release resource for other > workflows, so I click the {{*{color:red}Cancel workflow{color}*}} button, > most jobs in current workflow will be canceled in several minutes, but some > jobs can never be killed, including: > * Run / Build modules: pyspark-sql, pyspark-mllib, pyspark-resource > * Run / Build modules: pyspark-core, pyspark-streaming, pyspark-ml > * Run / Build modules: pyspark-pandas > * Run / Build modules: pyspark-pandas-slow > * Run / Linters, licenses, dependencies and documentation generation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40021) `Cancel workflow` can not cancel some jobs
[ https://issues.apache.org/jira/browse/SPARK-40021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578212#comment-17578212 ] Ruifeng Zheng edited comment on SPARK-40021 at 8/11/22 2:08 AM: ??Did you try to kill those Python jobs individually??? sorry but I don't find a place to cancel a job individually. I only tried the `cancel workflow` and `cancel run`, which should cancel the whole workflow was (Author: podongfeng): Did you try to kill those Python jobs individually? > `Cancel workflow` can not cancel some jobs > -- > > Key: SPARK-40021 > URL: https://issues.apache.org/jira/browse/SPARK-40021 > Project: Spark > Issue Type: Test > Components: Project Infra, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > Attachments: Screen Shot 2022-08-09 at 22.49.13.png, Screen Shot > 2022-08-11 at 10.07.24.png > > > I have been observing this behavior for a long time: > sometime I want to cancel the workflow to release resource for other > workflows, so I click the {{*{color:red}Cancel workflow{color}*}} button, > most jobs in current workflow will be canceled in several minutes, but some > jobs can never be killed, including: > * Run / Build modules: pyspark-sql, pyspark-mllib, pyspark-resource > * Run / Build modules: pyspark-core, pyspark-streaming, pyspark-ml > * Run / Build modules: pyspark-pandas > * Run / Build modules: pyspark-pandas-slow > * Run / Linters, licenses, dependencies and documentation generation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40040) Push local limit to both sides if join condition is empty
[ https://issues.apache.org/jira/browse/SPARK-40040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40040: Assignee: Apache Spark > Push local limit to both sides if join condition is empty > - > > Key: SPARK-40040 > URL: https://issues.apache.org/jira/browse/SPARK-40040 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40040) Push local limit to both sides if join condition is empty
[ https://issues.apache.org/jira/browse/SPARK-40040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40040: Assignee: (was: Apache Spark) > Push local limit to both sides if join condition is empty > - > > Key: SPARK-40040 > URL: https://issues.apache.org/jira/browse/SPARK-40040 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40021) `Cancel workflow` can not cancel some jobs
[ https://issues.apache.org/jira/browse/SPARK-40021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578212#comment-17578212 ] Ruifeng Zheng commented on SPARK-40021: --- Did you try to kill those Python jobs individually? > `Cancel workflow` can not cancel some jobs > -- > > Key: SPARK-40021 > URL: https://issues.apache.org/jira/browse/SPARK-40021 > Project: Spark > Issue Type: Test > Components: Project Infra, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > Attachments: Screen Shot 2022-08-09 at 22.49.13.png > > > I have been observing this behavior for a long time: > sometime I want to cancel the workflow to release resource for other > workflows, so I click the {{*{color:red}Cancel workflow{color}*}} button, > most jobs in current workflow will be canceled in several minutes, but some > jobs can never be killed, including: > * Run / Build modules: pyspark-sql, pyspark-mllib, pyspark-resource > * Run / Build modules: pyspark-core, pyspark-streaming, pyspark-ml > * Run / Build modules: pyspark-pandas > * Run / Build modules: pyspark-pandas-slow > * Run / Linters, licenses, dependencies and documentation generation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40040) Push local limit to both sides if join condition is empty
[ https://issues.apache.org/jira/browse/SPARK-40040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578211#comment-17578211 ] Apache Spark commented on SPARK-40040: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/37475 > Push local limit to both sides if join condition is empty > - > > Key: SPARK-40040 > URL: https://issues.apache.org/jira/browse/SPARK-40040 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40040) Push local limit to both sides if join condition is empty
[ https://issues.apache.org/jira/browse/SPARK-40040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40040: Summary: Push local limit to both sides if join condition is empty (was: Push local limit through outer join if join condition is empty) > Push local limit to both sides if join condition is empty > - > > Key: SPARK-40040 > URL: https://issues.apache.org/jira/browse/SPARK-40040 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40040) Push local limit through outer join if join condition is empty
Yuming Wang created SPARK-40040: --- Summary: Push local limit through outer join if join condition is empty Key: SPARK-40040 URL: https://issues.apache.org/jira/browse/SPARK-40040 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
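The ticket carries no description, so as an illustration only: the title suggests plans like the sketch below, where a join has no condition (a cross product) under a LIMIT, and a local limit could safely be applied to both join inputs first. This is a reading of the title, not the actual optimizer rule.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.range(1_000_000).toDF("l")
right = spark.range(1_000_000).toDF("r")

# Join with an empty condition (a cross product) followed by a limit.
# Taking any 5 rows from each side already yields at least 5 combined rows,
# so a LocalLimit(5) on both inputs would not change the result of LIMIT 5.
result = left.crossJoin(right).limit(5)
result.explain()
{code}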
[jira] [Commented] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578195#comment-17578195 ] Jungtaek Lim commented on SPARK-40039: -- Very interesting one to see! (Disclaimer: Abortable was something I worked with Steve.) Have you gone through some benchmarks to figure out this works with small to big files? One thing I wonder is whether multipart upload performs well with tiny file. We have lots of tiny files in checkpoint and all files could be pretty tiny for stateless query. > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578193#comment-17578193 ] Apache Spark commented on SPARK-40039: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/37474 > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40039: Assignee: Attila Zsolt Piros (was: Apache Spark) > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40039: Assignee: Apache Spark (was: Attila Zsolt Piros) > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Apache Spark >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-40039: --- Summary: Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface (was: Introducing checkpoint file manager based on Hadoop's Abortable interface) > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40039) Introducing checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578189#comment-17578189 ] Attila Zsolt Piros commented on SPARK-40039: I am working on this. > Introducing checkpoint file manager based on Hadoop's Abortable interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40039) Introducing checkpoint file manager based on Hadoop's Abortable interface
Attila Zsolt Piros created SPARK-40039: -- Summary: Introducing checkpoint file manager based on Hadoop's Abortable interface Key: SPARK-40039 URL: https://issues.apache.org/jira/browse/SPARK-40039 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Attila Zsolt Piros Assignee: Attila Zsolt Piros Currently on S3 the checkpoint file manager (called FileContextBasedCheckpointFileManager) is based on rename. So when a file is opened for an atomic stream, a temporary file is used instead, and when the stream is committed the file is renamed. But on S3 a rename is a file copy, so it has serious performance implications. On Hadoop 3 there is a new interface called *Abortable*, and *S3AFileSystem* has this capability, implemented on top of S3's multipart upload. So when the file is committed a POST is sent ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) and when it is aborted a DELETE is sent ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39983) Should not cache unserialized broadcast relations on the driver
[ https://issues.apache.org/jira/browse/SPARK-39983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-39983. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37413 [https://github.com/apache/spark/pull/37413] > Should not cache unserialized broadcast relations on the driver > --- > > Key: SPARK-39983 > URL: https://issues.apache.org/jira/browse/SPARK-39983 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Alex Balikov >Assignee: Alex Balikov >Priority: Minor > Fix For: 3.4.0 > > > In TorrentBroadcast.writeBlocks we store the unserialized broadcast object in > addition to the serialized version of it - > {code:java} > private def writeBlocks(value: T): Int = { > import StorageLevel._ > // Store a copy of the broadcast variable in the driver so that tasks run > on the driver > // do not create a duplicate copy of the broadcast variable's value. > val blockManager = SparkEnv.get.blockManager > if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, > tellMaster = false)) { > throw new SparkException(s"Failed to store $broadcastId in > BlockManager") > } > {code} > In case of broadcast relations, these objects can be fairly large (60MB in > one observed case) and are not strictly necessary on the driver. > Add the option to not keep the unserialized versions of the objects. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39983) Should not cache unserialized broadcast relations on the driver
[ https://issues.apache.org/jira/browse/SPARK-39983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-39983: -- Assignee: Alex Balikov > Should not cache unserialized broadcast relations on the driver > --- > > Key: SPARK-39983 > URL: https://issues.apache.org/jira/browse/SPARK-39983 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Alex Balikov >Assignee: Alex Balikov >Priority: Minor > > In TorrentBroadcast.writeBlocks we store the unserialized broadcast object in > addition to the serialized version of it - > {code:java} > private def writeBlocks(value: T): Int = { > import StorageLevel._ > // Store a copy of the broadcast variable in the driver so that tasks run > on the driver > // do not create a duplicate copy of the broadcast variable's value. > val blockManager = SparkEnv.get.blockManager > if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, > tellMaster = false)) { > throw new SparkException(s"Failed to store $broadcastId in > BlockManager") > } > {code} > In case of broadcast relations, these objects can be fairly large (60MB in > one observed case) and are not strictly necessary on the driver. > Add the option to not keep the unserialized versions of the objects. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578168#comment-17578168 ] Apache Spark commented on SPARK-40037: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/37473 > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40038) spark.sql.files.maxPartitionBytes does not observe on-disk compression
[ https://issues.apache.org/jira/browse/SPARK-40038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578169#comment-17578169 ] RJ Marcus commented on SPARK-40038: --- the screenshot with spillage didn't have the stage fully completed when I made the capture, which is why the total data input / output may not look correct (should be 230GB) > spark.sql.files.maxPartitionBytes does not observe on-disk compression > -- > > Key: SPARK-40038 > URL: https://issues.apache.org/jira/browse/SPARK-40038 > Project: Spark > Issue Type: Question > Components: Input/Output, Optimizer, PySpark, SQL >Affects Versions: 3.2.0 > Environment: files: > - ORC with snappy compression > - 232 GB files on disk > - 1800 files on disk (pretty sure no individual file is over 200MB) > - 9 partitions on disk > cluster: > - EMR 6.6.0 (spark 3.2.0) > - cluster: 288 vCPU (executors), 1.1TB memory (executors) > OS info: > LSB Version: > :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch > Distributor ID: Amazon > Description: Amazon Linux release 2 (Karoo) > Release: 2 > Codename: Karoo >Reporter: RJ Marcus >Priority: Major > Attachments: Screenshot from 2022-08-10 16-50-37.png, Screenshot from > 2022-08-10 16-59-56.png > > > Why does `spark.sql.files.maxPartitionBytes` estimate the number of > partitions based on {_}file size on disk instead of the uncompressed file > size{_}? > For example I have a dataset that is 213GB on disk. When I read this in to my > application I get 2050 partitions based on the default value of 128MB for > maxPartitionBytes. My application is a simple broadcast index join that adds > 1 column to the dataframe and writes it out. There is no shuffle. > Initially the size of input /output records seem ok, but I still get a large > amount of memory "spill" on the executors. I believe this is due to the data > being highly compressed and each partition becoming too big when it is > deserialized to work on in memory. > !image-2022-08-10-16-59-05-233.png! > (If I try to do a repartition immediately after reading I still see the first > stage spilling memory to disk, so that is not the right solution or what I'm > interested in.) > Instead, I attempt to lower maxPartitionBytes by the (average) compression > ratio of my files (about 7x, so let's round up to 8). So I set > maxPartitionBytes=16MB. At this point I see that spark is reading in from > the file in 12-28 MB chunks. Now it makes 14316 partitions on the initial > file read and completes with no spillage. > !image-2022-08-10-16-59-59-778.png! > > Is there something I'm missing here? Is this just intended behavior? How can > I tune my partition size correctly for my application when I do not know how > much the data will be compressed ahead of time? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578167#comment-17578167 ] Apache Spark commented on SPARK-40037: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/37473 > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40038) spark.sql.files.maxPartitionBytes does not observe on-disk compression
[ https://issues.apache.org/jira/browse/SPARK-40038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Marcus updated SPARK-40038: -- Attachment: Screenshot from 2022-08-10 16-59-56.png Screenshot from 2022-08-10 16-50-37.png > spark.sql.files.maxPartitionBytes does not observe on-disk compression > -- > > Key: SPARK-40038 > URL: https://issues.apache.org/jira/browse/SPARK-40038 > Project: Spark > Issue Type: Question > Components: Input/Output, Optimizer, PySpark, SQL >Affects Versions: 3.2.0 > Environment: files: > - ORC with snappy compression > - 232 GB files on disk > - 1800 files on disk (pretty sure no individual file is over 200MB) > - 9 partitions on disk > cluster: > - EMR 6.6.0 (spark 3.2.0) > - cluster: 288 vCPU (executors), 1.1TB memory (executors) > OS info: > LSB Version: > :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch > Distributor ID: Amazon > Description: Amazon Linux release 2 (Karoo) > Release: 2 > Codename: Karoo >Reporter: RJ Marcus >Priority: Major > Attachments: Screenshot from 2022-08-10 16-50-37.png, Screenshot from > 2022-08-10 16-59-56.png > > > Why does `spark.sql.files.maxPartitionBytes` estimate the number of > partitions based on {_}file size on disk instead of the uncompressed file > size{_}? > For example I have a dataset that is 213GB on disk. When I read this in to my > application I get 2050 partitions based on the default value of 128MB for > maxPartitionBytes. My application is a simple broadcast index join that adds > 1 column to the dataframe and writes it out. There is no shuffle. > Initially the size of input /output records seem ok, but I still get a large > amount of memory "spill" on the executors. I believe this is due to the data > being highly compressed and each partition becoming too big when it is > deserialized to work on in memory. > !image-2022-08-10-16-59-05-233.png! > (If I try to do a repartition immediately after reading I still see the first > stage spilling memory to disk, so that is not the right solution or what I'm > interested in.) > Instead, I attempt to lower maxPartitionBytes by the (average) compression > ratio of my files (about 7x, so let's round up to 8). So I set > maxPartitionBytes=16MB. At this point I see that spark is reading in from > the file in 12-28 MB chunks. Now it makes 14316 partitions on the initial > file read and completes with no spillage. > !image-2022-08-10-16-59-59-778.png! > > Is there something I'm missing here? Is this just intended behavior? How can > I tune my partition size correctly for my application when I do not know how > much the data will be compressed ahead of time? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40037: Assignee: Apache Spark > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Major > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40037: Assignee: (was: Apache Spark) > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40038) spark.sql.files.maxPartitionBytes does not observe on-disk compression
RJ Marcus created SPARK-40038: - Summary: spark.sql.files.maxPartitionBytes does not observe on-disk compression Key: SPARK-40038 URL: https://issues.apache.org/jira/browse/SPARK-40038 Project: Spark Issue Type: Question Components: Input/Output, Optimizer, PySpark, SQL Affects Versions: 3.2.0 Environment: files: - ORC with snappy compression - 232 GB files on disk - 1800 files on disk (pretty sure no individual file is over 200MB) - 9 partitions on disk cluster: - EMR 6.6.0 (spark 3.2.0) - cluster: 288 vCPU (executors), 1.1TB memory (executors) OS info: LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch Distributor ID: Amazon Description: Amazon Linux release 2 (Karoo) Release: 2 Codename: Karoo Reporter: RJ Marcus Why does `spark.sql.files.maxPartitionBytes` estimate the number of partitions based on {_}file size on disk instead of the uncompressed file size{_}? For example I have a dataset that is 213GB on disk. When I read this in to my application I get 2050 partitions based on the default value of 128MB for maxPartitionBytes. My application is a simple broadcast index join that adds 1 column to the dataframe and writes it out. There is no shuffle. Initially the size of input /output records seem ok, but I still get a large amount of memory "spill" on the executors. I believe this is due to the data being highly compressed and each partition becoming too big when it is deserialized to work on in memory. !image-2022-08-10-16-59-05-233.png! (If I try to do a repartition immediately after reading I still see the first stage spilling memory to disk, so that is not the right solution or what I'm interested in.) Instead, I attempt to lower maxPartitionBytes by the (average) compression ratio of my files (about 7x, so let's round up to 8). So I set maxPartitionBytes=16MB. At this point I see that spark is reading in from the file in 12-28 MB chunks. Now it makes 14316 partitions on the initial file read and completes with no spillage. !image-2022-08-10-16-59-59-778.png! Is there something I'm missing here? Is this just intended behavior? How can I tune my partition size correctly for my application when I do not know how much the data will be compressed ahead of time? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
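For readers following SPARK-40038, the workaround the reporter describes is simply shrinking `spark.sql.files.maxPartitionBytes` by an estimated on-disk compression ratio so that a partition still fits in memory once decompressed. The sketch below shows only that workaround; the 8x ratio, input path, and session setup are made-up values for illustration, not data from the ticket.

{code:scala}
// Hedged sketch of the tuning described above: divide the default 128 MB split target
// by an *estimated* compression ratio. Ratio and path are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object MaxPartitionBytesTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("maxPartitionBytes-tuning-sketch")
      .getOrCreate()

    val defaultTargetBytes = 128L * 1024 * 1024  // Spark's default maxPartitionBytes
    val estimatedCompressionRatio = 8L           // guessed for snappy-compressed ORC
    val tunedTargetBytes = defaultTargetBytes / estimatedCompressionRatio

    // Must be set before the read that should honor it.
    spark.conf.set("spark.sql.files.maxPartitionBytes", tunedTargetBytes.toString)

    val df = spark.read.orc("s3://some-bucket/some-orc-dataset/")  // hypothetical input
    println(s"partitions after read: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
{code}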
[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-40025: -- Description: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency was: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Please reference full blog post for this project: [https://www.databricks.com/blog/2022/06/28/project-lightspeed-faster-and-simpler-stream-processing-with-apache-spark.html] Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency > Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark > -- > > Key: SPARK-40025 > URL: https://issues.apache.org/jira/browse/SPARK-40025 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.2.2 >Reporter: Boyang Jerry Peng >Priority: Major > > Project Lightspeed is an umbrella project aimed at improving a couple of key > aspects of Spark Streaming: > * Improving the latency and ensuring it is predictable > * Enhancing functionality for processing data with new operators and APIs > > Umbrella Jira to track all tickets under Project Lightspeed > SPARK-39585 - Multiple Stateful Operators in Structured Streaming > SPARK-39586 - Advanced Windowing in Structured Streaming > SPARK-39587 - Schema Evolution for Stateful Pipelines > SPARK-39589 - Asynchronous I/O support > SPARK-39590 - Python API for Arbitrary Stateful Processing > SPARK-39591 - Offset Management Improvements > SPARK-39592 - Asynchronous State Checkpointing > SPARK-39593 - Configurable State Checkpointing Frequency -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-40025: -- Description: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Please reference full blog post for this project: [https://www.databricks.com/blog/2022/06/28/project-lightspeed-faster-and-simpler-stream-processing-with-apache-spark.html] Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency was: Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency > Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark > -- > > Key: SPARK-40025 > URL: https://issues.apache.org/jira/browse/SPARK-40025 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.2.2 >Reporter: Boyang Jerry Peng >Priority: Major > > Project Lightspeed is an umbrella project aimed at improving a couple of key > aspects of Spark Streaming: > * Improving the latency and ensuring it is predictable > * Enhancing functionality for processing data with new operators and APIs > > Please reference full blog post for this project: > [https://www.databricks.com/blog/2022/06/28/project-lightspeed-faster-and-simpler-stream-processing-with-apache-spark.html] > > > Umbrella Jira to track all tickets under Project Lightspeed > SPARK-39585 - Multiple Stateful Operators in Structured Streaming > SPARK-39586 - Advanced Windowing in Structured Streaming > SPARK-39587 - Schema Evolution for Stateful Pipelines > SPARK-39589 - Asynchronous I/O support > SPARK-39590 - Python API for Arbitrary Stateful Processing > SPARK-39591 - Offset Management Improvements > SPARK-39592 - Asynchronous State Checkpointing > SPARK-39593 - Configurable State Checkpointing Frequency -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-40025: -- Summary: Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark (was: Project Lightspeed (Spark Streaming Improvements)) > Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark > -- > > Key: SPARK-40025 > URL: https://issues.apache.org/jira/browse/SPARK-40025 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.2.2 >Reporter: Boyang Jerry Peng >Priority: Major > > Umbrella Jira to track all tickets under Project Lightspeed > SPARK-39585 - Multiple Stateful Operators in Structured Streaming > SPARK-39586 - Advanced Windowing in Structured Streaming > SPARK-39587 - Schema Evolution for Stateful Pipelines > SPARK-39589 - Asynchronous I/O support > SPARK-39590 - Python API for Arbitrary Stateful Processing > SPARK-39591 - Offset Management Improvements > SPARK-39592 - Asynchronous State Checkpointing > SPARK-39593 - Configurable State Checkpointing Frequency -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39743. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37416 [https://github.com/apache/spark/pull/37416] > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Assignee: zhiming she >Priority: Minor > Fix For: 3.4.0 > > > While writing zstd-compressed parquet files, the setting > `spark.io.compression.zstd.level` does not have any effect on > the zstd compression level. > All files seem to be written with the default zstd compression level, and the > config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level > for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-39743: - Assignee: zhiming she > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Assignee: zhiming she >Priority: Minor > > While writing zstd-compressed parquet files, the setting > `spark.io.compression.zstd.level` does not have any effect on > the zstd compression level. > All files seem to be written with the default zstd compression level, and the > config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level > for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
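A note on SPARK-39743 for anyone hitting the same symptom: `spark.io.compression.zstd.level` only tunes Spark's internal I/O compression (shuffle, broadcast, event logs), not the Parquet writer. The sketch below assumes parquet-mr's Hadoop setting `parquet.compression.codec.zstd.level` is honored by the zstd Parquet codec in the deployed Parquet version; that key and the output path are assumptions, and this is not the code from the linked pull request.

{code:scala}
// Hedged sketch: pick zstd as the Parquet codec and ask parquet-mr for a higher level
// through "parquet.compression.codec.zstd.level" (assumed, not verified against the fix).
import org.apache.spark.sql.SparkSession

object ZstdParquetLevelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("zstd-level-sketch").getOrCreate()

    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
    spark.sparkContext.hadoopConfiguration.setInt("parquet.compression.codec.zstd.level", 9)

    val df = spark.range(0, 1000000L).selectExpr("id", "id % 100 as bucket")
    df.write.mode("overwrite").parquet("/tmp/zstd-level-sketch")  // hypothetical output path

    spark.stop()
  }
}
{code}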
[jira] [Commented] (SPARK-39887) Expression transform error
[ https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578094#comment-17578094 ] Apache Spark commented on SPARK-39887: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/37472 > Expression transform error > -- > > Key: SPARK-39887 > URL: https://issues.apache.org/jira/browse/SPARK-39887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.2.2 >Reporter: zhuml >Priority: Major > > {code:java} > spark.sql( > """ > |select to_date(a) a, to_date(b) b from > |(select a, a as b from > |(select to_date(a) a from > | values ('2020-02-01') as t1(a) > | group by to_date(a)) t3 > |union all > |select a, b from > |(select to_date(a) a, to_date(b) b from > |values ('2020-01-01','2020-01-02') as t1(a, b) > | group by to_date(a), to_date(b)) t4) t5 > |group by to_date(a), to_date(b) > |""".stripMargin).show(){code} > result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01) > expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38503) Add warn for getAdditionalPreKubernetesResources in executor side
[ https://issues.apache.org/jira/browse/SPARK-38503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38503. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35786 [https://github.com/apache/spark/pull/35786] > Add warn for getAdditionalPreKubernetesResources in executor side > - > > Key: SPARK-38503 > URL: https://issues.apache.org/jira/browse/SPARK-38503 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38503) Add warn for getAdditionalPreKubernetesResources in executor side
[ https://issues.apache.org/jira/browse/SPARK-38503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38503: - Assignee: Yikun Jiang > Add warn for getAdditionalPreKubernetesResources in executor side > - > > Key: SPARK-38503 > URL: https://issues.apache.org/jira/browse/SPARK-38503 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-40037: Description: [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] [Info at SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] [Info at SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] [releases log|https://github.com/google/tink/releases/tag/v1.7.0] was: [https://www.cve.org/CVERecord?id=CVE-2022-25647 | ] [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327] > [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569] > [Info at > SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703] > [releases log|https://github.com/google/tink/releases/tag/v1.7.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-40037: Description: [https://www.cve.org/CVERecord?id=CVE-2022-25647 | ] [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] was: [https://www.cve.org/CVERecord?id=CVE-2022-25647 | CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [https://www.cve.org/CVERecord?id=CVE-2022-25647 | ] > [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at > SNYK] > [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at > SNYK] > [https://github.com/google/tink/releases/tag/v1.7.0|releases log] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-40037: Description: [https://www.cve.org/CVERecord?id=CVE-2022-25647 | CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] was: [https://www.cve.org/CVERecord?id=CVE-2022-25647| CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [https://www.cve.org/CVERecord?id=CVE-2022-25647 | CVE-2022-25647] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at > SNYK] > [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at > SNYK] > [https://github.com/google/tink/releases/tag/v1.7.0|releases log] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-40037: Description: [https://www.cve.org/CVERecord?id=CVE-2022-25647| CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] was: [https://www.cve.org/CVERecord?id=CVE-2022-25647|CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] > Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 > --- > > Key: SPARK-40037 > URL: https://issues.apache.org/jira/browse/SPARK-40037 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [https://www.cve.org/CVERecord?id=CVE-2022-25647| CVE-2022-25647] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at > SNYK] > [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] > [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at > SNYK] > [https://github.com/google/tink/releases/tag/v1.7.0|releases log] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
Bjørn Jørgensen created SPARK-40037: --- Summary: Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0 Key: SPARK-40037 URL: https://issues.apache.org/jira/browse/SPARK-40037 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 3.4.0 Reporter: Bjørn Jørgensen [https://www.cve.org/CVERecord?id=CVE-2022-25647|CVE-2022-25647] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327|Info at SNYK] [https://www.cve.org/CVERecord?id=CVE-2021-22569|CVE-2021-22569] [https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703|Info at SNYK] [https://github.com/google/tink/releases/tag/v1.7.0|releases log] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40036) LevelDB/RocksDBIterator.next should return false after iterator or db close
[ https://issues.apache.org/jira/browse/SPARK-40036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40036: Assignee: (was: Apache Spark) > LevelDB/RocksDBIterator.next should return false after iterator or db close > --- > > Key: SPARK-40036 > URL: https://issues.apache.org/jira/browse/SPARK-40036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > @Test > public void testHasNextAndNextAfterIteratorClose() throws Exception { > String prefix = "test_db_iter_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one records for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close iter > iter.close(); > // iter.hasNext should be false after iter close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after iter close > assertThrows(NoSuchElementException.class, iter::next); > db.close(); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } > @Test > public void testHasNextAndNextAfterDBClose() throws Exception { > String prefix = "test_db_db_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one record for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close db > db.close(); > // iter.hasNext should be false after db close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after db close > assertThrows(NoSuchElementException.class, iter::next); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } {code} > > For the above two cases, when iterator/db is closed, `hasNext` will return > true, and `next` will return the value not obtained before close. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40036) LevelDB/RocksDBIterator.next should return false after iterator or db close
[ https://issues.apache.org/jira/browse/SPARK-40036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40036: Assignee: Apache Spark > LevelDB/RocksDBIterator.next should return false after iterator or db close > --- > > Key: SPARK-40036 > URL: https://issues.apache.org/jira/browse/SPARK-40036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > {code:java} > @Test > public void testHasNextAndNextAfterIteratorClose() throws Exception { > String prefix = "test_db_iter_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one records for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close iter > iter.close(); > // iter.hasNext should be false after iter close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after iter close > assertThrows(NoSuchElementException.class, iter::next); > db.close(); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } > @Test > public void testHasNextAndNextAfterDBClose() throws Exception { > String prefix = "test_db_db_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one record for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close db > db.close(); > // iter.hasNext should be false after db close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after db close > assertThrows(NoSuchElementException.class, iter::next); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } {code} > > For the above two cases, when iterator/db is closed, `hasNext` will return > true, and `next` will return the value not obtained before close. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40036) LevelDB/RocksDBIterator.next should return false after iterator or db close
[ https://issues.apache.org/jira/browse/SPARK-40036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578066#comment-17578066 ] Apache Spark commented on SPARK-40036: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37471 > LevelDB/RocksDBIterator.next should return false after iterator or db close > --- > > Key: SPARK-40036 > URL: https://issues.apache.org/jira/browse/SPARK-40036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > @Test > public void testHasNextAndNextAfterIteratorClose() throws Exception { > String prefix = "test_db_iter_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one records for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close iter > iter.close(); > // iter.hasNext should be false after iter close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after iter close > assertThrows(NoSuchElementException.class, iter::next); > db.close(); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } > @Test > public void testHasNextAndNextAfterDBClose() throws Exception { > String prefix = "test_db_db_close."; > String suffix = ".ldb"; > File path = File.createTempFile(prefix, suffix); > path.delete(); > LevelDB db = new LevelDB(path); > // Write one record for test > db.write(createCustomType1(0)); > KVStoreIterator iter = > db.view(CustomType1.class).closeableIterator(); > // iter should be true > assertTrue(iter.hasNext()); > // close db > db.close(); > // iter.hasNext should be false after db close > assertFalse(iter.hasNext()); > // iter.next should throw NoSuchElementException after db close > assertThrows(NoSuchElementException.class, iter::next); > assertTrue(path.exists()); > FileUtils.deleteQuietly(path); > assertFalse(path.exists()); > } {code} > > For the above two cases, when iterator/db is closed, `hasNext` will return > true, and `next` will return the value not obtained before close. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40036) LevelDB/RocksDBIterator.next should return false after iterator or db close
Yang Jie created SPARK-40036: Summary: LevelDB/RocksDBIterator.next should return false after iterator or db close Key: SPARK-40036 URL: https://issues.apache.org/jira/browse/SPARK-40036 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Yang Jie {code:java} @Test public void testHasNextAndNextAfterIteratorClose() throws Exception { String prefix = "test_db_iter_close."; String suffix = ".ldb"; File path = File.createTempFile(prefix, suffix); path.delete(); LevelDB db = new LevelDB(path); // Write one records for test db.write(createCustomType1(0)); KVStoreIterator iter = db.view(CustomType1.class).closeableIterator(); // iter should be true assertTrue(iter.hasNext()); // close iter iter.close(); // iter.hasNext should be false after iter close assertFalse(iter.hasNext()); // iter.next should throw NoSuchElementException after iter close assertThrows(NoSuchElementException.class, iter::next); db.close(); assertTrue(path.exists()); FileUtils.deleteQuietly(path); assertFalse(path.exists()); } @Test public void testHasNextAndNextAfterDBClose() throws Exception { String prefix = "test_db_db_close."; String suffix = ".ldb"; File path = File.createTempFile(prefix, suffix); path.delete(); LevelDB db = new LevelDB(path); // Write one record for test db.write(createCustomType1(0)); KVStoreIterator iter = db.view(CustomType1.class).closeableIterator(); // iter should be true assertTrue(iter.hasNext()); // close db db.close(); // iter.hasNext should be false after db close assertFalse(iter.hasNext()); // iter.next should throw NoSuchElementException after db close assertThrows(NoSuchElementException.class, iter::next); assertTrue(path.exists()); FileUtils.deleteQuietly(path); assertFalse(path.exists()); } {code} For the above two cases, when iterator/db is closed, `hasNext` will return true, and `next` will return the value not obtained before close. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38899) DS V2 supports push down datetime functions
[ https://issues.apache.org/jira/browse/SPARK-38899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578052#comment-17578052 ] Apache Spark commented on SPARK-38899: -- User 'chenzhx' has created a pull request for this issue: https://github.com/apache/spark/pull/37469 > DS V2 supports push down datetime functions > --- > > Key: SPARK-38899 > URL: https://issues.apache.org/jira/browse/SPARK-38899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Zhixiong Chen >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38899) DS V2 supports push down datetime functions
[ https://issues.apache.org/jira/browse/SPARK-38899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578051#comment-17578051 ] Apache Spark commented on SPARK-38899: -- User 'chenzhx' has created a pull request for this issue: https://github.com/apache/spark/pull/37469 > DS V2 supports push down datetime functions > --- > > Key: SPARK-38899 > URL: https://issues.apache.org/jira/browse/SPARK-38899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Zhixiong Chen >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40034) PathOutputCommitters to work with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-40034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40034: Assignee: (was: Apache Spark) > PathOutputCommitters to work with dynamic partition overwrite > - > > Key: SPARK-40034 > URL: https://issues.apache.org/jira/browse/SPARK-40034 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Steve Loughran >Priority: Minor > > sibling of MAPREDUCE-7403: allow PathOutputCommitter implementation to > declare that they support the semantics required by spark dynamic > partitioning: > * rename to work as expected > * working dir to be on same fs as final dir > They will do this through implementing StreamCapabilities and adding a new > probe, "mapreduce.job.committer.dynamic.partitioning" ; the spark side > changes are to > * postpone rejection of dynamic partition overwrite until the output > committer is created > * allow it if the committer implements StreamCapabilities and returns true > for {{hasCapability("mapreduce.job.committer.dynamic.partitioning"))) > this isn't going to be supported by the s3a committers, they don't meet the > requirements. The manifest committer of MAPREDUCE-7341 running against abfs > and gcs does work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40034) PathOutputCommitters to work with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-40034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578014#comment-17578014 ] Apache Spark commented on SPARK-40034: -- User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/37468 > PathOutputCommitters to work with dynamic partition overwrite > - > > Key: SPARK-40034 > URL: https://issues.apache.org/jira/browse/SPARK-40034 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Steve Loughran >Priority: Minor > > sibling of MAPREDUCE-7403: allow PathOutputCommitter implementation to > declare that they support the semantics required by spark dynamic > partitioning: > * rename to work as expected > * working dir to be on same fs as final dir > They will do this through implementing StreamCapabilities and adding a new > probe, "mapreduce.job.committer.dynamic.partitioning" ; the spark side > changes are to > * postpone rejection of dynamic partition overwrite until the output > committer is created > * allow it if the committer implements StreamCapabilities and returns true > for {{hasCapability("mapreduce.job.committer.dynamic.partitioning"))) > this isn't going to be supported by the s3a committers, they don't meet the > requirements. The manifest committer of MAPREDUCE-7341 running against abfs > and gcs does work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40034) PathOutputCommitters to work with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-40034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40034: Assignee: Apache Spark > PathOutputCommitters to work with dynamic partition overwrite > - > > Key: SPARK-40034 > URL: https://issues.apache.org/jira/browse/SPARK-40034 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Minor > > sibling of MAPREDUCE-7403: allow PathOutputCommitter implementation to > declare that they support the semantics required by spark dynamic > partitioning: > * rename to work as expected > * working dir to be on same fs as final dir > They will do this through implementing StreamCapabilities and adding a new > probe, "mapreduce.job.committer.dynamic.partitioning" ; the spark side > changes are to > * postpone rejection of dynamic partition overwrite until the output > committer is created > * allow it if the committer implements StreamCapabilities and returns true > for {{hasCapability("mapreduce.job.committer.dynamic.partitioning"))) > this isn't going to be supported by the s3a committers, they don't meet the > requirements. The manifest committer of MAPREDUCE-7341 running against abfs > and gcs does work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
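To make the SPARK-40034 proposal concrete, the sketch below shows the shape of the capability probe the ticket describes: a committer opts in to dynamic partition overwrite by implementing Hadoop's StreamCapabilities and answering true for the new key. The helper and object names are hypothetical, and this is not the code from the linked pull request; only the capability string comes from the ticket text.

{code:scala}
// Illustrative sketch of the probe described in SPARK-40034 / MAPREDUCE-7403.
import org.apache.hadoop.fs.StreamCapabilities
import org.apache.hadoop.mapreduce.OutputCommitter

object DynamicPartitioningSupport {
  val DynamicPartitioningCapability = "mapreduce.job.committer.dynamic.partitioning"

  // Decide, once the committer has been instantiated, whether dynamic partition
  // overwrite may be allowed instead of being rejected up front.
  def committerSupportsDynamicPartitioning(committer: OutputCommitter): Boolean =
    committer match {
      case c: StreamCapabilities => c.hasCapability(DynamicPartitioningCapability)
      case _ => false
    }
}
{code}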
[jira] [Assigned] (SPARK-40035) Avoid apply filter twice when listing files
[ https://issues.apache.org/jira/browse/SPARK-40035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40035: Assignee: (was: Apache Spark) > Avoid apply filter twice when listing files > --- > > Key: SPARK-40035 > URL: https://issues.apache.org/jira/browse/SPARK-40035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: EdisonWang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40035) Avoid apply filter twice when listing files
[ https://issues.apache.org/jira/browse/SPARK-40035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578008#comment-17578008 ] Apache Spark commented on SPARK-40035: -- User 'WangGuangxin' has created a pull request for this issue: https://github.com/apache/spark/pull/37467 > Avoid apply filter twice when listing files > --- > > Key: SPARK-40035 > URL: https://issues.apache.org/jira/browse/SPARK-40035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: EdisonWang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40035) Avoid apply filter twice when listing files
[ https://issues.apache.org/jira/browse/SPARK-40035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578009#comment-17578009 ] Apache Spark commented on SPARK-40035: -- User 'WangGuangxin' has created a pull request for this issue: https://github.com/apache/spark/pull/37467 > Avoid apply filter twice when listing files > --- > > Key: SPARK-40035 > URL: https://issues.apache.org/jira/browse/SPARK-40035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: EdisonWang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40035) Avoid apply filter twice when listing files
[ https://issues.apache.org/jira/browse/SPARK-40035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40035: Assignee: Apache Spark > Avoid apply filter twice when listing files > --- > > Key: SPARK-40035 > URL: https://issues.apache.org/jira/browse/SPARK-40035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: EdisonWang >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38910) Clean sparkStaging dir should before unregister()
[ https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38910: -- Fix Version/s: 3.4.0 > Clean sparkStaging dir should before unregister() > - > > Key: SPARK-38910 > URL: https://issues.apache.org/jira/browse/SPARK-38910 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.1, 3.3.0 >Reporter: angerszhu >Priority: Minor > Fix For: 3.4.0 > > > {code:java} > ShutdownHookManager.addShutdownHook(priority) { () => > try { > val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf) > val isLastAttempt = appAttemptId.getAttemptId() >= maxAppAttempts > if (!finished) { > // The default state of ApplicationMaster is failed if it is > invoked by shut down hook. > // This behavior is different compared to 1.x version. > // If user application is exited ahead of time by calling > System.exit(N), here mark > // this application as failed with EXIT_EARLY. For a good > shutdown, user shouldn't call > // System.exit(0) to terminate the application. > finish(finalStatus, > ApplicationMaster.EXIT_EARLY, > "Shutdown hook called before final status was reported.") > } > if (!unregistered) { > // we only want to unregister if we don't want the RM to retry > if (finalStatus == FinalApplicationStatus.SUCCEEDED || > isLastAttempt) { > unregister(finalStatus, finalMsg) > cleanupStagingDir(new > Path(System.getenv("SPARK_YARN_STAGING_DIR"))) > } > } > } catch { > case e: Throwable => > logWarning("Ignoring Exception while stopping ApplicationMaster > from shutdown hook", e) > } > }{code} > unregister may throw an exception, so the staging dir should be cleaned before unregister. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
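A self-contained Scala sketch of the ordering this issue proposes. The helper bodies below are stand-ins for the real ApplicationMaster internals, chosen only to show why cleaning the staging dir first keeps a failing unregister() from skipping the cleanup.

{code:scala}
object ShutdownOrderingSketch {
  // Stand-ins for the real ApplicationMaster methods (assumptions for illustration).
  private def cleanupStagingDir(): Unit = println("sparkStaging dir deleted")
  private def unregister(): Unit = throw new RuntimeException("RM unreachable")

  def main(args: Array[String]): Unit = {
    try {
      // Clean first: even if unregister() throws, the staging dir is already gone.
      cleanupStagingDir()
      unregister()
    } catch {
      case e: Throwable =>
        println(s"Ignoring exception while stopping ApplicationMaster: ${e.getMessage}")
    }
  }
}
{code}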
[jira] [Created] (SPARK-40035) Avoid apply filter twice when listing files
EdisonWang created SPARK-40035: -- Summary: Avoid apply filter twice when listing files Key: SPARK-40035 URL: https://issues.apache.org/jira/browse/SPARK-40035 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: EdisonWang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38910) Clean sparkStaging dir should before unregister()
[ https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-38910. --- Resolution: Fixed > Clean sparkStaging dir should before unregister() > - > > Key: SPARK-38910 > URL: https://issues.apache.org/jira/browse/SPARK-38910 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.1, 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.4.0 > > > {code:java} > ShutdownHookManager.addShutdownHook(priority) { () => > try { > val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf) > val isLastAttempt = appAttemptId.getAttemptId() >= maxAppAttempts > if (!finished) { > // The default state of ApplicationMaster is failed if it is > invoked by shut down hook. > // This behavior is different compared to 1.x version. > // If user application is exited ahead of time by calling > System.exit(N), here mark > // this application as failed with EXIT_EARLY. For a good > shutdown, user shouldn't call > // System.exit(0) to terminate the application. > finish(finalStatus, > ApplicationMaster.EXIT_EARLY, > "Shutdown hook called before final status was reported.") > } > if (!unregistered) { > // we only want to unregister if we don't want the RM to retry > if (finalStatus == FinalApplicationStatus.SUCCEEDED || > isLastAttempt) { > unregister(finalStatus, finalMsg) > cleanupStagingDir(new > Path(System.getenv("SPARK_YARN_STAGING_DIR"))) > } > } > } catch { > case e: Throwable => > logWarning("Ignoring Exception while stopping ApplicationMaster > from shutdown hook", e) > } > }{code} > unregister may throw an exception, so the staging dir should be cleaned before unregister. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38910) Clean sparkStaging dir should before unregister()
[ https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-38910: - Assignee: angerszhu > Clean sparkStaging dir should before unregister() > - > > Key: SPARK-38910 > URL: https://issues.apache.org/jira/browse/SPARK-38910 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.1, 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.4.0 > > > {code:java} > ShutdownHookManager.addShutdownHook(priority) { () => > try { > val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf) > val isLastAttempt = appAttemptId.getAttemptId() >= maxAppAttempts > if (!finished) { > // The default state of ApplicationMaster is failed if it is > invoked by shut down hook. > // This behavior is different compared to 1.x version. > // If user application is exited ahead of time by calling > System.exit(N), here mark > // this application as failed with EXIT_EARLY. For a good > shutdown, user shouldn't call > // System.exit(0) to terminate the application. > finish(finalStatus, > ApplicationMaster.EXIT_EARLY, > "Shutdown hook called before final status was reported.") > } > if (!unregistered) { > // we only want to unregister if we don't want the RM to retry > if (finalStatus == FinalApplicationStatus.SUCCEEDED || > isLastAttempt) { > unregister(finalStatus, finalMsg) > cleanupStagingDir(new > Path(System.getenv("SPARK_YARN_STAGING_DIR"))) > } > } > } catch { > case e: Throwable => > logWarning("Ignoring Exception while stopping ApplicationMaster > from shutdown hook", e) > } > }{code} > unregister may throw an exception, so the staging dir should be cleaned before unregister. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40034) PathOutputCommitters to work with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-40034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-40034: --- Summary: PathOutputCommitters to work with dynamic partition overwrite (was: PathOutputCommitters to work with dynamic partition overwrite -if they support it) > PathOutputCommitters to work with dynamic partition overwrite > - > > Key: SPARK-40034 > URL: https://issues.apache.org/jira/browse/SPARK-40034 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Steve Loughran >Priority: Minor > > Sibling of MAPREDUCE-7403: allow PathOutputCommitter implementations to > declare that they support the semantics required by Spark dynamic > partitioning: > * rename to work as expected > * working dir to be on same fs as final dir > They will do this through implementing StreamCapabilities and adding a new > probe, "mapreduce.job.committer.dynamic.partitioning"; the Spark-side > changes are to > * postpone rejection of dynamic partition overwrite until the output > committer is created > * allow it if the committer implements StreamCapabilities and returns true > for {{hasCapability("mapreduce.job.committer.dynamic.partitioning")}} > This isn't going to be supported by the s3a committers, as they don't meet the > requirements. The manifest committer of MAPREDUCE-7341 running against abfs > and gcs does work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40034) PathOutputCommitters to work with dynamic partition overwrite -if they support it
Steve Loughran created SPARK-40034: -- Summary: PathOutputCommitters to work with dynamic partition overwrite -if they support it Key: SPARK-40034 URL: https://issues.apache.org/jira/browse/SPARK-40034 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 3.4.0 Reporter: Steve Loughran Sibling of MAPREDUCE-7403: allow PathOutputCommitter implementations to declare that they support the semantics required by Spark dynamic partitioning: * rename to work as expected * working dir to be on same fs as final dir They will do this through implementing StreamCapabilities and adding a new probe, "mapreduce.job.committer.dynamic.partitioning"; the Spark-side changes are to * postpone rejection of dynamic partition overwrite until the output committer is created * allow it if the committer implements StreamCapabilities and returns true for {{hasCapability("mapreduce.job.committer.dynamic.partitioning")}} This isn't going to be supported by the s3a committers, as they don't meet the requirements. The manifest committer of MAPREDUCE-7341 running against abfs and gcs does work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39734) Add call_udf to pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-39734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-39734: - Assignee: Ruifeng Zheng (was: Andrew Ray) > Add call_udf to pyspark.sql.functions > - > > Key: SPARK-39734 > URL: https://issues.apache.org/jira/browse/SPARK-39734 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > > Add the call_udf function to PySpark for parity with the Scala/Java function > org.apache.spark.sql.functions#call_udf -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
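For context, a small sketch of how the existing Scala API is used today; the UDF name and data are made up for illustration, and the new PySpark function is expected to mirror this shape rather than being shown here.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.call_udf

object CallUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("call_udf sketch").getOrCreate()
    import spark.implicits._

    // Register a UDF by name, then invoke it by that name via call_udf.
    spark.udf.register("double_it", (x: Int) => x * 2)
    Seq(1, 2, 3).toDF("n").select(call_udf("double_it", $"n").as("doubled")).show()

    spark.stop()
  }
}
{code}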
[jira] [Assigned] (SPARK-36259) Expose localtimestamp in pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-36259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-36259: - Assignee: Ruifeng Zheng > Expose localtimestamp in pyspark.sql.functions > -- > > Key: SPARK-36259 > URL: https://issues.apache.org/jira/browse/SPARK-36259 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > > localtimestamp is available in the scala sql functions, but currently not in > pyspark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39733) Add map_contains_key to pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-39733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-39733: - Assignee: Ruifeng Zheng > Add map_contains_key to pyspark.sql.functions > - > > Key: SPARK-39733 > URL: https://issues.apache.org/jira/browse/SPARK-39733 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Ruifeng Zheng >Priority: Minor > > SPARK-37584 added the function map_contains_key to SQL and Scala/Java > functions. This JIRA is to track its addition to the PySpark function set for > parity. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
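As a reference point, a minimal sketch of the Scala side added by SPARK-37584, which the PySpark addition is meant to match; the literal map used here is purely illustrative.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, map, map_contains_key}

object MapContainsKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("map_contains_key sketch").getOrCreate()

    // Returns true when the map column contains the given key.
    spark.range(1)
      .select(map_contains_key(map(lit("a"), lit(1)), "a").as("has_a"))
      .show()

    spark.stop()
  }
}
{code}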
[jira] [Resolved] (SPARK-37348) PySpark pmod function
[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-37348. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37449 [https://github.com/apache/spark/pull/37449] > PySpark pmod function > - > > Key: SPARK-37348 > URL: https://issues.apache.org/jira/browse/SPARK-37348 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Tim Schwab >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > > Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns > -1. However, the modulus is often desired instead of the remainder. > > There is a [PMOD() function in Spark > SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in > PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. > So at the moment, the two options for getting the modulus are to use > F.expr("pmod(A, B)"), or create a helper function such as: > > {code:java} > def pmod(dividend, divisor): > return F.when(dividend < 0, (dividend % divisor) + > divisor).otherwise(dividend % divisor){code} > > > Neither is optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
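A tiny Scala illustration of the remainder-versus-modulus distinction behind this request; it is plain arithmetic, independent of the Spark APIs involved.

{code:scala}
object PmodSketch {
  // Positive modulus built from the JVM remainder operator.
  def pmod(a: Long, b: Long): Long = ((a % b) + b) % b

  def main(args: Array[String]): Unit = {
    println(-1L % 2L)      // -1: the remainder semantics that % currently exposes
    println(pmod(-1L, 2L)) //  1: the modulus that SQL's PMOD() returns
  }
}
{code}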
[jira] [Assigned] (SPARK-37348) PySpark pmod function
[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-37348: - Assignee: Ruifeng Zheng > PySpark pmod function > - > > Key: SPARK-37348 > URL: https://issues.apache.org/jira/browse/SPARK-37348 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Tim Schwab >Assignee: Ruifeng Zheng >Priority: Minor > > Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns > -1. However, the modulus is often desired instead of the remainder. > > There is a [PMOD() function in Spark > SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in > PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. > So at the moment, the two options for getting the modulus are to use > F.expr("pmod(A, B)"), or create a helper function such as: > > {code:java} > def pmod(dividend, divisor): > return F.when(dividend < 0, (dividend % divisor) + > divisor).otherwise(dividend % divisor){code} > > > Neither is optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39733) Add map_contains_key to pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-39733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-39733. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37449 [https://github.com/apache/spark/pull/37449] > Add map_contains_key to pyspark.sql.functions > - > > Key: SPARK-39733 > URL: https://issues.apache.org/jira/browse/SPARK-39733 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > > SPARK-37584 added the function map_contains_key to SQL and Scala/Java > functions. This JIRA is to track its addition to the PySpark function set for > parity. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36259) Expose localtimestamp in pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-36259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-36259. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37449 [https://github.com/apache/spark/pull/37449] > Expose localtimestamp in pyspark.sql.functions > -- > > Key: SPARK-36259 > URL: https://issues.apache.org/jira/browse/SPARK-36259 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > Fix For: 3.4.0 > > > localtimestamp is available in the scala sql functions, but currently not in > pyspark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39734) Add call_udf to pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-39734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-39734: - Assignee: Andrew Ray > Add call_udf to pyspark.sql.functions > - > > Key: SPARK-39734 > URL: https://issues.apache.org/jira/browse/SPARK-39734 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Andrew Ray >Priority: Minor > > Add the call_udf function to PySpark for parity with the Scala/Java function > org.apache.spark.sql.functions#call_udf -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39734) Add call_udf to pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-39734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-39734. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37449 [https://github.com/apache/spark/pull/37449] > Add call_udf to pyspark.sql.functions > - > > Key: SPARK-39734 > URL: https://issues.apache.org/jira/browse/SPARK-39734 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Assignee: Andrew Ray >Priority: Minor > Fix For: 3.4.0 > > > Add the call_udf function to PySpark for parity with the Scala/Java function > org.apache.spark.sql.functions#call_udf -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40029) Make pyspark.sql.types examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577946#comment-17577946 ] Apache Spark commented on SPARK-40029: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37465 > Make pyspark.sql.types examples self-contained > -- > > Key: SPARK-40029 > URL: https://issues.apache.org/jira/browse/SPARK-40029 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40029) Make pyspark.sql.types examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577945#comment-17577945 ] Apache Spark commented on SPARK-40029: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37465 > Make pyspark.sql.types examples self-contained > -- > > Key: SPARK-40029 > URL: https://issues.apache.org/jira/browse/SPARK-40029 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40029) Make pyspark.sql.types examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40029: Assignee: Apache Spark > Make pyspark.sql.types examples self-contained > -- > > Key: SPARK-40029 > URL: https://issues.apache.org/jira/browse/SPARK-40029 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40029) Make pyspark.sql.types examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40029: Assignee: (was: Apache Spark) > Make pyspark.sql.types examples self-contained > -- > > Key: SPARK-40029 > URL: https://issues.apache.org/jira/browse/SPARK-40029 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40032) Support Decimal128 type
[ https://issues.apache.org/jira/browse/SPARK-40032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577942#comment-17577942 ] jiaan.geng commented on SPARK-40032: We are editing the design doc. > Support Decimal128 type > --- > > Key: SPARK-40032 > URL: https://issues.apache.org/jira/browse/SPARK-40032 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > Attachments: Performance comparison between decimal128 and spark > decimal benchmark.pdf > > > Spark SQL today supports the DECIMAL data type. The implementation of Decimal > can hold a BigDecimal or a Long. Decimal provides some operators like +, > -, *, / and so on. > Take + as an example; the implementation is shown below. > {code:java} > def + (that: Decimal): Decimal = { > if (decimalVal.eq(null) && that.decimalVal.eq(null) && scale == > that.scale) { > Decimal(longVal + that.longVal, Math.max(precision, that.precision) + > 1, scale) > } else { > Decimal(toBigDecimal.bigDecimal.add(that.toBigDecimal.bigDecimal)) > } > } > {code} > We can see there are two additions and a call to Decimal.apply. The add operator > of BigDecimal will construct a new BigDecimal instance. > The implementation of Decimal.apply will call new to construct a new Decimal > instance with the new BigDecimal instance. > As we know, the Decimal instance will hold the new BigDecimal instance. > If a large table has a Decimal field called 'colA, the execution of > SUM('colA) will involve the creation of a large number of Decimal instances > and BigDecimal instances. These Decimal instances and BigDecimal instances > cause garbage collection to occur frequently. > Decimal128 is a high-performance decimal type, about 8X more efficient than Java > BigDecimal for typical operations. It uses a finite (128 bit) precision and > can handle up to decimal(38, X). It is also "mutable" so you can change the > contents of an existing object. This helps reduce the cost of new() and > garbage collection. > We have generated a benchmark report comparing Spark Decimal, Java > BigDecimal and Decimal128. Please see the attachment. > In this new feature, we will introduce DECIMAL128 to accelerate decimal > calculation. > h3. Milestone 1 – Spark Decimal equivalency (The new Decimal type Decimal128 > meets or exceeds all functions of the existing SQL Decimal): > * Add a new DataType implementation for Decimal128. > * Support Decimal128 in Dataset/UDF. > * Decimal128 literals > * Decimal128 arithmetic (e.g. Decimal128 + Decimal128, Decimal128 - Decimal) > * Decimal or Math functions/operators: POWER, LOG, Round, etc > * Cast to and from Decimal128, cast String/Decimal to Decimal128, cast > Decimal128 to string (pretty printing)/Decimal, with the SQL syntax to > specify the types > * Support sorting Decimal128. > h3. Milestone 2 – Persistence: > * Ability to create tables of type Decimal128 > * Ability to write to common file formats such as Parquet and JSON. > * INSERT, SELECT, UPDATE, MERGE > * Discovery > h3. Milestone 3 – Client support > * JDBC support > * Hive Thrift server > h3. Milestone 4 – PySpark and Spark R integration > * Python UDF can take and return Decimal128 > * DataFrame support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
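To make the allocation argument tangible, here is a toy, self-contained Scala sketch. It is emphatically not the proposed Decimal128 design (which is 128-bit and far more involved); it only contrasts immutable BigDecimal accumulation, which allocates a new object per row, with a mutable fixed-point accumulator that does not.

{code:scala}
import java.math.BigDecimal

object MutableAccumulationSketch {
  // Toy contrast only; NOT the proposed Decimal128 implementation.

  // Immutable accumulation: every add() allocates a fresh BigDecimal.
  def sumImmutable(unscaledCents: Array[Long]): BigDecimal = {
    var acc = BigDecimal.ZERO
    for (v <- unscaledCents) {
      acc = acc.add(BigDecimal.valueOf(v, 2))
    }
    acc
  }

  // Mutable accumulation: the accumulator is updated in place, no per-row objects.
  final class MutableFixedPoint(var unscaled: Long, val scale: Int) {
    def addInPlace(otherUnscaled: Long): Unit = {
      unscaled = Math.addExact(unscaled, otherUnscaled)
    }
  }

  def sumMutable(unscaledCents: Array[Long]): MutableFixedPoint = {
    val acc = new MutableFixedPoint(0L, 2)
    unscaledCents.foreach(v => acc.addInPlace(v))
    acc
  }

  def main(args: Array[String]): Unit = {
    val cents = Array(123L, -45L, 678L) // 1.23, -0.45, 6.78
    println(sumImmutable(cents))                               // 7.56
    println(BigDecimal.valueOf(sumMutable(cents).unscaled, 2)) // 7.56
  }
}
{code}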
[jira] [Updated] (SPARK-40032) Support Decimal128 type
[ https://issues.apache.org/jira/browse/SPARK-40032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-40032: --- Attachment: Performance comparison between decimal128 and spark decimal benchmark.pdf > Support Decimal128 type > --- > > Key: SPARK-40032 > URL: https://issues.apache.org/jira/browse/SPARK-40032 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > Attachments: Performance comparison between decimal128 and spark > decimal benchmark.pdf > > > Spark SQL today supports the DECIMAL data type. The implementation of Decimal > can hold a BigDecimal or a Long. Decimal provides some operators like +, > -, *, / and so on. > Take + as an example; the implementation is shown below. > {code:java} > def + (that: Decimal): Decimal = { > if (decimalVal.eq(null) && that.decimalVal.eq(null) && scale == > that.scale) { > Decimal(longVal + that.longVal, Math.max(precision, that.precision) + > 1, scale) > } else { > Decimal(toBigDecimal.bigDecimal.add(that.toBigDecimal.bigDecimal)) > } > } > {code} > We can see there are two additions and a call to Decimal.apply. The add operator > of BigDecimal will construct a new BigDecimal instance. > The implementation of Decimal.apply will call new to construct a new Decimal > instance with the new BigDecimal instance. > As we know, the Decimal instance will hold the new BigDecimal instance. > If a large table has a Decimal field called 'colA, the execution of > SUM('colA) will involve the creation of a large number of Decimal instances > and BigDecimal instances. These Decimal instances and BigDecimal instances > cause garbage collection to occur frequently. > Decimal128 is a high-performance decimal type, about 8X more efficient than Java > BigDecimal for typical operations. It uses a finite (128 bit) precision and > can handle up to decimal(38, X). It is also "mutable" so you can change the > contents of an existing object. This helps reduce the cost of new() and > garbage collection. > We have generated a benchmark report comparing Spark Decimal, Java > BigDecimal and Decimal128. Please see the attachment. > In this new feature, we will introduce DECIMAL128 to accelerate decimal > calculation. > h3. Milestone 1 – Spark Decimal equivalency (The new Decimal type Decimal128 > meets or exceeds all functions of the existing SQL Decimal): > * Add a new DataType implementation for Decimal128. > * Support Decimal128 in Dataset/UDF. > * Decimal128 literals > * Decimal128 arithmetic (e.g. Decimal128 + Decimal128, Decimal128 - Decimal) > * Decimal or Math functions/operators: POWER, LOG, Round, etc > * Cast to and from Decimal128, cast String/Decimal to Decimal128, cast > Decimal128 to string (pretty printing)/Decimal, with the SQL syntax to > specify the types > * Support sorting Decimal128. > h3. Milestone 2 – Persistence: > * Ability to create tables of type Decimal128 > * Ability to write to common file formats such as Parquet and JSON. > * INSERT, SELECT, UPDATE, MERGE > * Discovery > h3. Milestone 3 – Client support > * JDBC support > * Hive Thrift server > h3. Milestone 4 – PySpark and Spark R integration > * Python UDF can take and return Decimal128 > * DataFrame support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40032) Support Decimal128 type
[ https://issues.apache.org/jira/browse/SPARK-40032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-40032: --- Attachment: (was: Performance comparison between decimal128 and spark decimal benchmark.pdf) > Support Decimal128 type > --- > > Key: SPARK-40032 > URL: https://issues.apache.org/jira/browse/SPARK-40032 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Spark SQL today supports the DECIMAL data type. The implementation of Decimal > can hold a BigDecimal or a Long. Decimal provides some operators like +, > -, *, / and so on. > Take + as an example; the implementation is shown below. > {code:java} > def + (that: Decimal): Decimal = { > if (decimalVal.eq(null) && that.decimalVal.eq(null) && scale == > that.scale) { > Decimal(longVal + that.longVal, Math.max(precision, that.precision) + > 1, scale) > } else { > Decimal(toBigDecimal.bigDecimal.add(that.toBigDecimal.bigDecimal)) > } > } > {code} > We can see there are two additions and a call to Decimal.apply. The add operator > of BigDecimal will construct a new BigDecimal instance. > The implementation of Decimal.apply will call new to construct a new Decimal > instance with the new BigDecimal instance. > As we know, the Decimal instance will hold the new BigDecimal instance. > If a large table has a Decimal field called 'colA, the execution of > SUM('colA) will involve the creation of a large number of Decimal instances > and BigDecimal instances. These Decimal instances and BigDecimal instances > cause garbage collection to occur frequently. > Decimal128 is a high-performance decimal type, about 8X more efficient than Java > BigDecimal for typical operations. It uses a finite (128 bit) precision and > can handle up to decimal(38, X). It is also "mutable" so you can change the > contents of an existing object. This helps reduce the cost of new() and > garbage collection. > We have generated a benchmark report comparing Spark Decimal, Java > BigDecimal and Decimal128. Please see the attachment. > In this new feature, we will introduce DECIMAL128 to accelerate decimal > calculation. > h3. Milestone 1 – Spark Decimal equivalency (The new Decimal type Decimal128 > meets or exceeds all functions of the existing SQL Decimal): > * Add a new DataType implementation for Decimal128. > * Support Decimal128 in Dataset/UDF. > * Decimal128 literals > * Decimal128 arithmetic (e.g. Decimal128 + Decimal128, Decimal128 - Decimal) > * Decimal or Math functions/operators: POWER, LOG, Round, etc > * Cast to and from Decimal128, cast String/Decimal to Decimal128, cast > Decimal128 to string (pretty printing)/Decimal, with the SQL syntax to > specify the types > * Support sorting Decimal128. > h3. Milestone 2 – Persistence: > * Ability to create tables of type Decimal128 > * Ability to write to common file formats such as Parquet and JSON. > * INSERT, SELECT, UPDATE, MERGE > * Discovery > h3. Milestone 3 – Client support > * JDBC support > * Hive Thrift server > h3. Milestone 4 – PySpark and Spark R integration > * Python UDF can take and return Decimal128 > * DataFrame support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org