[jira] [Assigned] (SPARK-37627) Add sorted column in BucketTransform
[ https://issues.apache.org/jira/browse/SPARK-37627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37627:
Assignee: Apache Spark

> Add sorted column in BucketTransform
>
> Key: SPARK-37627
> URL: https://issues.apache.org/jira/browse/SPARK-37627
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Huaxin Gao
> Assignee: Apache Spark
> Priority: Minor
>
> In V1, we can create a table with sorted buckets as follows:
> {code:java}
> sql("CREATE TABLE tbl(a INT, b INT) USING parquet " +
>   "CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS")
> {code}
> However, creating a table with sorted buckets in V2 fails with an exception:
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot convert bucketing with sort
> columns to a transform.
> {code}
> We should be able to create tables with sorted buckets in V2.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37627) Add sorted column in BucketTransform
[ https://issues.apache.org/jira/browse/SPARK-37627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37627:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-37627) Add sorted column in BucketTransform
[ https://issues.apache.org/jira/browse/SPARK-37627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458185#comment-17458185 ]

Apache Spark commented on SPARK-37627:
User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34879
[jira] [Created] (SPARK-37627) Add sorted column in BucketTransform
Huaxin Gao created SPARK-37627:

Summary: Add sorted column in BucketTransform
Key: SPARK-37627
URL: https://issues.apache.org/jira/browse/SPARK-37627
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.3.0
Reporter: Huaxin Gao

In V1, we can create a table with sorted buckets as follows:
{code:java}
sql("CREATE TABLE tbl(a INT, b INT) USING parquet " +
  "CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS")
{code}
However, creating a table with sorted buckets in V2 fails with an exception:
{code:java}
org.apache.spark.sql.AnalysisException: Cannot convert bucketing with sort columns to a transform.
{code}
We should be able to create tables with sorted buckets in V2.
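[Editor's note] The two paths the description contrasts can be sketched as plain SQL. The V1 statement goes to the session catalog; the V2 statement below assumes a DataSourceV2 catalog registered under the hypothetical name {{testcat}} (the catalog name and namespace are illustrative, not from the ticket):

{code:sql}
-- V1 (session catalog): sorted buckets work.
CREATE TABLE tbl (a INT, b INT) USING parquet
CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS;

-- V2 (catalog plugin): before this change, the same clause fails with
-- "Cannot convert bucketing with sort columns to a transform".
CREATE TABLE testcat.ns.tbl (a INT, b INT)
CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS;
{code}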
[jira] [Comment Edited] (SPARK-37626) Upgrade libthrift to 0.15.0
[ https://issues.apache.org/jira/browse/SPARK-37626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458178#comment-17458178 ]

Bo Zhang edited comment on SPARK-37626 at 12/13/21, 7:26 AM:
0.16.0 is not released yet. Could we upgrade to 0.15.0 first?

was (Author: bozhang):
0.16.0 is not released yet. Could we upgrade to 0.15.0 first? Here is the PR for that: https://github.com/apache/spark/pull/34878

> Upgrade libthrift to 0.15.0
>
> Key: SPARK-37626
> URL: https://issues.apache.org/jira/browse/SPARK-37626
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 3.3.0
> Reporter: Bo Zhang
> Priority: Major
> Fix For: 3.3.0
>
> Upgrade libthrift to 0.15.0 in order to avoid
> https://nvd.nist.gov/vuln/detail/CVE-2020-13949.
[jira] [Assigned] (SPARK-37626) Upgrade libthrift to 0.15.0
[ https://issues.apache.org/jira/browse/SPARK-37626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37626:
Assignee: Apache Spark
[jira] [Assigned] (SPARK-37626) Upgrade libthrift to 0.15.0
[ https://issues.apache.org/jira/browse/SPARK-37626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37626:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-37626) Upgrade libthrift to 0.15.0
[ https://issues.apache.org/jira/browse/SPARK-37626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458178#comment-17458178 ]

Bo Zhang commented on SPARK-37626:
0.16.0 is not released yet. Could we upgrade to 0.15.0 first? Here is the PR for that: https://github.com/apache/spark/pull/34878
[jira] [Commented] (SPARK-37626) Upgrade libthrift to 0.15.0
[ https://issues.apache.org/jira/browse/SPARK-37626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458177#comment-17458177 ]

Apache Spark commented on SPARK-37626:
User 'bozhang2820' has created a pull request for this issue:
https://github.com/apache/spark/pull/34878
[jira] [Updated] (SPARK-37626) Upgrade libthrift to 0.15.0
[ https://issues.apache.org/jira/browse/SPARK-37626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Zhang updated SPARK-37626:
Summary: Upgrade libthrift to 0.15.0 (was: Upgrade libthrift to 1.15.0)
[jira] [Commented] (SPARK-37626) Upgrade libthrift to 1.15.0
[ https://issues.apache.org/jira/browse/SPARK-37626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458173#comment-17458173 ]

Yuming Wang commented on SPARK-37626:
We need to upgrade to 0.16.0 because we need [this patch|https://github.com/apache/thrift/pull/2470].
[jira] [Updated] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType
[ https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-37577:
Environment: (was: Py: 3.9)

> ClassCastException: ArrayType cannot be cast to StructType
>
> Key: SPARK-37577
> URL: https://issues.apache.org/jira/browse/SPARK-37577
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Rafal Wojdyla
> Assignee: L. C. Hsieh
> Priority: Major
> Fix For: 3.3.0
>
> Reproduction:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([StructField('o', ArrayType(StructType([StructField('s',
>     StringType(), False), StructField('b',
>     ArrayType(StructType([StructField('e', StringType(),
>     False)]), True), False)]), True), False)])
> (
>     spark.createDataFrame([], schema=t)
>     .select(F.explode("o").alias("eo"))
>     .select("eo.*")
>     .select(F.explode("b"))
>     .count()
> )
> {code}
> The code above works fine in 3.1.2 but fails in 3.2.0. See the stacktrace below. Note
> that if you remove field {{s}}, the code works fine, which is a bit
> unexpected and likely a clue.
> {noformat}
> Py4JJavaError: An error occurred while calling o156.count.
> : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
>   at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107)
>   at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107)
>   at org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117)
>   at org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372)
>   at org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508)
>   at org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368)
>   at org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:101)
>   at org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$.unapply(NestedColumnAliasing.scala:366)
>   at
[jira] [Resolved] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType
[ https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-37577.
Fix Version/s: 3.3.0
Assignee: L. C. Hsieh
Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/34845
[jira] [Created] (SPARK-37626) Upgrade libthrift to 1.15.0
Bo Zhang created SPARK-37626:

Summary: Upgrade libthrift to 1.15.0
Key: SPARK-37626
URL: https://issues.apache.org/jira/browse/SPARK-37626
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 3.3.0
Reporter: Bo Zhang
Fix For: 3.3.0

Upgrade libthrift to 1.15.0 in order to avoid https://nvd.nist.gov/vuln/detail/CVE-2020-13949.
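[Editor's note] The fix this ticket converges on is a plain dependency bump. For a downstream build that wants the patched Thrift client ahead of a Spark release, a sketch of the Maven override — the coordinates are the standard Apache Thrift ones, and 0.15.0 is the version the later comments settle on:

{code:xml}
<dependency>
  <groupId>org.apache.thrift</groupId>
  <artifactId>libthrift</artifactId>
  <version>0.15.0</version>
</dependency>
{code}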
[jira] [Commented] (SPARK-37625) update log4j to 2.15
[ https://issues.apache.org/jira/browse/SPARK-37625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458164#comment-17458164 ]

Apache Spark commented on SPARK-37625:
User 'qwe' has created a pull request for this issue:
https://github.com/apache/spark/pull/34877

> update log4j to 2.15
>
> Key: SPARK-37625
> URL: https://issues.apache.org/jira/browse/SPARK-37625
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: weifeng zhang
> Priority: Major
[jira] [Assigned] (SPARK-37625) update log4j to 2.15
[ https://issues.apache.org/jira/browse/SPARK-37625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37625:
Assignee: Apache Spark
[jira] [Commented] (SPARK-37625) update log4j to 2.15
[ https://issues.apache.org/jira/browse/SPARK-37625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458163#comment-17458163 ]

Apache Spark commented on SPARK-37625:
User 'qwe' has created a pull request for this issue:
https://github.com/apache/spark/pull/34877
[jira] [Assigned] (SPARK-37625) update log4j to 2.15
[ https://issues.apache.org/jira/browse/SPARK-37625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37625:
Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-37625) update log4j to 2.15
weifeng zhang created SPARK-37625:

Summary: update log4j to 2.15
Key: SPARK-37625
URL: https://issues.apache.org/jira/browse/SPARK-37625
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.2.0
Reporter: weifeng zhang
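[Editor's note] Spark 3.2 still logs through log4j 1.x, so the proposal here effectively means moving to the log4j2 artifacts; 2.15.0 is the release that addresses CVE-2021-44228 ("Log4Shell"). A sketch of the Maven coordinates a downstream project would pin — whether Spark itself can switch wholesale is what the linked PR explores, not something this sketch settles:

{code:xml}
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-api</artifactId>
  <version>2.15.0</version>
</dependency>
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-core</artifactId>
  <version>2.15.0</version>
</dependency>
{code}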
[jira] [Reopened] (SPARK-36571) Optimized FileOutputCommitter with StagingDir
[ https://issues.apache.org/jira/browse/SPARK-36571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu reopened SPARK-36571:

> Optimized FileOutputCommitter with StagingDir
>
> Key: SPARK-36571
> URL: https://issues.apache.org/jira/browse/SPARK-36571
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: angerszhu
> Priority: Major
[jira] [Resolved] (SPARK-36995) Add new SQLFileCommitProtocol
[ https://issues.apache.org/jira/browse/SPARK-36995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu resolved SPARK-36995.
Resolution: Duplicate

> Add new SQLFileCommitProtocol
>
> Key: SPARK-36995
> URL: https://issues.apache.org/jira/browse/SPARK-36995
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: angerszhu
> Priority: Major
[jira] [Resolved] (SPARK-36571) Optimized FileOutputCommitter with StagingDir
[ https://issues.apache.org/jira/browse/SPARK-36571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu resolved SPARK-36571.
Resolution: Duplicate
[jira] [Assigned] (SPARK-37624) Suppress warnings for live pandas-on-Spark quickstart notebooks
[ https://issues.apache.org/jira/browse/SPARK-37624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37624:
Assignee: (was: Apache Spark)

> Suppress warnings for live pandas-on-Spark quickstart notebooks
>
> Key: SPARK-37624
> URL: https://issues.apache.org/jira/browse/SPARK-37624
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, PySpark
> Affects Versions: 3.2.0, 3.3.0
> Reporter: Hyukjin Kwon
> Priority: Minor
>
> Attachments: Screen Shot 2021-12-13 at 1.32.05 PM.png, Screen Shot
> 2021-12-13 at 2.02.20 PM.png, Screen Shot 2021-12-13 at 2.02.25 PM.png,
> Screen Shot 2021-12-13 at 2.02.45 PM.png
>
> https://mybinder.org/v2/gh/apache/spark/9e614e265f?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
> shows a bunch of warnings. We should hide them, since this is a quickstart
> guide for users.
[jira] [Commented] (SPARK-37624) Suppress warnings for live pandas-on-Spark quickstart notebooks
[ https://issues.apache.org/jira/browse/SPARK-37624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458153#comment-17458153 ]

Apache Spark commented on SPARK-37624:
User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34875
[jira] [Assigned] (SPARK-37624) Suppress warnings for live pandas-on-Spark quickstart notebooks
[ https://issues.apache.org/jira/browse/SPARK-37624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37624:
Assignee: Apache Spark
[jira] [Updated] (SPARK-37624) Suppress warnings for live pandas-on-Spark quickstart notebooks
[ https://issues.apache.org/jira/browse/SPARK-37624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-37624:
Summary: Suppress warnings for live pandas-on-Spark quickstart notebooks (was: Surppress warnings for live pandas-on-Spark quickstart notebooks)
[jira] [Updated] (SPARK-37623) Support ANSI Aggregate Function: regr_slope & regr_intercept
[ https://issues.apache.org/jira/browse/SPARK-37623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jiaan.geng updated SPARK-37623:
Summary: Support ANSI Aggregate Function: regr_slope & regr_intercept (was: Support ANSI Aggregate Function: regr_slope)

> Support ANSI Aggregate Function: regr_slope & regr_intercept
>
> Key: SPARK-37623
> URL: https://issues.apache.org/jira/browse/SPARK-37623
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: jiaan.geng
> Priority: Major
>
> REGR_SLOPE is an ANSI aggregate function; many databases support it.
[jira] [Updated] (SPARK-37624) Surppress warnings for live pandas-on-Spark quickstart notebooks
[ https://issues.apache.org/jira/browse/SPARK-37624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-37624:
Attachment: Screen Shot 2021-12-13 at 2.02.25 PM.png
[jira] [Updated] (SPARK-37624) Suppress warnings for live pandas-on-Spark quickstart notebooks
[ https://issues.apache.org/jira/browse/SPARK-37624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37624: - Attachment: Screen Shot 2021-12-13 at 1.32.05 PM.png
[jira] [Updated] (SPARK-37624) Suppress warnings for live pandas-on-Spark quickstart notebooks
[ https://issues.apache.org/jira/browse/SPARK-37624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37624: - Attachment: Screen Shot 2021-12-13 at 2.02.45 PM.png
[jira] [Updated] (SPARK-37624) Suppress warnings for live pandas-on-Spark quickstart notebooks
[ https://issues.apache.org/jira/browse/SPARK-37624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37624: - Priority: Minor (was: Major)
[jira] [Updated] (SPARK-37624) Suppress warnings for live pandas-on-Spark quickstart notebooks
[ https://issues.apache.org/jira/browse/SPARK-37624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37624: - Attachment: Screen Shot 2021-12-13 at 2.02.20 PM.png
[jira] [Created] (SPARK-37624) Suppress warnings for live pandas-on-Spark quickstart notebooks
Hyukjin Kwon created SPARK-37624: Summary: Suppress warnings for live pandas-on-Spark quickstart notebooks Key: SPARK-37624 URL: https://issues.apache.org/jira/browse/SPARK-37624 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 3.2.0, 3.3.0 Reporter: Hyukjin Kwon
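One common way to keep such warnings out of a rendered notebook (a generic Python sketch, not necessarily the fix that was merged for this issue) is to install a warnings filter before the noisy calls:

```python
import warnings

# Silence FutureWarning/UserWarning noise for the rest of the session;
# in a quickstart notebook this would go in the first (possibly hidden) cell.
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Alternatively, scope the filter to a single block instead of the session:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    warnings.warn("noisy library warning", UserWarning)  # not shown
print("no warnings printed above")
```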
[jira] [Resolved] (SPARK-37569) View Analysis incorrectly marks nested fields as nullable
[ https://issues.apache.org/jira/browse/SPARK-37569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37569. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34839 [https://github.com/apache/spark/pull/34839] > View Analysis incorrectly marks nested fields as nullable > - > > Key: SPARK-37569 > URL: https://issues.apache.org/jira/browse/SPARK-37569 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Shardul Mahadik >Assignee: Shardul Mahadik >Priority: Major > Fix For: 3.3.0 > > > Consider a view as follows with all fields non-nullable (required) > {code:java} > spark.sql(""" > CREATE OR REPLACE VIEW v AS > SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > """) > {code} > we can see that the view schema has been correctly stored as non-nullable > {code:java} > scala> > System.out.println(spark.sessionState.catalog.externalCatalog.getTable("default", > "v2")) > CatalogTable( > Database: default > Table: v2 > Owner: smahadik > Created Time: Tue Dec 07 09:00:42 PST 2021 > Last Access: UNKNOWN > Created By: Spark 3.3.0-SNAPSHOT > Type: VIEW > View Text: SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > View Original Text: SELECT id, named_struct('a', id) AS nested > FROM RANGE(10) > View Catalog and Namespace: spark_catalog.default > View Query Output Columns: [id, nested] > Table Properties: [transient_lastDdlTime=1638896442] > Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat > OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > Storage Properties: [serialization.format=1] > Schema: root > |-- id: long (nullable = false) > |-- nested: struct (nullable = false) > ||-- a: long (nullable = false) > ) > {code} > However, when trying to read this view, it incorrectly marks nested column > {{a}} as nullable > {code:java} > scala> 
spark.table("v2").printSchema > root > |-- id: long (nullable = false) > |-- nested: struct (nullable = false) > ||-- a: long (nullable = true) > {code} > This is caused by [this > line|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3546] > in Analyzer.scala. Going through the history of changes for this block of > code, it seems like {{asNullable}} is a remnant of a time before we added > [checks|https://github.com/apache/spark/blob/fb40c0e19f84f2de9a3d69d809e9e4031f76ef90/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3543] > to ensure that the from and to types of the cast were compatible. As > nullability is already checked, it should be safe to add a cast without > converting the target datatype to nullable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
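The argument above — that the `asNullable` widening is redundant once nullability compatibility has been checked — can be sketched abstractly in Python (hypothetical helper names and schema encoding, not Spark code):

```python
# Model a schema as {field: (type, nullable)} and check whether casting the
# view's query output to the stored view schema is safe without widening
# every target field to nullable (the redundancy SPARK-37569 describes).
def cast_is_safe(from_schema, to_schema):
    for name, (to_type, to_nullable) in to_schema.items():
        from_type, from_nullable = from_schema[name]
        if from_type != to_type:
            return False
        # A non-nullable source may feed any target, but a nullable source
        # must not feed a non-nullable target.
        if from_nullable and not to_nullable:
            return False
    return True

stored = {"id": ("long", False), "nested.a": ("long", False)}
query  = {"id": ("long", False), "nested.a": ("long", False)}
print(cast_is_safe(query, stored))  # True: no need to mark nested.a nullable
```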
[jira] [Assigned] (SPARK-37569) View Analysis incorrectly marks nested fields as nullable
[ https://issues.apache.org/jira/browse/SPARK-37569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37569: --- Assignee: Shardul Mahadik
[jira] [Resolved] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-37590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37590. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34842 [https://github.com/apache/spark/pull/34842] > Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests > > > Key: SPARK-37590 > URL: https://issues.apache.org/jira/browse/SPARK-37590 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.3.0 > > > Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37590) Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-37590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37590: --- Assignee: Terry Kim
[jira] [Resolved] (SPARK-37300) TaskSchedulerImpl should ignore task finished event if its task was already finished state
[ https://issues.apache.org/jira/browse/SPARK-37300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi resolved SPARK-37300. -- Fix Version/s: 3.3.0 Assignee: hujiahua Resolution: Fixed Issue resolved by https://github.com/apache/spark/pull/34578 > TaskSchedulerImpl should ignore task finished event if its task was already > finished state > -- > > Key: SPARK-37300 > URL: https://issues.apache.org/jira/browse/SPARK-37300 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: hujiahua >Assignee: hujiahua >Priority: Major > Fix For: 3.3.0 > > > `TaskSchedulerImpl` handles task-finished events in `handleSuccessfulTask` and > `handleFailedTask`, but in some cases the task is already in a finished state, in > which case the task-finished event should be ignored. > Case description: > when an executor finishes a task of some stage, the driver receives a > StatusUpdate event to handle. At the same time, the driver may find that the > executor's heartbeat timed out, so the driver also needs to handle an ExecutorLost > event simultaneously. There is a race condition here, which can leave > TaskSetManager.successful and TaskSetManager.tasksSuccessful with wrong results. > More detailed description and discussion can be viewed at > https://issues.apache.org/jira/browse/SPARK-36575 and > https://github.com/apache/spark/pull/33872
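The proposed guard — drop a finished-task event when the task is already in a terminal state — can be sketched as a tiny state machine in Python (illustrative names only, not Spark's actual classes):

```python
# Minimal sketch of ignoring duplicate terminal transitions, as in the race
# between StatusUpdate (task success) and ExecutorLost (task failure).
TERMINAL = {"FINISHED", "FAILED", "KILLED"}

class TaskTracker:
    def __init__(self):
        self.state = "RUNNING"
        self.tasks_successful = 0

    def handle_finished(self, new_state):
        if self.state in TERMINAL:
            return False  # already counted once; ignore the late event
        self.state = new_state
        if new_state == "FINISHED":
            self.tasks_successful += 1
        return True

t = TaskTracker()
t.handle_finished("FINISHED")   # StatusUpdate wins the race
t.handle_finished("FAILED")     # ExecutorLost arrives late and is ignored
print(t.state, t.tasks_successful)  # FINISHED 1
```

Without the early-return guard, the late ExecutorLost event would flip the state to FAILED after the success had already been counted, which is exactly the inconsistent-bookkeeping symptom the ticket describes.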
[jira] [Commented] (SPARK-37623) Support ANSI Aggregate Function: regr_slope
[ https://issues.apache.org/jira/browse/SPARK-37623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458132#comment-17458132 ] jiaan.geng commented on SPARK-37623: I'm working on it.
[jira] [Created] (SPARK-37623) Support ANSI Aggregate Function: regr_slope
jiaan.geng created SPARK-37623: -- Summary: Support ANSI Aggregate Function: regr_slope Key: SPARK-37623 URL: https://issues.apache.org/jira/browse/SPARK-37623 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: jiaan.geng
[jira] [Updated] (SPARK-37623) Support ANSI Aggregate Function: regr_slope
[ https://issues.apache.org/jira/browse/SPARK-37623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-37623: --- Description: REGR_SLOPE is an ANSI aggregate function; many databases support it.
[jira] [Commented] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting
[ https://issues.apache.org/jira/browse/SPARK-37481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458131#comment-17458131 ] wuyi commented on SPARK-37481: -- Backport fix to 3.1/3.0 is still in progress. > Disappearance of skipped stages mislead the bug hunting > > > Key: SPARK-37481 > URL: https://issues.apache.org/jira/browse/SPARK-37481 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.2.0, 3.3.0 > > > # > ## With FetchFailedException and Map Stage Retries > When rerunning spark-sql shell with the original SQL in > [https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315] > !https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png! > 1. stage 3 threw FetchFailedException and caused itself and its parent > stage(stage 2) to retry > 2. stage 2 was skipped before but its attemptId was still 0, so when its > retry happened it got removed from `Skipped Stages` > The DAG of Job 2 doesn't show that stage 2 is skipped anymore. > !https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png! > Besides, a retried stage usually has a subset of tasks from the original > stage. If we mark it as an original one, the metrics might lead us into > pitfalls. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting
[ https://issues.apache.org/jira/browse/SPARK-37481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi resolved SPARK-37481. -- Fix Version/s: 3.3.0 3.2.0 Assignee: Kent Yao Resolution: Fixed Issue resolved by https://github.com/apache/spark/pull/34735
[jira] [Commented] (SPARK-37620) Use more precise types for SparkContext Optional fields (i.e. _gateway, _jvm)
[ https://issues.apache.org/jira/browse/SPARK-37620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458128#comment-17458128 ] Byron Hsu commented on SPARK-37620: --- I will work on it. > Use more precise types for SparkContext Optional fields (i.e. _gateway, _jvm) > -- > > Key: SPARK-37620 > URL: https://issues.apache.org/jira/browse/SPARK-37620 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > As a part of SPARK-37152 [we > agreed|https://github.com/apache/spark/pull/34466/files#r762609181] to keep > these typed as not {{Optional}} for simplicity, but this is just a temporary > solution to move things forward until we decide on the best approach to handle > such cases.
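The trade-off the ticket describes — typing such fields as `Optional[...]` forces every caller to narrow away `None` — can be shown with a minimal sketch (hypothetical classes, not the real SparkContext internals):

```python
from typing import Optional

class Gateway:
    """Stand-in for the py4j gateway a context would hold."""
    def jvm_version(self) -> str:
        return "1.8"

class Context:
    # Typed precisely: the gateway is absent until the context is started.
    _gateway: Optional[Gateway]

    def __init__(self) -> None:
        self._gateway = None

    def start(self) -> None:
        self._gateway = Gateway()

    def jvm_version(self) -> str:
        # With Optional, type checkers force a None check before each use;
        # typing the field as plain Gateway would hide this failure mode.
        if self._gateway is None:
            raise RuntimeError("context not started")
        return self._gateway.jvm_version()

ctx = Context()
ctx.start()
print(ctx.jvm_version())  # 1.8
```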
[jira] [Resolved] (SPARK-37369) Avoid redundant ColumnarToRow transition on InMemoryTableScan
[ https://issues.apache.org/jira/browse/SPARK-37369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-37369. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34642 [https://github.com/apache/spark/pull/34642] > Avoid redundant ColumnarToRow transition on InMemoryTableScan > -- > > Key: SPARK-37369 > URL: https://issues.apache.org/jira/browse/SPARK-37369 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.3.0 > > > We have a rule to insert a columnar transition between row-based and columnar > query plans. InMemoryTableScanExec can produce columnar output, so if its > parent plan isn't columnar, the rule adds a ColumnarToRow between them. > But InMemoryTableScanExec is a special query plan because it can convert > cached batches to either columnar batches or rows. > Currently, we ask InMemoryTableScanExec to convert cached batches to > columnar batches and then convert those to rows in the added ColumnarToRow, before > the parent query. > Instead, we can simply ask InMemoryTableScanExec to produce row > output and avoid the redundant conversion. > ``` > +- Union > :- ColumnarToRow > : +- InMemoryTableScan [i#8, j#9] > : +- InMemoryRelation [i#8, j#9], StorageLevel(disk, memory, deserialized, 1 replicas) > ```
[jira] [Assigned] (SPARK-37369) Avoid redundant ColumnarToRow transistion on InMemoryTableScan
[ https://issues.apache.org/jira/browse/SPARK-37369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-37369: --- Assignee: L. C. Hsieh
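The optimization amounts to a small planner rule: if the parent needs rows and the scan can emit rows directly, skip the intermediate ColumnarToRow node. A toy Python sketch of that rule (hypothetical node names, not Spark's planner):

```python
# Toy plan nodes: each declares whether it outputs columnar batches, and
# whether (like InMemoryTableScanExec) it could emit rows directly instead.
class Node:
    def __init__(self, name, columnar, child=None, can_emit_rows=False):
        self.name, self.columnar = name, columnar
        self.child, self.can_emit_rows = child, can_emit_rows

def insert_transitions(node):
    """Insert ColumnarToRow only when the columnar child cannot simply be
    asked to produce rows itself (the SPARK-37369 shortcut)."""
    if node.child is None:
        return node
    node.child = insert_transitions(node.child)
    if not node.columnar and node.child.columnar:
        if node.child.can_emit_rows:
            node.child.columnar = False  # ask the scan for row output
        else:
            node.child = Node("ColumnarToRow", False, node.child)
    return node

scan = Node("InMemoryTableScan", columnar=True, can_emit_rows=True)
plan = insert_transitions(Node("Union", columnar=False, child=scan))
print(plan.child.name)  # InMemoryTableScan: no ColumnarToRow inserted
```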
[jira] [Commented] (SPARK-37622) Support K8s executor rolling policy
[ https://issues.apache.org/jira/browse/SPARK-37622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458116#comment-17458116 ] Apache Spark commented on SPARK-37622: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/34874 > Support K8s executor rolling policy > --- > > Key: SPARK-37622 > URL: https://issues.apache.org/jira/browse/SPARK-37622 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37622) Support K8s executor rolling policy
[ https://issues.apache.org/jira/browse/SPARK-37622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37622: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-37622) Support K8s executor rolling policy
[ https://issues.apache.org/jira/browse/SPARK-37622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37622: Assignee: Apache Spark
[jira] [Commented] (SPARK-37622) Support K8s executor rolling policy
[ https://issues.apache.org/jira/browse/SPARK-37622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458115#comment-17458115 ] Apache Spark commented on SPARK-37622: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/34874
[jira] [Resolved] (SPARK-36038) Basic speculation metrics at stage level
[ https://issues.apache.org/jira/browse/SPARK-36038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-36038. Assignee: Thejdeep Gudivada Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34607 > Basic speculation metrics at stage level > > > Key: SPARK-36038 > URL: https://issues.apache.org/jira/browse/SPARK-36038 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Venkata krishnan Sowrirajan >Assignee: Thejdeep Gudivada >Priority: Major > Fix For: 3.3.0 > > > Currently there are no speculation metrics available either at the application > level or at the stage level. Within our platform, we have added speculation > metrics at the stage level as a summary, similar to the stage-level metrics, > tracking numTotalSpeculated, numCompleted (successful), numFailed, numKilled, > etc. This enables us to effectively understand the speculative execution feature > at an application level and helps in further tuning the speculation configs. > cc [~ron8hu]
[jira] [Created] (SPARK-37622) Support K8s executor rolling policy
Dongjoon Hyun created SPARK-37622: - Summary: Support K8s executor rolling policy Key: SPARK-37622 URL: https://issues.apache.org/jira/browse/SPARK-37622 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 3.3.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37619) Upgrade Maven to 3.8.4
[ https://issues.apache.org/jira/browse/SPARK-37619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37619: Assignee: Dongjoon Hyun > Upgrade Maven to 3.8.4 > -- > > Key: SPARK-37619 > URL: https://issues.apache.org/jira/browse/SPARK-37619 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37619) Upgrade Maven to 3.8.4
[ https://issues.apache.org/jira/browse/SPARK-37619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37619. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34873 [https://github.com/apache/spark/pull/34873] > Upgrade Maven to 3.8.4 > -- > > Key: SPARK-37619 > URL: https://issues.apache.org/jira/browse/SPARK-37619 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37603) org.apache.spark.sql.catalyst.expressions.ListQuery; no valid constructor
[ https://issues.apache.org/jira/browse/SPARK-37603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37603. -- Resolution: Cannot Reproduce > org.apache.spark.sql.catalyst.expressions.ListQuery; no valid constructor > - > > Key: SPARK-37603 > URL: https://issues.apache.org/jira/browse/SPARK-37603 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: huangweiguo >Priority: Major > > I don't use the class org.apache.spark.sql.catalyst.expressions.ListQuery directly, but it appears to trigger a serialization exception (maybe): > {code:java} > // stack > Caused by: java.io.InvalidClassException: org.apache.spark.sql.catalyst.expressions.ListQuery; no valid constructor > at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150) > at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:790) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2001) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > ... (the defaultReadFields/readSerialData/readOrdinaryObject/readObject0 cycle repeats for each level of the nested object graph) > {code}
[jira] [Commented] (SPARK-37607) Refactor antlr4 syntax file
[ https://issues.apache.org/jira/browse/SPARK-37607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458105#comment-17458105 ] Hyukjin Kwon commented on SPARK-37607: -- [~melin] can you double-check the link? It shows no page found. > Refactor antlr4 syntax file > --- > > Key: SPARK-37607 > URL: https://issues.apache.org/jira/browse/SPARK-37607 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: melin >Priority: Major > > Referring to the ShardingSphere project, the grammar could be split into multiple files by type, which is very clear: > https://github.com/apache/shardingsphere/tree/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-mysql/src/main/antlr4/imports/mysql -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37608) Can anyone help me to compile a scala file
[ https://issues.apache.org/jira/browse/SPARK-37608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37608. -- Resolution: Invalid > Can anyone help me to compile a scala file > -- > > Key: SPARK-37608 > URL: https://issues.apache.org/jira/browse/SPARK-37608 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.2.3 > Environment: I am currently using spark 2.3.2.3.1.5.0-152 and Scala > 2.11.12 >Reporter: Gary Liu >Priority: Major > > I am working with Spark SQL, trying to read data from SAS using JDBC. I have > to use a customized JdbcDialect to handle the special format of the SAS data. Here > is the code: > {code:java} > import org.apache.spark.sql.jdbc.JdbcDialect > > object SasDialect extends JdbcDialect { > override def canHandle(url: String): Boolean = > url.startsWith("jdbc:sharenet") > override def quoteIdentifier(colName: String): String = "\"" + colName + > "\"n" > } > {code} > It worked well in Scala, but my main language is Python, so I need to > compile this into a jar file so I can call it from Python, like > {code:python} > from py4j.java_gateway import java_import > gw = spark.sparkContext._gateway > java_import(gw.jvm, "com.me.SasJDBCDialect") > gw.jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect( > gw.jvm.com.me.SasJDBCDialect()){code} > But I am a newbie in Scala and don't know how to make it a jar file so it > can be called as "com.me.MyJDBCDialect". Can anyone here do me a favour and > generate a jar file for me? > > Thanks! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37607) Refactor antlr4 syntax file
[ https://issues.apache.org/jira/browse/SPARK-37607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37607: - Summary: Refactor antlr4 syntax file (was: Optimized antlr4 syntax file) > Refactor antlr4 syntax file > --- > > Key: SPARK-37607 > URL: https://issues.apache.org/jira/browse/SPARK-37607 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: melin >Priority: Major > > Referring to the ShardingSphere project, the grammar could be split into multiple files by type, which is very clear: > https://github.com/apache/shardingsphere/tree/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-mysql/src/main/antlr4/imports/mysql -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37608) Can anyone help me to compile a scala file
[ https://issues.apache.org/jira/browse/SPARK-37608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458104#comment-17458104 ] Hyukjin Kwon commented on SPARK-37608: -- For questions, let's discuss in the Spark mailing list instead of filing a ticket - you would be able to get better answers there. > Can anyone help me to compile a scala file > -- > > Key: SPARK-37608 > URL: https://issues.apache.org/jira/browse/SPARK-37608 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.2.3 > Environment: I am currently using spark 2.3.2.3.1.5.0-152 and Scala > 2.11.12 >Reporter: Gary Liu >Priority: Major > > I am working with Spark SQL, trying to read data from SAS using JDBC. I have > to use a customized JdbcDialect to handle the special format of the SAS data. Here > is the code: > {code:java} > import org.apache.spark.sql.jdbc.JdbcDialect > > object SasDialect extends JdbcDialect { > override def canHandle(url: String): Boolean = > url.startsWith("jdbc:sharenet") > override def quoteIdentifier(colName: String): String = "\"" + colName + > "\"n" > } > {code} > It worked well in Scala, but my main language is Python, so I need to > compile this into a jar file so I can call it from Python, like > {code:python} > from py4j.java_gateway import java_import > gw = spark.sparkContext._gateway > java_import(gw.jvm, "com.me.SasJDBCDialect") > gw.jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect( > gw.jvm.com.me.SasJDBCDialect()){code} > But I am a newbie in Scala and don't know how to make it a jar file so it > can be called as "com.me.MyJDBCDialect". Can anyone here do me a favour and > generate a jar file for me? > > Thanks! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37609) Transient StackOverflowError on DataFrame from Catalyst QueryPlan
[ https://issues.apache.org/jira/browse/SPARK-37609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458103#comment-17458103 ] Hyukjin Kwon commented on SPARK-37609: -- [~ravwojdyla] I don't think people will dare to reproduce and debug for further investigation. It would be great to have a minimised, self-contained reproducer here. > Transient StackOverflowError on DataFrame from Catalyst QueryPlan > - > > Key: SPARK-37609 > URL: https://issues.apache.org/jira/browse/SPARK-37609 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.2 > Environment: py:3.9 >Reporter: Rafal Wojdyla >Priority: Major > > I sporadically observe a StackOverflowError from Catalyst's QueryPlan (for a > relatively complicated query); below is a stacktrace from the {{count}} on > that DF. It's a bit troubling because it's a transient error: with enough > retries (no change to code, probably some kind of cache?), I can get the op > to work :( > {noformat} > --- > Py4JJavaError Traceback (most recent call last) > ~/miniconda3/envs/tr-dev/lib/python3.9/site-packages/pyspark/sql/dataframe.py > in count(self) > 662 2 > 663 """ > --> 664 return int(self._jdf.count()) > 665 > 666 def collect(self): > ~/miniconda3/envs/tr-dev/lib/python3.9/site-packages/py4j/java_gateway.py in > __call__(self, *args) >1302 >1303 answer = self.gateway_client.send_command(command) > -> 1304 return_value = get_return_value( >1305 answer, self.gateway_client, self.target_id, self.name) >1306 > ~/miniconda3/envs/tr-dev/lib/python3.9/site-packages/pyspark/sql/utils.py in > deco(*a, **kw) > 109 def deco(*a, **kw): > 110 try: > --> 111 return f(*a, **kw) > 112 except py4j.protocol.Py4JJavaError as e: > 113 converted = convert_exception(e.java_exception) > ~/miniconda3/envs/tr-dev/lib/python3.9/site-packages/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client) > 325 if answer[1] == 
REFERENCE_TYPE: > --> 326 raise Py4JJavaError( > 327 "An error occurred while calling {0}{1}{2}.\n". > 328 format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o9123.count. > : java.lang.StackOverflowError > at > org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:188) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:192) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:192) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
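[Editor's note] For readers hitting the same transient StackOverflowError: the repeating rewrite$1/mapChildren frames above show that Catalyst transforms plans recursively, so a sufficiently deep plan can exhaust the default JVM thread stack. A common workaround (not a fix for any underlying bug) is to enlarge the thread stack; the property names below are standard Spark configuration, but the 16m value is an arbitrary illustration to tune per workload:

```
# spark-defaults.conf (illustrative values, not a recommendation)
# Give deeply nested Catalyst plan rewrites more recursion headroom.
spark.driver.extraJavaOptions    -Xss16m
spark.executor.extraJavaOptions  -Xss16m
```

Breaking the lineage with df.checkpoint() (or by writing an intermediate table) also shortens the plan and sidesteps the deep recursion entirely.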
[jira] [Commented] (SPARK-37610) Anonymized/obfuscated query plan?
[ https://issues.apache.org/jira/browse/SPARK-37610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458102#comment-17458102 ] Hyukjin Kwon commented on SPARK-37610: -- I think it's better to have a workload-specific script to address this. I doubt we can have a generalized way to hide sensitive information for the whole query plan, since it can contain arbitrary information. > Anonymized/obfuscated query plan? > - > > Key: SPARK-37610 > URL: https://issues.apache.org/jira/browse/SPARK-37610 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.2 >Reporter: Rafal Wojdyla >Priority: Major > > I would like to share a query plan for a specific issue > (https://issues.apache.org/jira/browse/SPARK-37609), but can't without > at least anonymising the column names. If I could call {{explain}} in a way that > anonymises/obfuscates the column names, that would be useful. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
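[Editor's note] A workload-specific script of the kind suggested in the comment above could simply capture the text of explain() and rewrite identifiers before sharing. The sketch below is a hypothetical illustration, not Spark API: the `name#exprId` pattern matches how Spark prints attributes, but real plans embed column names in more places than this one regex covers.

```python
import re


def anonymize_plan(plan: str) -> str:
    """Replace each distinct column-like identifier with a stable
    placeholder (col1, col2, ...), keeping Spark's #exprId suffix so
    the plan structure stays readable."""
    mapping = {}

    def repl(m):
        name = m.group(1)
        if name not in mapping:
            mapping[name] = f"col{len(mapping) + 1}"
        return f"{mapping[name]}#{m.group(2)}"

    # Spark prints attributes as name#exprId, e.g. user_email#42 or salary#7L
    return re.sub(r"([a-z_][a-z0-9_]*)#(\d+L?)", repl, plan)


plan = "Project [user_email#42, salary#7L]\n+- Filter (salary#7L > 100)"
print(anonymize_plan(plan))
```

The same attribute always maps to the same placeholder, so joins and filters remain traceable in the redacted plan.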
[jira] [Updated] (SPARK-37610) Anonymized/obfuscated query plan
[ https://issues.apache.org/jira/browse/SPARK-37610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37610: - Summary: Anonymized/obfuscated query plan (was: Anonymized/obfuscated query plan?) > Anonymized/obfuscated query plan > > > Key: SPARK-37610 > URL: https://issues.apache.org/jira/browse/SPARK-37610 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.2 >Reporter: Rafal Wojdyla >Priority: Major > > I would like to share a query plan for a specific issue > (https://issues.apache.org/jira/browse/SPARK-37609), but can't without > at least anonymising the column names. If I could call {{explain}} in a way that > anonymises/obfuscates the column names, that would be useful. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37621) ClassCastException when trying to persist the result of a join between two Iceberg tables
[ https://issues.apache.org/jira/browse/SPARK-37621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458101#comment-17458101 ] Hyukjin Kwon commented on SPARK-37621: -- Is this an issue specific to Iceberg, or do other sources in Spark hit it too? > ClassCastException when trying to persist the result of a join between two > Iceberg tables > - > > Key: SPARK-37621 > URL: https://issues.apache.org/jira/browse/SPARK-37621 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.2 >Reporter: Ciprian Gerea >Priority: Major > > I am getting an error when I try to persist the result of a Join operation. > Note that both tables to be joined and the output table are Iceberg tables. > SQL code to repro: > String sqlJoin = String.format( > "SELECT * from " + > "((select %s from %s.%s where %s ) l " + > "join (select %s from %s.%s where %s ) r " + > "using (%s))", > ); > spark.sql(sqlJoin).writeTo("ciptest.ttt").option("write-format", > "parquet").createOrReplace(); > My exception stack is: > {{Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow}} > {{at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$1.writeValue(UnsafeRowSerializer.scala:64)}} > {{at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)}} > {{at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)}} > {{at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)}} > {{at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)}} > {{at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)}} > {{at org.apache.spark.scheduler.Task.run(Task.scala:131)}} > {{at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)}} > {{at ….}} > Explain on the 
Sql statement gets the following plan: > {{== Physical Plan ==}} > {{Project [ ... ]}} > {{+- SortMergeJoin […], Inner}} > {{ :- Sort […], false, 0}} > {{ : +- Exchange hashpartitioning(…, 10), ENSURE_REQUIREMENTS, [id=#38]}} > {{ : +- Filter (…)}} > {{ :+- BatchScan[... ] left [filters=…]}} > {{ +- *(2) Sort […], false, 0}} > {{ +- Exchange hashpartitioning(…, 10), ENSURE_REQUIREMENTS, [id=#47]}} > {{ +- *(1) Filter (…)}} > {{ +- BatchScan[…] right [filters=…] }} > {{Note that several variations of this fail. Besides the repro code listed > above I have tried doing CTAS and trying to write the result into parquet > files without making a table out of it.}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37617) In CTAS,replace alias to normal column that satisfy parquet schema
[ https://issues.apache.org/jira/browse/SPARK-37617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-37617: --- Description: In CTAS, name the columns that have no alias. Usually, columns without an alias are operator expressions such as sum or divide, which lead to a Parquet schema check error. (was: In CTAS, Replace name columns that have not alias. Mostly, columns without alias always is operator such as sum, divide that will lead to schema check error.) > In CTAS,replace alias to normal column that satisfy parquet schema > -- > > Key: SPARK-37617 > URL: https://issues.apache.org/jira/browse/SPARK-37617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiahong.li >Priority: Minor > Fix For: 3.2.0 > > > In CTAS, name the columns that have no alias. Usually, columns without > an alias are operator expressions such as sum or divide, which lead to a Parquet schema > check error. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
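[Editor's note] To illustrate why unaliased expression columns break CTAS into Parquet: Spark derives the output column name from the expression text (e.g. `sum(a)`), and Parquet field names may not contain characters such as spaces, commas, or parentheses. The checker below mirrors that restriction as a sketch; the exact rejected-character set is an assumption modelled on Spark's ParquetSchemaConverter check, not quoted from this thread.

```python
# Characters assumed to be rejected in Parquet field names
# (modelled on Spark's ParquetSchemaConverter field-name check).
INVALID_CHARS = set(" ,;{}()\n\t=")


def is_valid_parquet_field_name(name: str) -> bool:
    """Return True if every character in the name is Parquet-safe."""
    return not any(ch in INVALID_CHARS for ch in name)


# An unaliased aggregate gets the auto-generated name "sum(a)" and fails...
print(is_valid_parquet_field_name("sum(a)"))   # contains '(' and ')'
# ...while writing "sum(a) AS total_a" in the CTAS select list fixes it:
print(is_valid_parquet_field_name("total_a"))
```

This is why the proposed fix, giving unaliased operator columns a proper name during CTAS, avoids the Parquet schema check error.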
[jira] [Assigned] (SPARK-37598) Pyspark's newAPIHadoopRDD() method fails with ShortWritables
[ https://issues.apache.org/jira/browse/SPARK-37598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37598: Assignee: Keith Massey > Pyspark's newAPIHadoopRDD() method fails with ShortWritables > > > Key: SPARK-37598 > URL: https://issues.apache.org/jira/browse/SPARK-37598 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0 >Reporter: Keith Massey >Assignee: Keith Massey >Priority: Minor > > If sc.newAPIHadoopRDD() is called from Pyspark using an InputFormat that has > a ShortWritable as a field, then the call to newAPIHadoopRDD() fails. The > reason is that ShortWritable is not explicitly handled by PythonHadoopUtil > the way that other numeric writables are (like LongWritable). The result is > that the ShortWritable is not converted to an object that can be serialized > by Spark, and a serialization error occurs. Below is an example stack trace > from within the pyspark shell: > {code:java} > >>> rdd = > >>> sc.newAPIHadoopRDD(inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat", > ... > keyClass="org.apache.hadoop.io.NullWritable", > ... > valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", > ... 
conf=conf) > 2021-12-08 14:38:40,439 ERROR scheduler.TaskSetManager: task 0.0 in stage > 15.0 (TID 31) had a not serializable result: > org.apache.hadoop.io.ShortWritable > Serialization stack: > - object not serializable (class: > org.apache.hadoop.io.ShortWritable, value: 1) > - writeObject data (class: java.util.HashMap) > - object (class java.util.HashMap, \{price=1}) > - field (class: scala.Tuple2, name: _2, type: class java.lang.Object) > - object (class scala.Tuple2, (1,\{price=1})) > - element of array (index: 0) > - array (class [Lscala.Tuple2;, size 1); not retrying > Traceback (most recent call last): > File "", line 4, in > File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/context.py", > line 853, in newAPIHadoopRDD > jconf, batchSize) > File > "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", > line 1305, in __call__ > File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/sql/utils.py", > line 111, in deco > return f(*a, **kw) > File > "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 > in stage 15.0 (TID 31) had a not serializable result: > org.apache.hadoop.io.ShortWritable > Serialization stack: > - object not serializable (class: > org.apache.hadoop.io.ShortWritable, value: 1) > - writeObject data (class: java.util.HashMap) > - object (class java.util.HashMap, \{price=1}) > - field (class: scala.Tuple2, name: _2, type: class java.lang.Object) > - object (class scala.Tuple2, (1,\{price=1})) > - element of array (index: 0) > - array (class [Lscala.Tuple2;, size 1) > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
[jira] [Resolved] (SPARK-37598) Pyspark's newAPIHadoopRDD() method fails with ShortWritables
[ https://issues.apache.org/jira/browse/SPARK-37598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37598. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34838 [https://github.com/apache/spark/pull/34838] > Pyspark's newAPIHadoopRDD() method fails with ShortWritables > > > Key: SPARK-37598 > URL: https://issues.apache.org/jira/browse/SPARK-37598 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0 >Reporter: Keith Massey >Assignee: Keith Massey >Priority: Minor > Fix For: 3.3.0 > > > If sc.newAPIHadoopRDD() is called from Pyspark using an InputFormat that has > a ShortWritable as a field, then the call to newAPIHadoopRDD() fails. The > reason is that ShortWritable is not explicitly handled by PythonHadoopUtil > the way that other numeric writables are (like LongWritable). The result is > that the ShortWritable is not converted to an object that can be serialized > by Spark, and a serialization error occurs. Below is an example stack trace > from within the pyspark shell: > {code:java} > >>> rdd = > >>> sc.newAPIHadoopRDD(inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat", > ... > keyClass="org.apache.hadoop.io.NullWritable", > ... > valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", > ... 
conf=conf) > 2021-12-08 14:38:40,439 ERROR scheduler.TaskSetManager: task 0.0 in stage > 15.0 (TID 31) had a not serializable result: > org.apache.hadoop.io.ShortWritable > Serialization stack: > - object not serializable (class: > org.apache.hadoop.io.ShortWritable, value: 1) > - writeObject data (class: java.util.HashMap) > - object (class java.util.HashMap, \{price=1}) > - field (class: scala.Tuple2, name: _2, type: class java.lang.Object) > - object (class scala.Tuple2, (1,\{price=1})) > - element of array (index: 0) > - array (class [Lscala.Tuple2;, size 1); not retrying > Traceback (most recent call last): > File "", line 4, in > File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/context.py", > line 853, in newAPIHadoopRDD > jconf, batchSize) > File > "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", > line 1305, in __call__ > File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/sql/utils.py", > line 111, in deco > return f(*a, **kw) > File > "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 > in stage 15.0 (TID 31) had a not serializable result: > org.apache.hadoop.io.ShortWritable > Serialization stack: > - object not serializable (class: > org.apache.hadoop.io.ShortWritable, value: 1) > - writeObject data (class: java.util.HashMap) > - object (class java.util.HashMap, \{price=1}) > - field (class: scala.Tuple2, name: _2, type: class java.lang.Object) > - object (class scala.Tuple2, (1,\{price=1})) > - element of array (index: 0) > - array (class [Lscala.Tuple2;, size 1) > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at
[jira] [Created] (SPARK-37621) ClassCastException when trying to persist the result of a join between two Iceberg tables
Ciprian Gerea created SPARK-37621: - Summary: ClassCastException when trying to persist the result of a join between two Iceberg tables Key: SPARK-37621 URL: https://issues.apache.org/jira/browse/SPARK-37621 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 3.1.2 Reporter: Ciprian Gerea I am getting an error when I try to persist the result of a Join operation. Note that both tables to be joined and the output table are Iceberg tables. SQL code to repro: String sqlJoin = String.format( "SELECT * from " + "((select %s from %s.%s where %s ) l " + "join (select %s from %s.%s where %s ) r " + "using (%s))", ); spark.sql(sqlJoin).writeTo("ciptest.ttt").option("write-format", "parquet").createOrReplace(); My exception stack is: {{Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow}} {{ at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$1.writeValue(UnsafeRowSerializer.scala:64)}} {{ at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)}} {{ at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)}} {{ at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)}} {{ at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)}} {{ at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)}} {{ at org.apache.spark.scheduler.Task.run(Task.scala:131)}} {{ at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)}} {{ at ….}} Explain on the Sql statement gets the following plan: {{== Physical Plan ==}} {{Project [ ... ]}} {{+- SortMergeJoin […], Inner}} {{ :- Sort […], false, 0}} {{ : +- Exchange hashpartitioning(…, 10), ENSURE_REQUIREMENTS, [id=#38]}} {{ : +- Filter (…)}} {{ :+- BatchScan[... 
] left [filters=…]}} {{ +- *(2) Sort […], false, 0}} {{ +- Exchange hashpartitioning(…, 10), ENSURE_REQUIREMENTS, [id=#47]}} {{ +- *(1) Filter (…)}} {{ +- BatchScan[…] right [filters=…] }} {{Note that several variations of this fail. Besides the repro code listed above I have tried doing CTAS and trying to write the result into parquet files without making a table out of it.}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37620) Use more precise types for SparkContext Optional fields (i.e. _gateway, _jvm)
[ https://issues.apache.org/jira/browse/SPARK-37620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458050#comment-17458050 ] Maciej Szymkiewicz commented on SPARK-37620: FYI [~ueshin] [~itholic] [~XinrongM] [~hyukjin.kwon] [~byronhsu] > Use more precise types for SparkContext Optional fields (i.e. _gateway, _jvm) > -- > > Key: SPARK-37620 > URL: https://issues.apache.org/jira/browse/SPARK-37620 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > As a part of SPARK-37152 [we > agreed|https://github.com/apache/spark/pull/34466/files#r762609181] to keep > these typed as not {{Optional}} for simplicity, but this is just a temporary > solution to move things forward, until we decide on the best approach to handle > such cases. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37620) Use more precise types for SparkContext Optional fields (i.e. _gateway, _jvm)
Maciej Szymkiewicz created SPARK-37620: -- Summary: Use more precise types for SparkContext Optional fields (i.e. _gateway, _jvm) Key: SPARK-37620 URL: https://issues.apache.org/jira/browse/SPARK-37620 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Maciej Szymkiewicz As a part of SPARK-37152 [we agreed|https://github.com/apache/spark/pull/34466/files#r762609181] to keep these typed as not {{Optional}} for simplicity, but this is just a temporary solution to move things forward, until we decide on the best approach to handle such cases. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37152) Inline type hints for python/pyspark/context.py
[ https://issues.apache.org/jira/browse/SPARK-37152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-37152. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34466 [https://github.com/apache/spark/pull/34466] > Inline type hints for python/pyspark/context.py > --- > > Key: SPARK-37152 > URL: https://issues.apache.org/jira/browse/SPARK-37152 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Assignee: Byron Hsu >Priority: Major > Fix For: 3.3.0 >
[jira] [Assigned] (SPARK-37152) Inline type hints for python/pyspark/context.py
[ https://issues.apache.org/jira/browse/SPARK-37152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-37152: -- Assignee: Byron Hsu > Inline type hints for python/pyspark/context.py > --- > > Key: SPARK-37152 > URL: https://issues.apache.org/jira/browse/SPARK-37152 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Assignee: Byron Hsu >Priority: Major >
[jira] [Assigned] (SPARK-37619) Upgrade Maven to 3.8.4
[ https://issues.apache.org/jira/browse/SPARK-37619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37619: Assignee: (was: Apache Spark) > Upgrade Maven to 3.8.4 > -- > > Key: SPARK-37619 > URL: https://issues.apache.org/jira/browse/SPARK-37619 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Commented] (SPARK-37619) Upgrade Maven to 3.8.4
[ https://issues.apache.org/jira/browse/SPARK-37619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458019#comment-17458019 ] Apache Spark commented on SPARK-37619: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/34873 > Upgrade Maven to 3.8.4 > -- > > Key: SPARK-37619 > URL: https://issues.apache.org/jira/browse/SPARK-37619 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Assigned] (SPARK-37619) Upgrade Maven to 3.8.4
[ https://issues.apache.org/jira/browse/SPARK-37619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37619: Assignee: Apache Spark > Upgrade Maven to 3.8.4 > -- > > Key: SPARK-37619 > URL: https://issues.apache.org/jira/browse/SPARK-37619 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-37619) Upgrade Maven to 3.8.4
Dongjoon Hyun created SPARK-37619: - Summary: Upgrade Maven to 3.8.4 Key: SPARK-37619 URL: https://issues.apache.org/jira/browse/SPARK-37619 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.3.0 Reporter: Dongjoon Hyun
[jira] [Created] (SPARK-37618) Support cleaning up shuffle blocks from external shuffle service
Adam Binford created SPARK-37618: Summary: Support cleaning up shuffle blocks from external shuffle service Key: SPARK-37618 URL: https://issues.apache.org/jira/browse/SPARK-37618 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 3.2.0 Reporter: Adam Binford Currently, shuffle data is not cleaned up when an external shuffle service is used and the associated executor has been deallocated before the shuffle is cleaned up; shuffle data is only removed once the application ends. There have been various issues filed for this: https://issues.apache.org/jira/browse/SPARK-26020 https://issues.apache.org/jira/browse/SPARK-17233 https://issues.apache.org/jira/browse/SPARK-4236 But shuffle files will still stick around until an application completes. Dynamic allocation is commonly used for long-running jobs (such as structured streaming), so any long-running job with a large shuffle involved will eventually fill up local disk space. The shuffle service already supports cleaning up shuffle-service-persisted RDDs, so it should be able to support cleaning up shuffle blocks as well once the shuffle is removed by the ContextCleaner. The current alternative is to use shuffle tracking instead of an external shuffle service, but this is less optimal from a resource perspective, as all executors must be kept alive until the shuffle has been fully consumed and cleaned up (and with the default GC interval being 30 minutes, this can waste a lot of time with executors held onto but not doing anything).
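For context, the two setups contrasted above look roughly like this in {{spark-defaults.conf}} (a sketch; option names are from the Spark 3.x configuration documentation, and only the options relevant to this ticket are shown):

```properties
# External shuffle service: shuffle files outlive their executors,
# but today they are only removed when the application ends -- the
# behavior this issue proposes to improve.
spark.shuffle.service.enabled                    true
spark.dynamicAllocation.enabled                  true

# Current alternative: shuffle tracking keeps executors alive until
# their shuffle output is consumed, at the cost of idle resources.
spark.dynamicAllocation.shuffleTracking.enabled  true

# The periodic GC interval mentioned above (30 minutes by default).
spark.cleaner.periodicGC.interval                30min
```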
[jira] [Assigned] (SPARK-37617) In CTAS,replace alias to normal column that satisfy parquet schema
[ https://issues.apache.org/jira/browse/SPARK-37617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37617: Assignee: Apache Spark > In CTAS,replace alias to normal column that satisfy parquet schema > -- > > Key: SPARK-37617 > URL: https://issues.apache.org/jira/browse/SPARK-37617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiahong.li >Assignee: Apache Spark >Priority: Minor > Fix For: 3.2.0 > > > In CTAS, replace columns that lack an alias with properly named columns. Most often, columns without an alias come from operators such as sum or divide, whose generated names lead to schema check errors.
[jira] [Assigned] (SPARK-37617) In CTAS,replace alias to normal column that satisfy parquet schema
[ https://issues.apache.org/jira/browse/SPARK-37617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37617: Assignee: (was: Apache Spark) > In CTAS,replace alias to normal column that satisfy parquet schema > -- > > Key: SPARK-37617 > URL: https://issues.apache.org/jira/browse/SPARK-37617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiahong.li >Priority: Minor > Fix For: 3.2.0 > > > In CTAS, replace columns that lack an alias with properly named columns. Most often, columns without an alias come from operators such as sum or divide, whose generated names lead to schema check errors.
[jira] [Commented] (SPARK-37617) In CTAS,replace alias to normal column that satisfy parquet schema
[ https://issues.apache.org/jira/browse/SPARK-37617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457977#comment-17457977 ] Apache Spark commented on SPARK-37617: -- User 'monkeyboy123' has created a pull request for this issue: https://github.com/apache/spark/pull/34872 > In CTAS,replace alias to normal column that satisfy parquet schema > -- > > Key: SPARK-37617 > URL: https://issues.apache.org/jira/browse/SPARK-37617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiahong.li >Priority: Minor > Fix For: 3.2.0 > > > In CTAS, replace columns that lack an alias with properly named columns. Most often, columns without an alias come from operators such as sum or divide, whose generated names lead to schema check errors.
[jira] [Created] (SPARK-37617) In CTAS,replace alias to normal column that satisfy parquet schema
jiahong.li created SPARK-37617: -- Summary: In CTAS,replace alias to normal column that satisfy parquet schema Key: SPARK-37617 URL: https://issues.apache.org/jira/browse/SPARK-37617 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: jiahong.li Fix For: 3.2.0 In CTAS, replace columns that lack an alias with properly named columns. Most often, columns without an alias come from operators such as sum or divide, whose generated names lead to schema check errors.
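The schema-check failure described above comes from generated column names: without an explicit alias, an expression such as {{sum(a)}} keeps a name containing parentheses, which the Parquet writer rejects. A minimal sketch of that check (the exact character set is an assumption modeled on Spark's historical Parquet field-name validation, not a definitive spec):

```python
# Characters assumed to be rejected in Parquet field names; treat this
# set as illustrative rather than authoritative.
INVALID_CHARS = set(" ,;{}()\n\t=")


def is_valid_parquet_field_name(name: str) -> bool:
    """Return True if no character of `name` is in the rejected set."""
    return not any(ch in INVALID_CHARS for ch in name)


# An unaliased aggregate yields a generated name like "sum(a)", which
# fails the check; an explicit alias passes.
print(is_valid_parquet_field_name("sum(a)"))   # prints "False"
print(is_valid_parquet_field_name("total_a"))  # prints "True"
```

Rewriting the unaliased expression to a named column before the schema check, as the ticket proposes, sidesteps this failure.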
[jira] [Updated] (SPARK-37575) null values should be saved as nothing rather than quoted empty Strings "" by default settings
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guo Wei updated SPARK-37575: Summary: null values should be saved as nothing rather than quoted empty Strings "" by default settings (was: Null values are saved as quoted empty Strings "" rather than nothing) > null values should be saved as nothing rather than quoted empty Strings "" by > default settings > -- > > Key: SPARK-37575 > URL: https://issues.apache.org/jira/browse/SPARK-37575 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Guo Wei >Priority: Major > > As mentioned in the sql migration guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), > {noformat} > Since Spark 2.4, empty strings are saved as quoted empty strings "". In version 2.3 and earlier, empty strings are equal to null values and do not reflect to any characters in saved CSV files. For example, the row of "a", null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as a,,"",1. To restore the previous behavior, set the CSV option emptyValue to empty (not quoted) string.{noformat} > > But actually, both empty strings and null values are saved as quoted empty Strings "" rather than "" (for empty strings) and nothing (for null values). > code: > {code:java} > val data = List("spark", null, "").toDF("name") > data.coalesce(1).write.csv("spark_csv_test") > {code} > actual result: > {noformat} > line1: spark > line2: "" > line3: ""{noformat} > expected result: > {noformat} > line1: spark > line2: > line3: "" > {noformat}
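The expected behavior can be expressed as a tiny pure-Python serializer. This only mirrors the semantics of Spark's CSV {{nullValue}}/{{emptyValue}} writer options as the issue describes them; it is not Spark's implementation:

```python
def csv_field(value, null_value="", empty_value='""'):
    """Serialize one CSV field with the expected default behavior:
    None -> nothing, empty string -> quoted empty string.
    Parameter names mirror Spark's CSV nullValue/emptyValue options."""
    if value is None:
        return null_value
    if value == "":
        return empty_value
    return value


rows = ["spark", None, ""]
print("\n".join(csv_field(v) for v in rows))
```

Run against the ticket's example data, this prints {{spark}}, then a blank line, then {{""}}, matching the "expected result" above; the bug is that Spark emits {{""}} for both the null and the empty-string row.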
[jira] [Assigned] (SPARK-37616) Support pushing down a dynamic partition pruning from one join to other joins
[ https://issues.apache.org/jira/browse/SPARK-37616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37616: Assignee: (was: Apache Spark) > Support pushing down a dynamic partition pruning from one join to other joins > - > > Key: SPARK-37616 > URL: https://issues.apache.org/jira/browse/SPARK-37616 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.2.0 >Reporter: weixiuli >Priority: Major > > Support pushing down a dynamic partition pruning from one join to other joins
[jira] [Assigned] (SPARK-37616) Support pushing down a dynamic partition pruning from one join to other joins
[ https://issues.apache.org/jira/browse/SPARK-37616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37616: Assignee: Apache Spark > Support pushing down a dynamic partition pruning from one join to other joins > - > > Key: SPARK-37616 > URL: https://issues.apache.org/jira/browse/SPARK-37616 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.2.0 >Reporter: weixiuli >Assignee: Apache Spark >Priority: Major > > Support pushing down a dynamic partition pruning from one join to other joins
[jira] [Commented] (SPARK-37616) Support pushing down a dynamic partition pruning from one join to other joins
[ https://issues.apache.org/jira/browse/SPARK-37616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457928#comment-17457928 ] Apache Spark commented on SPARK-37616: -- User 'weixiuli' has created a pull request for this issue: https://github.com/apache/spark/pull/34871 > Support pushing down a dynamic partition pruning from one join to other joins > - > > Key: SPARK-37616 > URL: https://issues.apache.org/jira/browse/SPARK-37616 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.2.0 >Reporter: weixiuli >Priority: Major > > Support pushing down a dynamic partition pruning from one join to other joins
[jira] [Created] (SPARK-37616) Support pushing down a dynamic partition pruning from one join to other joins
weixiuli created SPARK-37616: Summary: Support pushing down a dynamic partition pruning from one join to other joins Key: SPARK-37616 URL: https://issues.apache.org/jira/browse/SPARK-37616 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0, 3.1.2, 3.1.1 Reporter: weixiuli Support pushing down a dynamic partition pruning from one join to other joins