[jira] [Assigned] (SPARK-34567) CreateTableAsSelect should have metrics update too
[ https://issues.apache.org/jira/browse/SPARK-34567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34567:

    Assignee: (was: Apache Spark)

> CreateTableAsSelect should have metrics update too
> --------------------------------------------------
>
> Key: SPARK-34567
> URL: https://issues.apache.org/jira/browse/SPARK-34567
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: angerszhu
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34567) CreateTableAsSelect should have metrics update too
[ https://issues.apache.org/jira/browse/SPARK-34567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34567:

    Assignee: Apache Spark
[jira] [Commented] (SPARK-34567) CreateTableAsSelect should have metrics update too
[ https://issues.apache.org/jira/browse/SPARK-34567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292325#comment-17292325 ]

Apache Spark commented on SPARK-34567:
--------------------------------------

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/31679
[jira] [Created] (SPARK-34567) CreateTableAsSelect should have metrics update too
angerszhu created SPARK-34567:
---------------------------------

             Summary: CreateTableAsSelect should have metrics update too
                 Key: SPARK-34567
                 URL: https://issues.apache.org/jira/browse/SPARK-34567
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.1
            Reporter: angerszhu
[jira] [Commented] (SPARK-34567) CreateTableAsSelect should have metrics update too
[ https://issues.apache.org/jira/browse/SPARK-34567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292301#comment-17292301 ]

angerszhu commented on SPARK-34567:
-----------------------------------

raise a pr soon
[jira] [Updated] (SPARK-34543) Respect case sensitivity in V1 ALTER TABLE .. SET LOCATION
[ https://issues.apache.org/jira/browse/SPARK-34543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-34543:
----------------------------------
    Fix Version/s: 3.0.3

> Respect case sensitivity in V1 ALTER TABLE .. SET LOCATION
> ----------------------------------------------------------
>
> Key: SPARK-34543
> URL: https://issues.apache.org/jira/browse/SPARK-34543
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.7, 3.0.2, 3.1.1
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
> V1 ALTER TABLE .. SET LOCATION is case sensitive and doesn't respect the SQL config *spark.sql.caseSensitive*, which is false by default. For instance:
> {code:sql}
> spark-sql> CREATE TABLE tbl (id INT, part INT) PARTITIONED BY (part);
> spark-sql> INSERT INTO tbl PARTITION (part=0) SELECT 0;
> spark-sql> SHOW TABLE EXTENDED LIKE 'tbl' PARTITION (part=0);
> Location: file:/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0
> spark-sql> ALTER TABLE tbl ADD PARTITION (part=1);
> spark-sql> SELECT * FROM tbl;
> 0	0
> {code}
> Create a new partition folder in the file system:
> {code}
> $ cp -r /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/aaa
> {code}
> Set the new location for the partition part=1:
> {code:sql}
> spark-sql> ALTER TABLE tbl PARTITION (part=1) SET LOCATION '/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/aaa';
> spark-sql> SELECT * FROM tbl;
> 0	0
> 0	1
> spark-sql> ALTER TABLE tbl ADD PARTITION (PART=2);
> spark-sql> SELECT * FROM tbl;
> 0	0
> 0	1
> {code}
> Set the location for a partition whose spec is in the upper case:
> {code}
> $ cp -r /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/bbb
> {code}
> {code:sql}
> spark-sql> ALTER TABLE tbl PARTITION (PART=2) SET LOCATION '/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/bbb';
> Error in query: Partition spec is invalid. The spec (PART) must match the partition spec (part) defined in table '`default`.`tbl`'
> {code}
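The expected behavior is partition-spec resolution that honors the case-sensitivity flag. A plain-Python sketch of that resolution logic (this is an illustration of the semantics, not Spark's actual code; the function name and error text are hypothetical, modeled on the error above):

```python
def resolve_partition_spec(spec, table_partition_cols, case_sensitive=False):
    """Map each key in a user-supplied partition spec to the matching table
    partition column, comparing case-sensitively only when the flag (the
    analogue of spark.sql.caseSensitive) is enabled."""
    resolved = {}
    for key, value in spec.items():
        if case_sensitive:
            matches = [c for c in table_partition_cols if c == key]
        else:
            matches = [c for c in table_partition_cols if c.lower() == key.lower()]
        if not matches:
            raise ValueError(
                f"Partition spec ({key}) must match the partition spec "
                f"({', '.join(table_partition_cols)}) defined in the table")
        resolved[matches[0]] = value
    return resolved

# Under the default case-insensitive mode, PART=2 resolves to column `part`:
print(resolve_partition_spec({"PART": "2"}, ["part"]))  # {'part': '2'}
```

With `case_sensitive=True`, the same call raises, which mirrors the "Partition spec is invalid" error reported in the issue.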
[jira] [Assigned] (SPARK-34566) Fix typo error of `spark.launcher.childConectionTimeout`
[ https://issues.apache.org/jira/browse/SPARK-34566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34566:

    Assignee: Apache Spark

> Fix typo error of `spark.launcher.childConectionTimeout`
> --------------------------------------------------------
>
> Key: SPARK-34566
> URL: https://issues.apache.org/jira/browse/SPARK-34566
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.0, 3.1.1
> Reporter: angerszhu
> Assignee: Apache Spark
> Priority: Major
[jira] [Commented] (SPARK-34566) Fix typo error of `spark.launcher.childConectionTimeout`
[ https://issues.apache.org/jira/browse/SPARK-34566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292290#comment-17292290 ]

Apache Spark commented on SPARK-34566:
--------------------------------------

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/31678
[jira] [Assigned] (SPARK-34566) Fix typo error of `spark.launcher.childConectionTimeout`
[ https://issues.apache.org/jira/browse/SPARK-34566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34566:

    Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-34566) Fix typo error of `spark.launcher.childConectionTimeout`
angerszhu created SPARK-34566:
---------------------------------

             Summary: Fix typo error of `spark.launcher.childConectionTimeout`
                 Key: SPARK-34566
                 URL: https://issues.apache.org/jira/browse/SPARK-34566
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.0, 3.1.1
            Reporter: angerszhu
[jira] [Commented] (SPARK-34562) Leverage parquet bloom filters
[ https://issues.apache.org/jira/browse/SPARK-34562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292256#comment-17292256 ]

Yuming Wang commented on SPARK-34562:
-------------------------------------

[~h-vetinari] This is an example of building a bloom filter: https://issues.apache.org/jira/browse/PARQUET-41?focusedCommentId=17276854&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17276854

> Leverage parquet bloom filters
> ------------------------------
>
> Key: SPARK-34562
> URL: https://issues.apache.org/jira/browse/SPARK-34562
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: H. Vetinari
> Priority: Major
>
> The currently in-progress SPARK-34542 brings in parquet 1.12, which contains PARQUET-41. From searching the issues, it seems there is no current tracker for this, though I found a [comment|https://issues.apache.org/jira/browse/SPARK-20901?focusedCommentId=17052473&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17052473] from [~dongjoon] that points out the missing parquet support up until now.
[jira] [Updated] (SPARK-34562) Leverage parquet bloom filters
[ https://issues.apache.org/jira/browse/SPARK-34562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-34562:
--------------------------------
    Component/s: (was: Input/Output)
                 SQL
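The appeal of PARQUET-41 is that a reader can ask "might this value be in this row group?" and skip row groups where the answer is definitely no. A toy Bloom filter in plain Python illustrates the mechanism (this is only a conceptual sketch; parquet-mr actually uses a split-block Bloom filter with xxHash, not SHA-256):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per value over an m-bit bitset.
    No false negatives; a small, tunable false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # the bitset, kept as a Python int

    def _positions(self, value):
        # Derive k deterministic bit positions from the value.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all(self.bits & (1 << pos) for pos in self._positions(value))

bf = BloomFilter()
for v in ["spark", "parquet", "bloom"]:
    bf.add(v)

print(bf.might_contain("parquet"))  # True: added values are never missed
print(bf.might_contain("missing"))  # almost certainly False (tiny false-positive chance)
```

In the Parquet case the filter is stored per column chunk, so an equality predicate that returns "definitely not present" lets the scan skip that chunk entirely.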
[jira] [Commented] (SPARK-34565) Collapse Window nodes with Project between them
[ https://issues.apache.org/jira/browse/SPARK-34565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292251#comment-17292251 ]

Apache Spark commented on SPARK-34565:
--------------------------------------

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31677
[jira] [Assigned] (SPARK-34565) Collapse Window nodes with Project between them
[ https://issues.apache.org/jira/browse/SPARK-34565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34565:

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-34565) Collapse Window nodes with Project between them
[ https://issues.apache.org/jira/browse/SPARK-34565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34565:

    Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-34565) Collapse Window nodes with Project between them
Tanel Kiis created SPARK-34565:
----------------------------------

             Summary: Collapse Window nodes with Project between them
                 Key: SPARK-34565
                 URL: https://issues.apache.org/jira/browse/SPARK-34565
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Tanel Kiis

The CollapseWindow optimizer rule can be improved to also collapse Window nodes that have a Project between them. Such Window - Project - Window chains occur when chaining dataframe.withColumn calls.
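The payoff of collapsing is that adjacent window computations sharing the same partitioning can be evaluated in one pass instead of one pass per withColumn. A plain-Python simplification (each "pass" here is a whole-group aggregate standing in for a Spark window over a shared partition spec; the helper names are illustrative, not Spark APIs):

```python
rows = [{"x": 1}, {"x": 2}, {"x": 3}]

def with_column(rows, name, fn):
    """One 'window pass': compute fn over the whole partition and attach
    the result as a new column on every row."""
    value = fn(rows)
    return [dict(r, **{name: value}) for r in rows]

# Chained passes, as produced by chained df.withColumn calls
# (Window - Project - Window in the logical plan):
two_pass = with_column(
    with_column(rows, "sum_x", lambda rs: sum(r["x"] for r in rs)),
    "max_x", lambda rs: max(r["x"] for r in rs))

def with_columns(rows, cols):
    """Collapsed form: all aggregates over the same partition in one pass."""
    values = {name: fn(rows) for name, fn in cols.items()}
    return [dict(r, **values) for r in rows]

one_pass = with_columns(rows, {
    "sum_x": lambda rs: sum(r["x"] for r in rs),
    "max_x": lambda rs: max(r["x"] for r in rs)})

# Same result, half the passes over the data.
assert one_pass == two_pass
```

The rule's job is to prove the intermediate Project only renames or forwards columns, so the two Window nodes can legally be merged like this.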
[jira] [Commented] (SPARK-34542) Upgrade Parquet to 1.12.0
[ https://issues.apache.org/jira/browse/SPARK-34542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292239#comment-17292239 ]

H. Vetinari commented on SPARK-34542:
-------------------------------------

Would be amazing if this could make it into 3.2, given all the features in parquet 1.12 (e.g. bloom filters & encryption).

> Upgrade Parquet to 1.12.0
> -------------------------
>
> Key: SPARK-34542
> URL: https://issues.apache.org/jira/browse/SPARK-34542
> Project: Spark
> Issue Type: Improvement
> Components: Build, SQL
> Affects Versions: 3.2.0
> Reporter: Yuming Wang
> Priority: Major
>
> Parquet-1.12.0 release notes:
> https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0-rc2/CHANGES.md
[jira] [Updated] (SPARK-34564) DateTimeUtils.fromJavaDate fails for very late dates during casting to Int
[ https://issues.apache.org/jira/browse/SPARK-34564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kondziolka9ld updated SPARK-34564:
----------------------------------
    Description:

Please consider the following scenario on *spark-3.0.1*:
{code:java}
scala> List(("some date", new Date(Int.MaxValue)), ("some corner case date", new Date(Long.MaxValue))).toDF
java.lang.RuntimeException: Error while encoding: java.lang.ArithmeticException: integer overflow
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#0
staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, DateType, fromJavaDate, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true, false) AS _2#1
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:215)
  at org.apache.spark.sql.SparkSession.$anonfun$createDataset$1(SparkSession.scala:466)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.immutable.List.map(List.scala:298)
  at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:466)
  at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:353)
  at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:231)
  ... 51 elided
Caused by: java.lang.ArithmeticException: integer overflow
  at java.lang.Math.toIntExact(Math.java:1011)
  at org.apache.spark.sql.catalyst.util.DateTimeUtils$.fromJavaDate(DateTimeUtils.scala:111)
  at org.apache.spark.sql.catalyst.util.DateTimeUtils.fromJavaDate(DateTimeUtils.scala)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:211)
  ... 60 more
{code}
In opposition to *spark-2.4.7*, where it is possible to create a dataframe with such values:
{code:java}
scala> val df = List(("some date", new Date(Int.MaxValue)), ("some corner case date", new Date(Long.MaxValue))).toDF
df: org.apache.spark.sql.DataFrame = [_1: string, _2: date]

scala> df.show
+--------------------+-------------+
|                  _1|           _2|
+--------------------+-------------+
|           some date|   1970-01-25|
|some corner case ...|1701498-03-18|
+--------------------+-------------+
{code}
Anyway, I am aware of the fact that when collecting these data I will get another result:
{code:java}
scala> df.collect
res10: Array[org.apache.spark.sql.Row] = Array([some date,1970-01-25], [some corner case date,?498-03-18])
{code}
which seems natural because of the behaviour of *java.sql.Date*:
{code:java}
scala> new java.sql.Date(Long.MaxValue)
res1: java.sql.Date = ?994-08-17
{code}
For easier reproduction, please consider:
{code:java}
scala> org.apache.spark.sql.catalyst.util.DateTimeUtils.fromJavaDate(new java.sql.Date(Long.MaxValue))
java.lang.ArithmeticException: integer overflow
  at java.lang.Math.toIntExact(Math.java:1011)
  at org.apache.spark.sql.catalyst.util.DateTimeUtils$.fromJavaDate(DateTimeUtils.scala:111)
  ... 47 elided
{code}
However, the question is: even if such late dates are not supported, could it fail in a more gentle way?
[jira] [Created] (SPARK-34564) DateTimeUtils.fromJavaDate fails for very late dates during casting to Int
kondziolka9ld created SPARK-34564:
-------------------------------------

             Summary: DateTimeUtils.fromJavaDate fails for very late dates during casting to Int
                 Key: SPARK-34564
                 URL: https://issues.apache.org/jira/browse/SPARK-34564
             Project: Spark
          Issue Type: Question
          Components: SQL
    Affects Versions: 3.0.1
            Reporter: kondziolka9ld
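The stack trace pins the failure to Math.toIntExact inside fromJavaDate: the Date's epoch milliseconds are converted to a day count and then narrowed to an Int. A simplified Python sketch of that arithmetic (Spark's actual fromJavaDate also performs Julian/Gregorian calendar rebasing, which is omitted here):

```python
MILLIS_PER_DAY = 86_400_000
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # java.lang.Integer range
LONG_MAX = 2**63 - 1                  # java.lang.Long.MAX_VALUE

def to_int_exact(value):
    """Mimic java.lang.Math.toIntExact: raise on values outside Int range."""
    if not (INT_MIN <= value <= INT_MAX):
        raise ArithmeticError("integer overflow")
    return value

# new Date(Int.MaxValue) is only ~24.8 days of milliseconds: day 24 fits.
print(to_int_exact(INT_MAX // MILLIS_PER_DAY))  # 24, i.e. 1970-01-25

# new Date(Long.MaxValue) is ~1.07e11 days, far beyond the Int range.
days = LONG_MAX // MILLIS_PER_DAY
print(days > INT_MAX)                           # True
try:
    to_int_exact(days)
except ArithmeticError as e:
    print(e)                                    # integer overflow
```

This also explains the version difference reported above: Spark 2.4 stored the day count without the exact-narrowing check, so extreme dates wrapped or rendered oddly instead of throwing.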
[jira] [Updated] (SPARK-34563) Checkpointing a union with another checkpoint fails
[ https://issues.apache.org/jira/browse/SPARK-34563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Kamprath updated SPARK-34563:
-------------------------------------
    Description:

I have some PySpark code that periodically checkpoints a data frame that I am building in pieces by union-ing those pieces together as they are constructed. (Py)Spark fails on the second checkpoint, which would be a union of a new piece of the desired data frame with a previously checkpointed piece. Some simplified PySpark code that will trigger this problem:
{code:java}
RANGE_STEP = 1
PARTITIONS = 5
COUNT_UNIONS = 20

df = spark.range(1, RANGE_STEP+1, numPartitions=PARTITIONS)
for i in range(1, COUNT_UNIONS+1):
    print('Processing i = {0}'.format(i))
    new_df = spark.range(RANGE_STEP*i + 1, RANGE_STEP*(i+1) + 1, numPartitions=PARTITIONS)
    df = df.union(new_df).checkpoint()
df.count()
{code}
When this code gets to the checkpoint on the second loop iteration (i=2), the job fails with an error:
{code:java}
Py4JJavaError: An error occurred while calling o119.checkpoint.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 10.0 failed 4 times, most recent failure: Lost task 9.3 in stage 10.0 (TID 264, 10.20.30.13, executor 0): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9062 at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:296) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804) at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227) at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2154) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2135) at org.apache.spark.SparkContext.runJob(SparkContext.sc
[jira] [Created] (SPARK-34563) Checkpointing a union with another checkpoint fails
Michael Kamprath created SPARK-34563:
--
Summary: Checkpointing a union with another checkpoint fails
Key: SPARK-34563
URL: https://issues.apache.org/jira/browse/SPARK-34563
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.0.2
Environment: I am running Spark 3.0.2 in standalone cluster mode, built for Hadoop 2.7 and Scala 2.12.12. I am using QFS 2.2.2 (Quantcast File System) as the underlying DFS. The nodes run on Debian Stretch, and Java is openjdk version "1.8.0_275".
Reporter: Michael Kamprath

I have some PySpark code that periodically checkpoints a data frame that I am building in pieces by union-ing those pieces together as they are constructed. (Py)Spark fails on the second checkpoint, which would be a union of a new piece of the desired data frame with a previously checkpointed piece. Some simplified PySpark code that will trigger this problem is:
{code:java}
RANGE_STEP = 1
PARTITIONS = 5
COUNT_UNIONS = 20

df = spark.range(1, RANGE_STEP+1, numPartitions=PARTITIONS)
for i in range(1, COUNT_UNIONS+1):
    print('Processing i = {0}'.format(i))
    new_df = spark.range(RANGE_STEP*i + 1, RANGE_STEP*(i+1) + 1, numPartitions=PARTITIONS)
    df = df.union(new_df).checkpoint()
    df.count()
{code}
When this code gets to the checkpoint on the second loop iteration (i=2) the job fails with an error:
{code:java}
Py4JJavaError: An error occurred while calling o119.checkpoint.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 10.0 failed 4 times, most recent failure: Lost task 9.3 in stage 10.0 (TID 264, 10.20.30.13, executor 0): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9062 at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:296) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804) at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227) at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2154) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188) at org.ap
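Setting the Kryo failure aside, the arithmetic of the repro is easy to check. The following is a pure-Python rendering of the same ranges (a model of the data only, not of Spark), showing what `df.count()` should reach if every checkpoint succeeded:

```python
# Mirror the ranges built by the PySpark repro above, without Spark.
RANGE_STEP = 1
PARTITIONS = 5          # irrelevant to the row count, kept for parity
COUNT_UNIONS = 20

rows = list(range(1, RANGE_STEP + 1))
for i in range(1, COUNT_UNIONS + 1):
    rows += range(RANGE_STEP * i + 1, RANGE_STEP * (i + 1) + 1)

# Each iteration appends RANGE_STEP fresh IDs, so the final frame
# should hold one unbroken run of RANGE_STEP * (COUNT_UNIONS + 1) IDs.
expected_count = RANGE_STEP * (COUNT_UNIONS + 1)
```

With the reported settings that is 21 rows, IDs 1 through 21; the job never gets there because the second checkpoint aborts.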
[jira] [Assigned] (SPARK-34479) Add zstandard codec to spark.sql.avro.compression.codec
[ https://issues.apache.org/jira/browse/SPARK-34479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34479:
-
Assignee: Yuming Wang

> Add zstandard codec to spark.sql.avro.compression.codec
> ---
>
> Key: SPARK-34479
> URL: https://issues.apache.org/jira/browse/SPARK-34479
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
>
> Avro has supported the zstandard codec since AVRO-2195.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34479) Add zstandard codec to spark.sql.avro.compression.codec
[ https://issues.apache.org/jira/browse/SPARK-34479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34479.
---
Fix Version/s: 3.2.0
Resolution: Fixed

Issue resolved by pull request 31673
[https://github.com/apache/spark/pull/31673]

> Add zstandard codec to spark.sql.avro.compression.codec
> ---
>
> Key: SPARK-34479
> URL: https://issues.apache.org/jira/browse/SPARK-34479
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
> Fix For: 3.2.0
>
> Avro has supported the zstandard codec since AVRO-2195.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
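After this change the config gains one more legal value. As a hedged sketch of the option's value space, the codec names below are assumptions drawn from the ticket title plus the common Avro codecs; consult the Spark 3.2 documentation for the authoritative list:

```python
# Hedged sketch of a validator for spark.sql.avro.compression.codec.
# The set of names is an assumption (uncompressed, deflate, snappy,
# bzip2, xz, plus the newly added zstandard), not taken from Spark's code.
SUPPORTED_AVRO_CODECS = {"uncompressed", "deflate", "snappy", "bzip2", "xz", "zstandard"}

def validate_codec(name: str) -> str:
    codec = name.lower()
    if codec not in SUPPORTED_AVRO_CODECS:
        raise ValueError(f"unsupported avro codec: {name}")
    return codec
```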
[jira] [Updated] (SPARK-34562) Leverage parquet bloom filters
[ https://issues.apache.org/jira/browse/SPARK-34562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] H. Vetinari updated SPARK-34562: Description: The currently in-progress SPARK-34542 brings in parquet 1.12, which contains PARQUET-41. >From searching the issues, it seems there is no current tracker for this, >though I found a >[comment|https://issues.apache.org/jira/browse/SPARK-20901?focusedCommentId=17052473&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17052473] > from [~dongjoon] that points out the missing parquet support up until now. was: The currently in-progress SPARK-34542 brings in parquet 1.12, which contains PARQUET-41. >From searching the issues, it seems there is no current tracker for this, >though I found a comment from [~dongjoon] that points out the missing parquet >support up until now. > Leverage parquet bloom filters > -- > > Key: SPARK-34562 > URL: https://issues.apache.org/jira/browse/SPARK-34562 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.2.0 >Reporter: H. Vetinari >Priority: Major > > The currently in-progress SPARK-34542 brings in parquet 1.12, which contains > PARQUET-41. > From searching the issues, it seems there is no current tracker for this, > though I found a > [comment|https://issues.apache.org/jira/browse/SPARK-20901?focusedCommentId=17052473&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17052473] > from [~dongjoon] that points out the missing parquet support up until now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34562) Leverage parquet bloom filters
[ https://issues.apache.org/jira/browse/SPARK-34562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] H. Vetinari updated SPARK-34562: Issue Type: Improvement (was: Task) > Leverage parquet bloom filters > -- > > Key: SPARK-34562 > URL: https://issues.apache.org/jira/browse/SPARK-34562 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.2.0 >Reporter: H. Vetinari >Priority: Major > > The currently in-progress SPARK-34542 brings in parquet 1.12, which contains > PARQUET-41. > From searching the issues, it seems there is no current tracker for this, > though I found a comment from [~dongjoon] that points out the missing parquet > support up until now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34562) Leverage parquet bloom filters
H. Vetinari created SPARK-34562: --- Summary: Leverage parquet bloom filters Key: SPARK-34562 URL: https://issues.apache.org/jira/browse/SPARK-34562 Project: Spark Issue Type: Task Components: Input/Output Affects Versions: 3.2.0 Reporter: H. Vetinari The currently in-progress SPARK-34542 brings in parquet 1.12, which contains PARQUET-41. >From searching the issues, it seems there is no current tracker for this, >though I found a comment from [~dongjoon] that points out the missing parquet >support up until now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34561) Cannot drop/add columns from/to a dataset of v2 `DESCRIBE TABLE`
[ https://issues.apache.org/jira/browse/SPARK-34561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292192#comment-17292192 ] Apache Spark commented on SPARK-34561: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31676 > Cannot drop/add columns from/to a dataset of v2 `DESCRIBE TABLE` > > > Key: SPARK-34561 > URL: https://issues.apache.org/jira/browse/SPARK-34561 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Dropping a column from a dataset of v2 `DESCRIBE TABLE` fails with: > {code:java} > Resolved attribute(s) col_name#102,data_type#103 missing from > col_name#29,data_type#30,comment#31 in operator !Project [col_name#102, > data_type#103]. Attribute(s) with the same name appear in the operation: > col_name,data_type. Please check if the right attribute(s) are used.; > !Project [col_name#102, data_type#103] > +- LocalRelation [col_name#29, data_type#30, comment#31]{code} > The code below demonstrates the issue: > {code:java} > val tbl = s"${catalogAndNamespace}tbl" > withTable(tbl) { > sql(s"CREATE TABLE $tbl (c0 INT) USING $v2Format") > val description = sql(s"DESCRIBE TABLE $tbl") > val noComment = description.drop("comment") > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34561) Cannot drop/add columns from/to a dataset of v2 `DESCRIBE TABLE`
[ https://issues.apache.org/jira/browse/SPARK-34561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34561: Assignee: Apache Spark > Cannot drop/add columns from/to a dataset of v2 `DESCRIBE TABLE` > > > Key: SPARK-34561 > URL: https://issues.apache.org/jira/browse/SPARK-34561 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Dropping a column from a dataset of v2 `DESCRIBE TABLE` fails with: > {code:java} > Resolved attribute(s) col_name#102,data_type#103 missing from > col_name#29,data_type#30,comment#31 in operator !Project [col_name#102, > data_type#103]. Attribute(s) with the same name appear in the operation: > col_name,data_type. Please check if the right attribute(s) are used.; > !Project [col_name#102, data_type#103] > +- LocalRelation [col_name#29, data_type#30, comment#31]{code} > The code below demonstrates the issue: > {code:java} > val tbl = s"${catalogAndNamespace}tbl" > withTable(tbl) { > sql(s"CREATE TABLE $tbl (c0 INT) USING $v2Format") > val description = sql(s"DESCRIBE TABLE $tbl") > val noComment = description.drop("comment") > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34561) Cannot drop/add columns from/to a dataset of v2 `DESCRIBE TABLE`
[ https://issues.apache.org/jira/browse/SPARK-34561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34561: Assignee: (was: Apache Spark) > Cannot drop/add columns from/to a dataset of v2 `DESCRIBE TABLE` > > > Key: SPARK-34561 > URL: https://issues.apache.org/jira/browse/SPARK-34561 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Dropping a column from a dataset of v2 `DESCRIBE TABLE` fails with: > {code:java} > Resolved attribute(s) col_name#102,data_type#103 missing from > col_name#29,data_type#30,comment#31 in operator !Project [col_name#102, > data_type#103]. Attribute(s) with the same name appear in the operation: > col_name,data_type. Please check if the right attribute(s) are used.; > !Project [col_name#102, data_type#103] > +- LocalRelation [col_name#29, data_type#30, comment#31]{code} > The code below demonstrates the issue: > {code:java} > val tbl = s"${catalogAndNamespace}tbl" > withTable(tbl) { > sql(s"CREATE TABLE $tbl (c0 INT) USING $v2Format") > val description = sql(s"DESCRIBE TABLE $tbl") > val noComment = description.drop("comment") > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34561) Cannot drop/add columns from/to a dataset of v2 `DESCRIBE TABLE`
Maxim Gekk created SPARK-34561:
--
Summary: Cannot drop/add columns from/to a dataset of v2 `DESCRIBE TABLE`
Key: SPARK-34561
URL: https://issues.apache.org/jira/browse/SPARK-34561
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk

Dropping a column from a dataset of v2 `DESCRIBE TABLE` fails with:
{code:java}
Resolved attribute(s) col_name#102,data_type#103 missing from col_name#29,data_type#30,comment#31 in operator !Project [col_name#102, data_type#103]. Attribute(s) with the same name appear in the operation: col_name,data_type. Please check if the right attribute(s) are used.;
!Project [col_name#102, data_type#103]
+- LocalRelation [col_name#29, data_type#30, comment#31]{code}
The code below demonstrates the issue:
{code:scala}
val tbl = s"${catalogAndNamespace}tbl"
withTable(tbl) {
  sql(s"CREATE TABLE $tbl (c0 INT) USING $v2Format")
  val description = sql(s"DESCRIBE TABLE $tbl")
  val noComment = description.drop("comment")
}
{code}
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
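The error is about resolution by expression ID rather than by name. The toy model below is invented for illustration (Catalyst's real data structures differ): the child `LocalRelation` exposes attribute IDs 29-31 while the `Project` references fresh IDs 102-103, so resolution fails even though columns with the same names exist.

```python
# Toy model of Catalyst attribute resolution by expression ID.
# The LocalRelation's output carries IDs 29, 30, 31, but the Project
# references IDs 102 and 103; same names, different IDs, so the
# references cannot be resolved against the child's output.
relation_output = {29: "col_name", 30: "data_type", 31: "comment"}
project_refs = {102: "col_name", 103: "data_type"}

missing = sorted(i for i in project_refs if i not in relation_output)
same_names = sorted(set(project_refs.values()) & set(relation_output.values()))
```

`missing` reproduces the "Resolved attribute(s) ... missing" part of the message, and `same_names` the "Attribute(s) with the same name appear in the operation" part.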
[jira] [Resolved] (SPARK-34415) Use randomization as a possibly better technique than grid search in optimizing hyperparameters
[ https://issues.apache.org/jira/browse/SPARK-34415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34415.
--
Fix Version/s: 3.2.0
Resolution: Fixed

Issue resolved by pull request 31535
[https://github.com/apache/spark/pull/31535]

> Use randomization as a possibly better technique than grid search in optimizing hyperparameters
> ---
>
> Key: SPARK-34415
> URL: https://issues.apache.org/jira/browse/SPARK-34415
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Affects Versions: 3.0.1
> Reporter: Phillip Henry
> Assignee: Phillip Henry
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.2.0
>
> Randomization can be a more effective technique than a grid search in finding optimal hyperparameters, since min/max points can fall between the grid lines and never be found. Randomization is not so restricted, although the probability of finding minima/maxima depends on the number of attempts.
> Alice Zheng has an accessible description of how this technique works at [https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html]
> (Note that I have a PR for this work outstanding at [https://github.com/apache/spark/pull/31535] )

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34415) Use randomization as a possibly better technique than grid search in optimizing hyperparameters
[ https://issues.apache.org/jira/browse/SPARK-34415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-34415:
-
Assignee: Phillip Henry

> Use randomization as a possibly better technique than grid search in optimizing hyperparameters
> ---
>
> Key: SPARK-34415
> URL: https://issues.apache.org/jira/browse/SPARK-34415
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Affects Versions: 3.0.1
> Reporter: Phillip Henry
> Assignee: Phillip Henry
> Priority: Minor
> Labels: pull-request-available
>
> Randomization can be a more effective technique than a grid search in finding optimal hyperparameters, since min/max points can fall between the grid lines and never be found. Randomization is not so restricted, although the probability of finding minima/maxima depends on the number of attempts.
> Alice Zheng has an accessible description of how this technique works at [https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html]
> (Note that I have a PR for this work outstanding at [https://github.com/apache/spark/pull/31535] )

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
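The argument in the description can be made concrete with a toy objective whose optimum sits between grid points. This is a hedged sketch: the objective, its peak at x = 0.37, and the 0.1-spaced grid are all invented for illustration, not taken from the Spark patch.

```python
import random

# Toy objective with its peak at x = 0.37, a point no 0.1-spaced grid hits.
def score(x: float) -> float:
    return -(x - 0.37) ** 2

grid = [i / 10 for i in range(11)]   # grid search: 0.0, 0.1, ..., 1.0
best_grid = max(grid, key=score)     # can do no better than x = 0.4

random.seed(0)
draws = [random.random() for _ in range(100)]
best_random = max(draws, key=score)  # may land arbitrarily close to 0.37
```

The grid's best candidate is pinned 0.03 away from the optimum no matter how the search is repeated, while random draws can approach it with probability depending only on the number of attempts, which is exactly the trade-off the ticket describes.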
[jira] [Updated] (SPARK-34392) Invalid ID for offset-based ZoneId since Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-34392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-34392: - Fix Version/s: 3.1.2 > Invalid ID for offset-based ZoneId since Spark 3.0 > -- > > Key: SPARK-34392 > URL: https://issues.apache.org/jira/browse/SPARK-34392 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Yuming Wang >Assignee: karl wang >Priority: Major > Fix For: 3.2.0, 3.1.2 > > > How to reproduce this issue: > {code:sql} > select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00"); > {code} > Spark 2.4: > {noformat} > spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00"); > 2020-02-07 08:00:00 > Time taken: 0.089 seconds, Fetched 1 row(s) > {noformat} > Spark 3.x: > {noformat} > spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00"); > 21/02/07 01:24:32 ERROR SparkSQLDriver: Failed in [select > to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00")] > java.time.DateTimeException: Invalid ID for offset-based ZoneId: GMT+8:00 > at java.time.ZoneId.ofWithPrefix(ZoneId.java:437) > at java.time.ZoneId.of(ZoneId.java:407) > at java.time.ZoneId.of(ZoneId.java:359) > at java.time.ZoneId.of(ZoneId.java:315) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.getZoneId(DateTimeUtils.scala:53) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.toUTCTime(DateTimeUtils.scala:814) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
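The regression comes from Spark 3.x resolving zones through `java.time.ZoneId`, whose offset grammar is stricter than the legacy `java.util.TimeZone`. The sketch below is a toy approximation of `java.time.ZoneOffset`'s documented forms, not the real parser:

```python
import re

# Toy grammar (an approximation of java.time.ZoneOffset's documented
# forms, not the real parser): after "GMT" an offset must look like
# +h, +hh, +hh:mm, +hhmm, +hh:mm:ss or +hhmmss. The legacy
# java.util.TimeZone also accepted a single-digit hour with minutes,
# such as GMT+8:00, which is why the query worked on Spark 2.4.
OFFSET = re.compile(r"[+-](\d{2}:\d{2}:\d{2}|\d{2}:\d{2}|\d{6}|\d{4}|\d{2}|\d)")

def valid_zone_id(zone: str) -> bool:
    if zone.startswith("GMT"):
        rest = zone[3:]
        return rest == "" or OFFSET.fullmatch(rest) is not None
    return True  # named zones are resolved elsewhere
```

Under this model "GMT+8:00" is rejected (single-digit hour with minutes) while "GMT+08:00" passes, matching the observed behavior.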
[jira] [Assigned] (SPARK-34560) Cannot join datasets of SHOW TABLES
[ https://issues.apache.org/jira/browse/SPARK-34560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34560:
-
Assignee: (was: Apache Spark)

> Cannot join datasets of SHOW TABLES
> ---
>
> Key: SPARK-34560
> URL: https://issues.apache.org/jira/browse/SPARK-34560
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Maxim Gekk
> Priority: Major
>
> The example demonstrates the issue:
> {code:scala}
> scala> sql("CREATE NAMESPACE ns1")
> res8: org.apache.spark.sql.DataFrame = []
> scala> sql("CREATE NAMESPACE ns2")
> res9: org.apache.spark.sql.DataFrame = []
> scala> sql("CREATE TABLE ns1.tbl1 (c INT)")
> res10: org.apache.spark.sql.DataFrame = []
> scala> sql("CREATE TABLE ns2.tbl2 (c INT)")
> res11: org.apache.spark.sql.DataFrame = []
> scala> val show1 = sql("SHOW TABLES IN ns1")
> show1: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]
> scala> val show2 = sql("SHOW TABLES IN ns2")
> show2: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]
> scala> show1.show
> +---------+---------+-----------+
> |namespace|tableName|isTemporary|
> +---------+---------+-----------+
> |      ns1|     tbl1|      false|
> +---------+---------+-----------+
> scala> show2.show
> +---------+---------+-----------+
> |namespace|tableName|isTemporary|
> +---------+---------+-----------+
> |      ns2|     tbl2|      false|
> +---------+---------+-----------+
> scala> show1.join(show2).where(show1("tableName") =!= show2("tableName")).show
> org.apache.spark.sql.AnalysisException: Column tableName#17 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
>   at org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34560) Cannot join datasets of SHOW TABLES
[ https://issues.apache.org/jira/browse/SPARK-34560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292128#comment-17292128 ] Apache Spark commented on SPARK-34560:
--
User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31675

> Cannot join datasets of SHOW TABLES
> ---
>
> Key: SPARK-34560
> URL: https://issues.apache.org/jira/browse/SPARK-34560
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Maxim Gekk
> Priority: Major
>
> The example demonstrates the issue:
> {code:scala}
> scala> sql("CREATE NAMESPACE ns1")
> res8: org.apache.spark.sql.DataFrame = []
> scala> sql("CREATE NAMESPACE ns2")
> res9: org.apache.spark.sql.DataFrame = []
> scala> sql("CREATE TABLE ns1.tbl1 (c INT)")
> res10: org.apache.spark.sql.DataFrame = []
> scala> sql("CREATE TABLE ns2.tbl2 (c INT)")
> res11: org.apache.spark.sql.DataFrame = []
> scala> val show1 = sql("SHOW TABLES IN ns1")
> show1: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]
> scala> val show2 = sql("SHOW TABLES IN ns2")
> show2: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]
> scala> show1.show
> +---------+---------+-----------+
> |namespace|tableName|isTemporary|
> +---------+---------+-----------+
> |      ns1|     tbl1|      false|
> +---------+---------+-----------+
> scala> show2.show
> +---------+---------+-----------+
> |namespace|tableName|isTemporary|
> +---------+---------+-----------+
> |      ns2|     tbl2|      false|
> +---------+---------+-----------+
> scala> show1.join(show2).where(show1("tableName") =!= show2("tableName")).show
> org.apache.spark.sql.AnalysisException: Column tableName#17 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
>   at org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34560) Cannot join datasets of SHOW TABLES
[ https://issues.apache.org/jira/browse/SPARK-34560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34560:
------------------------------------

Assignee: Apache Spark

> Cannot join datasets of SHOW TABLES
> -----------------------------------
>
> Key: SPARK-34560
> URL: https://issues.apache.org/jira/browse/SPARK-34560
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Maxim Gekk
> Assignee: Apache Spark
> Priority: Major
>
> The example portrays the issue:
> {code:scala}
> scala> sql("CREATE NAMESPACE ns1")
> res8: org.apache.spark.sql.DataFrame = []
> scala> sql("CREATE NAMESPACE ns2")
> res9: org.apache.spark.sql.DataFrame = []
> scala> sql("CREATE TABLE ns1.tbl1 (c INT)")
> res10: org.apache.spark.sql.DataFrame = []
> scala> sql("CREATE TABLE ns2.tbl2 (c INT)")
> res11: org.apache.spark.sql.DataFrame = []
> scala> val show1 = sql("SHOW TABLES IN ns1")
> show1: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]
> scala> val show2 = sql("SHOW TABLES IN ns2")
> show2: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]
> scala> show1.show
> +---------+---------+-----------+
> |namespace|tableName|isTemporary|
> +---------+---------+-----------+
> |      ns1|     tbl1|      false|
> +---------+---------+-----------+
> scala> show2.show
> +---------+---------+-----------+
> |namespace|tableName|isTemporary|
> +---------+---------+-----------+
> |      ns2|     tbl2|      false|
> +---------+---------+-----------+
> scala> show1.join(show2).where(show1("tableName") =!= show2("tableName")).show
> org.apache.spark.sql.AnalysisException: Column tableName#17 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
>   at org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
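As the exception message itself suggests, one workaround is to alias the two datasets before joining and refer to columns by qualified name. A minimal spark-shell sketch, reusing the names from the example above (this is an illustration of the suggested workaround, not code from the ticket):

```scala
// Workaround sketch: give each side of the join its own alias so the
// analyzer can tell the two tableName columns apart.
val show1 = sql("SHOW TABLES IN ns1").as("a")
val show2 = sql("SHOW TABLES IN ns2").as("b")

// Qualified references are unambiguous even though both datasets have
// the same SHOW TABLES schema.
show1.join(show2).where($"a.tableName" =!= $"b.tableName").show()
```

In spark-shell the `$` interpolator is already in scope; in an application, `import spark.implicits._` first. Alternatively, `spark.sql.analyzer.failAmbiguousSelfJoin` can be set to `false` to disable the check entirely, as the error message notes.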
[jira] [Created] (SPARK-34560) Cannot join datasets of SHOW TABLES
Maxim Gekk created SPARK-34560:
----------------------------------

Summary: Cannot join datasets of SHOW TABLES
Key: SPARK-34560
URL: https://issues.apache.org/jira/browse/SPARK-34560
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk

The example portrays the issue:
{code:scala}
scala> sql("CREATE NAMESPACE ns1")
res8: org.apache.spark.sql.DataFrame = []
scala> sql("CREATE NAMESPACE ns2")
res9: org.apache.spark.sql.DataFrame = []
scala> sql("CREATE TABLE ns1.tbl1 (c INT)")
res10: org.apache.spark.sql.DataFrame = []
scala> sql("CREATE TABLE ns2.tbl2 (c INT)")
res11: org.apache.spark.sql.DataFrame = []
scala> val show1 = sql("SHOW TABLES IN ns1")
show1: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]
scala> val show2 = sql("SHOW TABLES IN ns2")
show2: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]
scala> show1.show
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|      ns1|     tbl1|      false|
+---------+---------+-----------+
scala> show2.show
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|      ns2|     tbl2|      false|
+---------+---------+-----------+
scala> show1.join(show2).where(show1("tableName") =!= show2("tableName")).show
org.apache.spark.sql.AnalysisException: Column tableName#17 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
  at org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
{code}
[jira] [Resolved] (SPARK-34559) Upgrade to ZSTD JNI 1.4.8-6
[ https://issues.apache.org/jira/browse/SPARK-34559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34559. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31674 [https://github.com/apache/spark/pull/31674] > Upgrade to ZSTD JNI 1.4.8-6 > --- > > Key: SPARK-34559 > URL: https://issues.apache.org/jira/browse/SPARK-34559 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34559) Upgrade to ZSTD JNI 1.4.8-6
[ https://issues.apache.org/jira/browse/SPARK-34559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34559: - Assignee: Dongjoon Hyun > Upgrade to ZSTD JNI 1.4.8-6 > --- > > Key: SPARK-34559 > URL: https://issues.apache.org/jira/browse/SPARK-34559 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34447) Refactor the unified v1 and v2 command tests
[ https://issues.apache.org/jira/browse/SPARK-34447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-34447:
-------------------------------

Description:
The ticket aims to gather potential improvements for the unified tests:
1. Remove SharedSparkSession from *ParserSuite
2. Rename tests like AlterTableAddPartitionSuite -> AddPartitionsSuite
3. Add JIRA ID SPARK-33829 to "SPARK-33786: Cache's storage level should be respected when a table name is altered"
4. Reset default namespace in ShowTablesSuiteBase."change current catalog and namespace with USE statements" using spark.sessionState.catalogManager.reset()

was:
The ticket aims to gather potential improvements for the unified tests:
1. Remove SharedSparkSession from *ParserSuite
2. Rename tests like AlterTableAddPartitionSuite -> AddPartitionsSuite
3. Add JIRA ID SPARK-33829 to "SPARK-33786: Cache's storage level should be respected when a table name is altered"

> Refactor the unified v1 and v2 command tests
> --------------------------------------------
>
> Key: SPARK-34447
> URL: https://issues.apache.org/jira/browse/SPARK-34447
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Maxim Gekk
> Priority: Minor
>
> The ticket aims to gather potential improvements for the unified tests:
> 1. Remove SharedSparkSession from *ParserSuite
> 2. Rename tests like AlterTableAddPartitionSuite -> AddPartitionsSuite
> 3. Add JIRA ID SPARK-33829 to "SPARK-33786: Cache's storage level should be respected when a table name is altered"
> 4. Reset default namespace in ShowTablesSuiteBase."change current catalog and namespace with USE statements" using spark.sessionState.catalogManager.reset()
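The namespace reset in item 4 could be sketched as a test fragment like the following. This is hypothetical code: it assumes the unified suite's helpers (`test`, `sql`, `spark`) are in scope, and `catalogManager.reset()` is the internal API named in the ticket description:

```scala
test("change current catalog and namespace with USE statements") {
  try {
    sql("CREATE NAMESPACE IF NOT EXISTS ns")
    sql("USE ns")
    // ... assertions that depend on the current namespace go here ...
  } finally {
    // Restore the session's default catalog and namespace so the changed
    // state does not leak into tests that run afterwards.
    spark.sessionState.catalogManager.reset()
  }
}
```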
[jira] [Updated] (SPARK-34559) Upgrade to ZSTD JNI 1.4.8-6
[ https://issues.apache.org/jira/browse/SPARK-34559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34559: -- Summary: Upgrade to ZSTD JNI 1.4.8-6 (was: Upgrade to ZSTD JNI 1.4.6) > Upgrade to ZSTD JNI 1.4.8-6 > --- > > Key: SPARK-34559 > URL: https://issues.apache.org/jira/browse/SPARK-34559 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34559) Upgrade to ZSTD JNI 1.4.6
[ https://issues.apache.org/jira/browse/SPARK-34559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34559: Assignee: Apache Spark > Upgrade to ZSTD JNI 1.4.6 > - > > Key: SPARK-34559 > URL: https://issues.apache.org/jira/browse/SPARK-34559 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34559) Upgrade to ZSTD JNI 1.4.6
[ https://issues.apache.org/jira/browse/SPARK-34559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292076#comment-17292076 ] Apache Spark commented on SPARK-34559: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/31674 > Upgrade to ZSTD JNI 1.4.6 > - > > Key: SPARK-34559 > URL: https://issues.apache.org/jira/browse/SPARK-34559 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34559) Upgrade to ZSTD JNI 1.4.6
[ https://issues.apache.org/jira/browse/SPARK-34559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34559: Assignee: (was: Apache Spark) > Upgrade to ZSTD JNI 1.4.6 > - > > Key: SPARK-34559 > URL: https://issues.apache.org/jira/browse/SPARK-34559 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34559) Upgrade to ZSTD JNI 1.4.6
Dongjoon Hyun created SPARK-34559: - Summary: Upgrade to ZSTD JNI 1.4.6 Key: SPARK-34559 URL: https://issues.apache.org/jira/browse/SPARK-34559 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34557) Exclude Avro's transitive zstd-jni dependency
[ https://issues.apache.org/jira/browse/SPARK-34557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34557: - Assignee: Dongjoon Hyun > Exclude Avro's transitive zstd-jni dependency > - > > Key: SPARK-34557 > URL: https://issues.apache.org/jira/browse/SPARK-34557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34557) Exclude Avro's transitive zstd-jni dependency
[ https://issues.apache.org/jira/browse/SPARK-34557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34557. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31670 [https://github.com/apache/spark/pull/31670] > Exclude Avro's transitive zstd-jni dependency > - > > Key: SPARK-34557 > URL: https://issues.apache.org/jira/browse/SPARK-34557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
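An exclusion of a transitive dependency like the one described in SPARK-34557 is expressed in the Maven build definition. A hedged illustration of the shape such an exclusion takes in a pom (the version property and placement here are illustrative, not copied from Spark's actual pom.xml):

```xml
<!-- Illustrative fragment: depend on Avro while excluding its transitive
     zstd-jni, so the zstd-jni version pinned by the project itself is the
     only one on the classpath. -->
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
  <exclusions>
    <exclusion>
      <groupId>com.github.luben</groupId>
      <artifactId>zstd-jni</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```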