[jira] [Assigned] (SPARK-35722) wait until something does get queued
    [ https://issues.apache.org/jira/browse/SPARK-35722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35722:
------------------------------------

    Assignee:     (was: Apache Spark)

> wait until something does get queued
> ------------------------------------
>
>                 Key: SPARK-35722
>                 URL: https://issues.apache.org/jira/browse/SPARK-35722
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: yikf
>            Priority: Minor
>             Fix For: 3.2.0
>
> If nothing has been added to the queue, we should wait until something does
> get queued instead of looping after a timeout. This prevents an ineffective
> cycle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35722) wait until something does get queued
    [ https://issues.apache.org/jira/browse/SPARK-35722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35722:
------------------------------------

    Assignee: Apache Spark

> wait until something does get queued
> ------------------------------------
>
>                 Key: SPARK-35722
>                 URL: https://issues.apache.org/jira/browse/SPARK-35722
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: yikf
>            Assignee: Apache Spark
>            Priority: Minor
>             Fix For: 3.2.0
>
> If nothing has been added to the queue, we should wait until something does
> get queued instead of looping after a timeout. This prevents an ineffective
> cycle.
[jira] [Commented] (SPARK-35722) wait until something does get queued
    [ https://issues.apache.org/jira/browse/SPARK-35722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361416#comment-17361416 ]

Apache Spark commented on SPARK-35722:
--------------------------------------

User 'yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/32876

> wait until something does get queued
> ------------------------------------
>
>                 Key: SPARK-35722
>                 URL: https://issues.apache.org/jira/browse/SPARK-35722
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: yikf
>            Priority: Minor
>             Fix For: 3.2.0
>
> If nothing has been added to the queue, we should wait until something does
> get queued instead of looping after a timeout. This prevents an ineffective
> cycle.
[jira] [Resolved] (SPARK-35718) Support casting of Date to timestamp without time zone type
    [ https://issues.apache.org/jira/browse/SPARK-35718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-35718.
---------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 32873
[https://github.com/apache/spark/pull/32873]

> Support casting of Date to timestamp without time zone type
> -----------------------------------------------------------
>
>                 Key: SPARK-35718
>                 URL: https://issues.apache.org/jira/browse/SPARK-35718
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 3.2.0
>
> Extend the Cast expression and support DateType in casting to
> TimestampWithoutTZType
[jira] [Resolved] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths
    [ https://issues.apache.org/jira/browse/SPARK-35640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai resolved SPARK-35640.
-----------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 32777
[https://github.com/apache/spark/pull/32777]

> Refactor Parquet vectorized reader to remove duplicated code paths
> ------------------------------------------------------------------
>
>                 Key: SPARK-35640
>                 URL: https://issues.apache.org/jira/browse/SPARK-35640
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Priority: Major
>             Fix For: 3.2.0
>
> Currently in the Parquet vectorized code path there are many code
> duplications, such as the following:
> {code:java}
> public void readIntegers(
>     int total,
>     WritableColumnVector c,
>     int rowId,
>     int level,
>     VectorizedValuesReader data) throws IOException {
>   int left = total;
>   while (left > 0) {
>     if (this.currentCount == 0) this.readNextGroup();
>     int n = Math.min(left, this.currentCount);
>     switch (mode) {
>       case RLE:
>         if (currentValue == level) {
>           data.readIntegers(n, c, rowId);
>         } else {
>           c.putNulls(rowId, n);
>         }
>         break;
>       case PACKED:
>         for (int i = 0; i < n; ++i) {
>           if (currentBuffer[currentBufferIdx++] == level) {
>             c.putInt(rowId + i, data.readInteger());
>           } else {
>             c.putNull(rowId + i);
>           }
>         }
>         break;
>     }
>     rowId += n;
>     left -= n;
>     currentCount -= n;
>   }
> }
> {code}
> This makes the code hard to maintain, as any change here needs to be
> replicated in 20+ places. The issue becomes more serious when we implement
> column index and complex type support for the vectorized path.
> The original intention was performance. However, nowadays JIT compilers tend
> to be smart about this and will inline virtual calls as much as possible.
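To make the deduplication idea concrete: the RLE/PACKED control flow from the snippet quoted in this message can be written once and parameterized by small callbacks for the type-specific reads and null writes. This is only a sketch under assumed names (`RunReader`, `ValueReader`, `NullWriter`, and the `-1` null marker are all illustrative; the actual refactor in the linked PR may look different):

```java
import java.util.Arrays;

public class GenericRleLoop {
    enum Mode { RLE, PACKED }

    interface RunReader  { void read(int n, int rowId); } // bulk read, e.g. readIntegers(n, c, rowId)
    interface ValueReader { void read(int rowId); }       // single read, e.g. c.putInt(rowId, data.readInteger())
    interface NullWriter  { void write(int rowId); }      // e.g. c.putNull(rowId)

    // The skeleton shared by all the duplicated readXxx methods: process one
    // run of `total` values starting at `rowId`, either a whole RLE run or a
    // packed group with per-value definition levels.
    static void readBatch(int total, int rowId, Mode mode, boolean runIsDefined,
                          boolean[] packedDefined, RunReader readRun,
                          ValueReader readOne, NullWriter putNull) {
        switch (mode) {
            case RLE:
                // A whole run either matches the definition level or is null.
                if (runIsDefined) readRun.read(total, rowId);
                else for (int i = 0; i < total; i++) putNull.write(rowId + i);
                break;
            case PACKED:
                // Per-value decision, exactly as in the duplicated loops.
                for (int i = 0; i < total; i++) {
                    if (packedDefined[i]) readOne.read(rowId + i);
                    else putNull.write(rowId + i);
                }
                break;
        }
    }

    public static void main(String[] args) {
        int[] column = new int[4];
        int[] values = {7, 8};   // stand-in for the encoded value stream
        int[] cursor = {0};
        readBatch(4, 0, Mode.PACKED, false, new boolean[]{true, false, true, false},
            (n, r) -> { for (int i = 0; i < n; i++) column[r + i] = values[cursor[0]++]; },
            r -> column[r] = values[cursor[0]++],
            r -> column[r] = -1);  // -1 stands in for a null marker
        System.out.println(Arrays.toString(column));
    }
}
```

With this shape, adding a new type means supplying three lambdas instead of copying the whole loop, which is the maintenance win the issue describes.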
[jira] [Updated] (SPARK-35722) wait until something does get queued
    [ https://issues.apache.org/jira/browse/SPARK-35722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yikf updated SPARK-35722:
-------------------------
    Summary: wait until something does get queued  (was: wait until until something does get queued)

> wait until something does get queued
> ------------------------------------
>
>                 Key: SPARK-35722
>                 URL: https://issues.apache.org/jira/browse/SPARK-35722
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: yikf
>            Priority: Minor
>             Fix For: 3.2.0
>
> If nothing has been added to the queue, we should wait until something does
> get queued instead of looping after a timeout. This prevents an ineffective
> cycle.
[jira] [Assigned] (SPARK-35721) Path level discover for python unittests
    [ https://issues.apache.org/jira/browse/SPARK-35721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35721:
------------------------------------

    Assignee: Apache Spark

> Path level discover for python unittests
> ----------------------------------------
>
>                 Key: SPARK-35721
>                 URL: https://issues.apache.org/jira/browse/SPARK-35721
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 3.2.0
>            Reporter: Yikun Jiang
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently we need to specify Python test cases manually when we add a new
> test case. Sometimes we forget to add the test case to the module list, and
> then the test case is never executed.
> For example:
> * pyspark-core pyspark.tests.test_pin_thread
> Thus we need an auto-discovery mechanism to find all test cases rather than
> specifying every case manually.
[jira] [Created] (SPARK-35722) wait until until something does get queued
yikf created SPARK-35722:
----------------------------
             Summary: wait until until something does get queued
                 Key: SPARK-35722
                 URL: https://issues.apache.org/jira/browse/SPARK-35722
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: yikf
             Fix For: 3.2.0


If nothing has been added to the queue, we should wait until something does get queued instead of looping after a timeout. This prevents an ineffective cycle.
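The improvement proposed here swaps a timed polling loop for a blocking wait. A minimal JVM sketch of the two patterns, using `java.util.concurrent.LinkedBlockingQueue` (the class and method names below are illustrative, not Spark's actual listener-bus code):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueWait {
    // Before: poll with a timeout and loop while nothing is queued. Each
    // expired timeout is a wasted wakeup -- the "ineffective cycle" from the
    // issue description.
    static <E> E pollUntilAvailable(BlockingQueue<E> queue, long timeoutMs)
            throws InterruptedException {
        E element;
        while ((element = queue.poll(timeoutMs, TimeUnit.MILLISECONDS)) == null) {
            // nothing was enqueued during the timeout; loop and wait again
        }
        return element;
    }

    // After: block until something does get queued. The thread parks and is
    // woken only when an element actually arrives, so there are no spurious
    // timeout-driven iterations.
    static <E> E waitUntilQueued(BlockingQueue<E> queue) throws InterruptedException {
        return queue.take();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("event");
        System.out.println(waitUntilQueued(queue));
    }
}
```

Both methods return the same element when the queue is non-empty; the difference only shows up while the queue is empty, where `take()` parks the consumer instead of repeatedly timing out.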
[jira] [Assigned] (SPARK-35721) Path level discover for python unittests
    [ https://issues.apache.org/jira/browse/SPARK-35721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35721:
------------------------------------

    Assignee:     (was: Apache Spark)

> Path level discover for python unittests
> ----------------------------------------
>
>                 Key: SPARK-35721
>                 URL: https://issues.apache.org/jira/browse/SPARK-35721
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 3.2.0
>            Reporter: Yikun Jiang
>            Priority: Major
>
> Currently we need to specify Python test cases manually when we add a new
> test case. Sometimes we forget to add the test case to the module list, and
> then the test case is never executed.
> For example:
> * pyspark-core pyspark.tests.test_pin_thread
> Thus we need an auto-discovery mechanism to find all test cases rather than
> specifying every case manually.
[jira] [Commented] (SPARK-35721) Path level discover for python unittests
    [ https://issues.apache.org/jira/browse/SPARK-35721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361397#comment-17361397 ]

Apache Spark commented on SPARK-35721:
--------------------------------------

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32867

> Path level discover for python unittests
> ----------------------------------------
>
>                 Key: SPARK-35721
>                 URL: https://issues.apache.org/jira/browse/SPARK-35721
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 3.2.0
>            Reporter: Yikun Jiang
>            Priority: Major
>
> Currently we need to specify Python test cases manually when we add a new
> test case. Sometimes we forget to add the test case to the module list, and
> then the test case is never executed.
> For example:
> * pyspark-core pyspark.tests.test_pin_thread
> Thus we need an auto-discovery mechanism to find all test cases rather than
> specifying every case manually.
[jira] [Created] (SPARK-35721) Path level discover for python unittests
Yikun Jiang created SPARK-35721:
-----------------------------------
             Summary: Path level discover for python unittests
                 Key: SPARK-35721
                 URL: https://issues.apache.org/jira/browse/SPARK-35721
             Project: Spark
          Issue Type: Bug
          Components: Tests
    Affects Versions: 3.2.0
            Reporter: Yikun Jiang


Currently we need to specify Python test cases manually when we add a new test case. Sometimes we forget to add the test case to the module list, and then the test case is never executed.

For example:
* pyspark-core pyspark.tests.test_pin_thread

Thus we need an auto-discovery mechanism to find all test cases rather than specifying every case manually.
[jira] [Resolved] (SPARK-34512) Disable validate default values when parsing Avro schemas
    [ https://issues.apache.org/jira/browse/SPARK-34512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-34512.
-----------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 32750
[https://github.com/apache/spark/pull/32750]

> Disable validate default values when parsing Avro schemas
> ---------------------------------------------------------
>
>                 Key: SPARK-34512
>                 URL: https://issues.apache.org/jira/browse/SPARK-34512
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.2.0
>
> This is a regression. How to reproduce this issue:
> {code:scala}
> // Add this test to HiveSerDeReadWriteSuite
> test("SPARK-34512") {
>   withTable("t1") {
>     hiveClient.runSqlHive(
>       """
>         |CREATE TABLE t1
>         |  ROW FORMAT SERDE
>         |  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>         |  STORED AS INPUTFORMAT
>         |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>         |  OUTPUTFORMAT
>         |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>         |  TBLPROPERTIES (
>         |    'avro.schema.literal'='{
>         |      "namespace": "org.apache.spark.sql.hive.test",
>         |      "name": "schema_with_default_value",
>         |      "type": "record",
>         |      "fields": [
>         |        {
>         |          "name": "ARRAY_WITH_DEFAULT",
>         |          "type": {"type": "array", "items": "string"},
>         |          "default": null
>         |        }
>         |      ]
>         |    }')
>         |""".stripMargin)
>     spark.sql("select * from t1").show
>   }
> }
> {code}
> {noformat}
> org.apache.avro.AvroTypeException: Invalid default for field
> ARRAY_WITH_DEFAULT: null not a {"type":"array","items":"string"}
>         at org.apache.avro.Schema.validateDefault(Schema.java:1571)
>         at org.apache.avro.Schema.access$500(Schema.java:87)
>         at org.apache.avro.Schema$Field.<init>(Schema.java:544)
>         at org.apache.avro.Schema.parse(Schema.java:1678)
>         at org.apache.avro.Schema$Parser.parse(Schema.java:1425)
>         at org.apache.avro.Schema$Parser.parse(Schema.java:1413)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFor(AvroSerdeUtils.java:268)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:111)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:187)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:107)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83)
>         at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)
>         at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450)
>         at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437)
>         at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)
>         at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263)
>         at org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:641)
>         at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:624)
>         at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:831)
>         at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:867)
>         at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
>         at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
>         at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>         at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>         at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>         at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:820)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
>         at
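For reference, the knob involved is on Avro's own schema parser: since Avro 1.9, `org.apache.avro.Schema.Parser` exposes `setValidateDefaults(boolean)`, and parsing the same schema literal with default validation off does not throw. A minimal sketch of that Avro API (this illustrates the library behavior, not the exact change Spark made in the linked PR; it requires avro on the classpath):

```java
import org.apache.avro.Schema;

public class ParseWithoutDefaultValidation {
    public static void main(String[] args) {
        // The same shape of schema as in the reproducer: an array field with
        // "default": null, which strict validation rejects.
        String literal =
            "{\"namespace\": \"org.apache.spark.sql.hive.test\"," +
            " \"name\": \"schema_with_default_value\"," +
            " \"type\": \"record\"," +
            " \"fields\": [{" +
            "   \"name\": \"ARRAY_WITH_DEFAULT\"," +
            "   \"type\": {\"type\": \"array\", \"items\": \"string\"}," +
            "   \"default\": null}]}";
        // With validation on (the default in Avro >= 1.9) this parse throws
        // AvroTypeException, as in the stack trace above. Turning default
        // validation off lets the legacy schema parse.
        Schema schema = new Schema.Parser().setValidateDefaults(false).parse(literal);
        System.out.println(schema.getField("ARRAY_WITH_DEFAULT").schema().getType());
    }
}
```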
[jira] [Assigned] (SPARK-35703) Remove HashClusteredDistribution
    [ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35703:
------------------------------------

    Assignee: Apache Spark

> Remove HashClusteredDistribution
> --------------------------------
>
>                 Key: SPARK-35703
>                 URL: https://issues.apache.org/jira/browse/SPARK-35703
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and
> {{ClusteredDistribution}}. The only difference between the two is that the
> former is stricter when deciding whether a bucket join is allowed to avoid a
> shuffle: compared to the latter, it requires an *exact* match between the
> clustering keys from the output partitioning (i.e., {{HashPartitioning}})
> and the join keys. However, this is unnecessary, as we should be able to
> avoid the shuffle when the set of clustering keys is a subset of the join
> keys, just like {{ClusteredDistribution}}.
[jira] [Commented] (SPARK-35703) Remove HashClusteredDistribution
    [ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361369#comment-17361369 ]

Apache Spark commented on SPARK-35703:
--------------------------------------

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32875

> Remove HashClusteredDistribution
> --------------------------------
>
>                 Key: SPARK-35703
>                 URL: https://issues.apache.org/jira/browse/SPARK-35703
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and
> {{ClusteredDistribution}}. The only difference between the two is that the
> former is stricter when deciding whether a bucket join is allowed to avoid a
> shuffle: compared to the latter, it requires an *exact* match between the
> clustering keys from the output partitioning (i.e., {{HashPartitioning}})
> and the join keys. However, this is unnecessary, as we should be able to
> avoid the shuffle when the set of clustering keys is a subset of the join
> keys, just like {{ClusteredDistribution}}.
[jira] [Assigned] (SPARK-35703) Remove HashClusteredDistribution
    [ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35703:
------------------------------------

    Assignee:     (was: Apache Spark)

> Remove HashClusteredDistribution
> --------------------------------
>
>                 Key: SPARK-35703
>                 URL: https://issues.apache.org/jira/browse/SPARK-35703
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and
> {{ClusteredDistribution}}. The only difference between the two is that the
> former is stricter when deciding whether a bucket join is allowed to avoid a
> shuffle: compared to the latter, it requires an *exact* match between the
> clustering keys from the output partitioning (i.e., {{HashPartitioning}})
> and the join keys. However, this is unnecessary, as we should be able to
> avoid the shuffle when the set of clustering keys is a subset of the join
> keys, just like {{ClusteredDistribution}}.
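The relaxation proposed for SPARK-35703 can be stated as a one-line predicate change. A hypothetical illustration with plain Java sets standing in for Catalyst expressions (the class and method names are invented for this sketch):

```java
import java.util.HashSet;
import java.util.List;

public class ClusteringCompatibility {
    // HashClusteredDistribution-style check: the clustering keys must equal
    // the join keys exactly, so a table hash-partitioned on (a) cannot serve
    // a join on (a, b) without a shuffle.
    static boolean exactMatch(List<String> clusteringKeys, List<String> joinKeys) {
        return clusteringKeys.equals(joinKeys);
    }

    // ClusteredDistribution-style check: a subset of the join keys suffices,
    // since two rows that agree on (a, b) necessarily agree on a alone and
    // therefore land in the same hash partition.
    static boolean subsetMatch(List<String> clusteringKeys, List<String> joinKeys) {
        return !clusteringKeys.isEmpty()
            && new HashSet<>(joinKeys).containsAll(clusteringKeys);
    }

    public static void main(String[] args) {
        List<String> clustering = List.of("a");
        List<String> join = List.of("a", "b");
        System.out.println(exactMatch(clustering, join));   // false: shuffle forced
        System.out.println(subsetMatch(clustering, join));  // true: shuffle avoided
    }
}
```

The subset rule is safe for co-partitioned joins for the reason in the comment: equality on the full join key implies equality on any subset of it.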
[jira] [Assigned] (SPARK-34512) Disable validate default values when parsing Avro schemas
    [ https://issues.apache.org/jira/browse/SPARK-34512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-34512:
-------------------------------------

    Assignee: Yuming Wang

> Disable validate default values when parsing Avro schemas
> ---------------------------------------------------------
>
>                 Key: SPARK-34512
>                 URL: https://issues.apache.org/jira/browse/SPARK-34512
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>
> This is a regression. How to reproduce this issue:
> {code:scala}
> // Add this test to HiveSerDeReadWriteSuite
> test("SPARK-34512") {
>   withTable("t1") {
>     hiveClient.runSqlHive(
>       """
>         |CREATE TABLE t1
>         |  ROW FORMAT SERDE
>         |  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>         |  STORED AS INPUTFORMAT
>         |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>         |  OUTPUTFORMAT
>         |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>         |  TBLPROPERTIES (
>         |    'avro.schema.literal'='{
>         |      "namespace": "org.apache.spark.sql.hive.test",
>         |      "name": "schema_with_default_value",
>         |      "type": "record",
>         |      "fields": [
>         |        {
>         |          "name": "ARRAY_WITH_DEFAULT",
>         |          "type": {"type": "array", "items": "string"},
>         |          "default": null
>         |        }
>         |      ]
>         |    }')
>         |""".stripMargin)
>     spark.sql("select * from t1").show
>   }
> }
> {code}
> {noformat}
> org.apache.avro.AvroTypeException: Invalid default for field
> ARRAY_WITH_DEFAULT: null not a {"type":"array","items":"string"}
>         at org.apache.avro.Schema.validateDefault(Schema.java:1571)
>         at org.apache.avro.Schema.access$500(Schema.java:87)
>         at org.apache.avro.Schema$Field.<init>(Schema.java:544)
>         at org.apache.avro.Schema.parse(Schema.java:1678)
>         at org.apache.avro.Schema$Parser.parse(Schema.java:1425)
>         at org.apache.avro.Schema$Parser.parse(Schema.java:1413)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFor(AvroSerdeUtils.java:268)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:111)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:187)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:107)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83)
>         at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)
>         at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450)
>         at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437)
>         at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)
>         at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263)
>         at org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:641)
>         at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:624)
>         at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:831)
>         at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:867)
>         at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
>         at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
>         at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>         at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>         at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>         at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:820)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:800)
>         at
[jira] [Commented] (SPARK-35703) Remove HashClusteredDistribution
    [ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361371#comment-17361371 ]

Apache Spark commented on SPARK-35703:
--------------------------------------

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32875

> Remove HashClusteredDistribution
> --------------------------------
>
>                 Key: SPARK-35703
>                 URL: https://issues.apache.org/jira/browse/SPARK-35703
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and
> {{ClusteredDistribution}}. The only difference between the two is that the
> former is stricter when deciding whether a bucket join is allowed to avoid a
> shuffle: compared to the latter, it requires an *exact* match between the
> clustering keys from the output partitioning (i.e., {{HashPartitioning}})
> and the join keys. However, this is unnecessary, as we should be able to
> avoid the shuffle when the set of clustering keys is a subset of the join
> keys, just like {{ClusteredDistribution}}.
[jira] [Commented] (SPARK-35623) Volcano resource manager for Spark on Kubernetes
    [ https://issues.apache.org/jira/browse/SPARK-35623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361368#comment-17361368 ]

Kevin Su commented on SPARK-35623:
----------------------------------

[~dipanjanK] Thanks for proposing this feature. I'm interested in contributing to it.

> Volcano resource manager for Spark on Kubernetes
> ------------------------------------------------
>
>                 Key: SPARK-35623
>                 URL: https://issues.apache.org/jira/browse/SPARK-35623
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: Kubernetes
>    Affects Versions: 3.1.1, 3.1.2
>            Reporter: Dipanjan Kailthya
>            Priority: Minor
>              Labels: kubernetes, resourcemanager
>
> Dear Spark Developers,
>
> Hello from the Netherlands! Posting this here as I still haven't been
> accepted to post on the Spark dev mailing list.
>
> My team is planning to use Spark with Kubernetes support on our shared
> (multi-tenant) on-premise Kubernetes cluster. However, we would like certain
> scheduling features, such as fair share and preemption, which as we
> understand are not built into the current spark-kubernetes resource manager
> yet. We have been working on, and are close to, a first successful prototype
> integration with Volcano ([https://volcano.sh/en/docs/]). Briefly, this
> means a new resource manager component with much in common with the existing
> spark-kubernetes resource manager, but instead of pods it launches Volcano
> jobs, which delegate driver and executor pod creation and lifecycle
> management to Volcano. We are interested in contributing this to open
> source, either directly in Spark or as a separate project.
>
> So, two questions:
>
> 1. Do the Spark maintainers see this as a valuable contribution to the
> mainline Spark codebase? If so, can we have some guidance on how to publish
> the changes?
>
> 2. Are any other developers / organizations interested in contributing to
> this effort? If so, please get in touch.
>
> Best,
> Dipanjan
[jira] [Assigned] (SPARK-35699) Improve error message when creating k8s pod failed.
    [ https://issues.apache.org/jira/browse/SPARK-35699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35699:
------------------------------------

    Assignee:     (was: Apache Spark)

> Improve error message when creating k8s pod failed.
> ---------------------------------------------------
>
>                 Key: SPARK-35699
>                 URL: https://issues.apache.org/jira/browse/SPARK-35699
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.1.2
>            Reporter: Kevin Su
>            Priority: Minor
>
> If I use the wrong k8s master URL, I get the error message below. I think we
> could improve the error message, so end-users can better understand what
> went wrong.
> {code:java}
> (base) ➜  spark git:(master) ./bin/spark-submit \
>  --master k8s://https://192.168.49.3:8443 \
>  --name spark-pi \
>  --class org.apache.spark.examples.SparkPi \
>  --conf spark.executor.instances=3 \
>  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>  --conf spark.kubernetes.container.image=pingsutw/spark:testing \
>  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar
> 21/06/09 20:50:37 WARN Utils: Your hostname, kobe-pc resolves to a loopback address: 127.0.1.1; using 192.168.103.20 instead (on interface ens160)
> 21/06/09 20:50:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
> 21/06/09 20:50:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 21/06/09 20:50:38 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
> 21/06/09 20:50:39 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
> Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [default] failed.
>         at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>         at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
>         at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
>         at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
> {code}
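One common shape for the kind of improvement requested here is to catch the low-level client exception and rethrow it with actionable context. This is a hypothetical sketch only, not the code from the linked PR; the helper name and message text are invented:

```java
public class PodCreationError {
    // Wrap a low-level failure with the configuration the user can actually
    // check, instead of surfacing only
    // "Operation: [create] for kind: [Pod] with name: [null] ... failed."
    static RuntimeException withContext(Throwable cause, String masterUrl, String namespace) {
        return new RuntimeException(
            "Failed to create the driver pod in namespace '" + namespace + "'. "
            + "Check that the Kubernetes master URL '" + masterUrl
            + "' is correct and reachable, and that the configured service "
            + "account has permission to create pods.", cause);
    }

    public static void main(String[] args) {
        RuntimeException wrapped = withContext(
            new IllegalStateException("Operation: [create] for kind: [Pod] failed"),
            "k8s://https://192.168.49.3:8443", "default");
        System.out.println(wrapped.getMessage());
    }
}
```

Keeping the original exception as the cause preserves the full stack trace for debugging while the top-level message points the user at the likely misconfiguration.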
[jira] [Assigned] (SPARK-35699) Improve error message when creating k8s pod failed.
    [ https://issues.apache.org/jira/browse/SPARK-35699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35699:
------------------------------------

    Assignee: Apache Spark

> Improve error message when creating k8s pod failed.
> ---------------------------------------------------
>
>                 Key: SPARK-35699
>                 URL: https://issues.apache.org/jira/browse/SPARK-35699
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.1.2
>            Reporter: Kevin Su
>            Assignee: Apache Spark
>            Priority: Minor
>
> If I use the wrong k8s master URL, I get the error message below. I think we
> could improve the error message, so end-users can better understand what
> went wrong.
> {code:java}
> (base) ➜  spark git:(master) ./bin/spark-submit \
>  --master k8s://https://192.168.49.3:8443 \
>  --name spark-pi \
>  --class org.apache.spark.examples.SparkPi \
>  --conf spark.executor.instances=3 \
>  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>  --conf spark.kubernetes.container.image=pingsutw/spark:testing \
>  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar
> 21/06/09 20:50:37 WARN Utils: Your hostname, kobe-pc resolves to a loopback address: 127.0.1.1; using 192.168.103.20 instead (on interface ens160)
> 21/06/09 20:50:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
> 21/06/09 20:50:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 21/06/09 20:50:38 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
> 21/06/09 20:50:39 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
> Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [default] failed.
>         at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>         at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
>         at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
>         at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
> {code}
[jira] [Commented] (SPARK-35699) Improve error message when creating k8s pod failed.
[ https://issues.apache.org/jira/browse/SPARK-35699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361366#comment-17361366 ] Apache Spark commented on SPARK-35699: -- User 'pingsutw' has created a pull request for this issue: https://github.com/apache/spark/pull/32874 > Improve error message when creating k8s pod failed. > --- > > Key: SPARK-35699 > URL: https://issues.apache.org/jira/browse/SPARK-35699 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.2 >Reporter: Kevin Su >Priority: Minor > > > If I use wrong k8s master URL, I will get the below error message. > I think we could improve error message, so end-users can get a better > understanding of the message. > {code:java} > (base) ➜ spark git:(master) ./bin/spark-submit \ > --master k8s://https://192.168.49.3:8443 \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=3 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.container.image=pingsutw/spark:testing \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar > 21/06/09 20:50:37 WARN Utils: Your hostname, kobe-pc resolves to a loopback > address: 127.0.1.1; using 192.168.103.20 instead (on interface ens160) > 21/06/09 20:50:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 21/06/09 20:50:38 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 21/06/09 20:50:38 INFO SparkKubernetesClientFactory: Auto-configuring K8S > client using current context from users K8S config file > 21/06/09 20:50:39 INFO KerberosConfDriverFeatureStep: You have not specified > a krb5.conf file locally or via a ConfigMap. Make sure that you have the > krb5.conf locally on the driver image. 
> Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] > for kind: [Pod] with name: [null] in namespace: [default] failed. > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35699) Improve error message when creating k8s pod failed.
[ https://issues.apache.org/jira/browse/SPARK-35699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361363#comment-17361363 ] Apache Spark commented on SPARK-35699: -- User 'pingsutw' has created a pull request for this issue: https://github.com/apache/spark/pull/32874 > Improve error message when creating k8s pod failed. > --- > > Key: SPARK-35699 > URL: https://issues.apache.org/jira/browse/SPARK-35699 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.2 >Reporter: Kevin Su >Priority: Minor > > > If I use wrong k8s master URL, I will get the below error message. > I think we could improve error message, so end-users can get a better > understanding of the message. > {code:java} > (base) ➜ spark git:(master) ./bin/spark-submit \ > --master k8s://https://192.168.49.3:8443 \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=3 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.container.image=pingsutw/spark:testing \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar > 21/06/09 20:50:37 WARN Utils: Your hostname, kobe-pc resolves to a loopback > address: 127.0.1.1; using 192.168.103.20 instead (on interface ens160) > 21/06/09 20:50:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 21/06/09 20:50:38 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 21/06/09 20:50:38 INFO SparkKubernetesClientFactory: Auto-configuring K8S > client using current context from users K8S config file > 21/06/09 20:50:39 INFO KerberosConfDriverFeatureStep: You have not specified > a krb5.conf file locally or via a ConfigMap. Make sure that you have the > krb5.conf locally on the driver image. 
> Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] > for kind: [Pod] with name: [null] in namespace: [default] failed. > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
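The improvement requested in SPARK-35699 — wrapping an opaque low-level client failure with an actionable hint while preserving the original cause — can be illustrated with a small, Spark-independent Python sketch (all class and function names here are hypothetical, not the fabric8 or Spark API):

```python
class PodCreationError(Exception):
    """Hypothetical wrapper that adds context to a raw client failure."""

def create_pod(client, spec):
    # Delegate to the underlying client, but translate an opaque
    # low-level error into one that tells the user what to check.
    try:
        return client.create(spec)
    except Exception as cause:
        raise PodCreationError(
            "Failed to create driver pod. Check that the k8s master URL "
            "passed via --master is correct and reachable."
        ) from cause

class BadClient:
    """Stand-in for a client pointed at a wrong master URL."""
    def create(self, spec):
        raise RuntimeError("Operation: [create] for kind: [Pod] failed.")

try:
    create_pod(BadClient(), {"name": "spark-pi"})
except PodCreationError as e:
    print(type(e.__cause__).__name__)  # the original error is preserved
```

The `raise ... from cause` chaining keeps the full original stack trace available for debugging while the top-level message points the end-user at the likely misconfiguration.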
[jira] [Commented] (SPARK-35056) Group exception messages in execution/streaming
[ https://issues.apache.org/jira/browse/SPARK-35056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361362#comment-17361362 ] jiaan.geng commented on SPARK-35056: I'm working on. > Group exception messages in execution/streaming > --- > > Key: SPARK-35056 > URL: https://issues.apache.org/jira/browse/SPARK-35056 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > 'sql/core/src/main/scala/org/apache/spark/sql/execution/streaming' > || Filename || Count || > | CheckpointFileManager.scala | 3 | > | CommitLog.scala | 1 | > | CompactibleFileStreamLog.scala | 2 | > | FileStreamOptions.scala | 1 | > | FileStreamSink.scala | 1 | > | GroupStateImpl.scala | 5 | > | HDFSMetadataLog.scala| 4 | > | ManifestFileCommitProtocol.scala | 1 | > | MicroBatchExecution.scala| 5 | > | OffsetSeqLog.scala | 1 | > | Sink.scala | 3 | > | Source.scala | 3 | > | StreamingQueryWrapper.scala | 1 | > | StreamingRelation.scala | 1 | > | StreamingSymmetricHashJoinExec.scala | 2 | > | Triggers.scala | 1 | > | WatermarkTracker.scala | 1 | > | memory.scala | 4 | > | statefulOperators.scala | 2 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
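The "group exception messages" refactor consolidates ad-hoc `throw new ...Exception(msg)` call sites into one central errors object so messages stay consistent and auditable. A minimal Python analogue of the pattern (names hypothetical, not the Spark codebase):

```python
# Hypothetical central "errors" module: every user-facing exception is
# built by a named factory here, instead of being formatted inline at
# each call site scattered across the streaming sources and sinks.
class StreamingErrors:
    @staticmethod
    def batch_metadata_not_found(batch_id):
        return FileNotFoundError(f"Could not find metadata for batch {batch_id}.")

    @staticmethod
    def unsupported_trigger(trigger):
        return ValueError(f"Unsupported trigger: {trigger!r}.")

# Call sites raise through the factory instead of formatting inline:
def load_batch(batch_id, log):
    if batch_id not in log:
        raise StreamingErrors.batch_metadata_not_found(batch_id)
    return log[batch_id]

try:
    load_batch(7, {})
except FileNotFoundError as e:
    print(e)
```

Grouping the factories makes it trivial to review wording, attach error classes, or translate messages in one place rather than at the dozens of sites counted in the table above.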
[jira] [Assigned] (SPARK-35718) Support casting of Date to timestamp without time zone type
[ https://issues.apache.org/jira/browse/SPARK-35718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35718: Assignee: Gengliang Wang (was: Apache Spark) > Support casting of Date to timestamp without time zone type > --- > > Key: SPARK-35718 > URL: https://issues.apache.org/jira/browse/SPARK-35718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support DateType in casting to > TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35718) Support casting of Date to timestamp without time zone type
[ https://issues.apache.org/jira/browse/SPARK-35718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35718: Assignee: Apache Spark (was: Gengliang Wang) > Support casting of Date to timestamp without time zone type > --- > > Key: SPARK-35718 > URL: https://issues.apache.org/jira/browse/SPARK-35718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Extend the Cast expression and support DateType in casting to > TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35718) Support casting of Date to timestamp without time zone type
[ https://issues.apache.org/jira/browse/SPARK-35718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361352#comment-17361352 ] Apache Spark commented on SPARK-35718: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32873 > Support casting of Date to timestamp without time zone type > --- > > Key: SPARK-35718 > URL: https://issues.apache.org/jira/browse/SPARK-35718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support DateType in casting to > TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35718) Support casting of Date to timestamp without time zone type
[ https://issues.apache.org/jira/browse/SPARK-35718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361353#comment-17361353 ] Apache Spark commented on SPARK-35718: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32873 > Support casting of Date to timestamp without time zone type > --- > > Key: SPARK-35718 > URL: https://issues.apache.org/jira/browse/SPARK-35718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support DateType in casting to > TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35639) Add metrics about coalesced partitions to CustomShuffleReader in AQE
[ https://issues.apache.org/jira/browse/SPARK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361351#comment-17361351 ] Apache Spark commented on SPARK-35639: -- User 'ekoifman' has created a pull request for this issue: https://github.com/apache/spark/pull/32872 > Add metrics about coalesced partitions to CustomShuffleReader in AQE > > > Key: SPARK-35639 > URL: https://issues.apache.org/jira/browse/SPARK-35639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Eugene Koifman >Priority: Major > > {{CustomShuffleReaderExec}} reports "number of skewed partitions" and "number > of skewed partition splits". > It would be useful to also report "number of partitions to coalesce" and > "number of coalesced partitions" and include this in string rendering of the > SparkPlan node so that it looks like this > {code:java} > (12) CustomShuffleReader > Input [2]: [a#23, b#24] > Arguments: coalesced 3 partitions into 1 and split 2 skewed partitions into 4 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
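The string rendering proposed in the description can be sketched as a small formatting function; the metric names below are hypothetical stand-ins, not the actual `CustomShuffleReaderExec` metric keys:

```python
def render_reader_args(metrics):
    # Hypothetical rendering of the extra AQE reader metrics proposed in
    # this ticket, matching the example output in the description.
    parts = []
    if metrics.get("numPartitionsToCoalesce"):
        parts.append(f"coalesced {metrics['numPartitionsToCoalesce']} "
                     f"partitions into {metrics['numCoalescedPartitions']}")
    if metrics.get("numSkewedPartitions"):
        parts.append(f"split {metrics['numSkewedPartitions']} skewed "
                     f"partitions into {metrics['numSkewedSplits']}")
    return " and ".join(parts)

print(render_reader_args({
    "numPartitionsToCoalesce": 3, "numCoalescedPartitions": 1,
    "numSkewedPartitions": 2, "numSkewedSplits": 4,
}))  # coalesced 3 partitions into 1 and split 2 skewed partitions into 4
```

Only the metrics that are actually non-zero appear, so a plan node that merely coalesced (or merely split skew) renders a single clause.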
[jira] [Created] (SPARK-35720) Support casting of String to timestamp without time zone type
Gengliang Wang created SPARK-35720: -- Summary: Support casting of String to timestamp without time zone type Key: SPARK-35720 URL: https://issues.apache.org/jira/browse/SPARK-35720 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Extend the Cast expression and support StringType in casting to TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35719) Support casting of timestamp to timestamp without time zone type
Gengliang Wang created SPARK-35719: -- Summary: Support casting of timestamp to timestamp without time zone type Key: SPARK-35719 URL: https://issues.apache.org/jira/browse/SPARK-35719 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Extend the Cast expression and support TimestampType in casting to TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35718) Support casting of Date to timestamp without time zone type
Gengliang Wang created SPARK-35718: -- Summary: Support casting of Date to timestamp without time zone type Key: SPARK-35718 URL: https://issues.apache.org/jira/browse/SPARK-35718 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Extend the Cast expression and support DateType in casting to TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
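Semantically, casting a date to a timestamp-without-time-zone type yields midnight of that date with no zone attached. The intended behavior of the cast in these three sub-tasks can be sketched with the stdlib `datetime` module (this illustrates the semantics only, not Spark's Cast implementation):

```python
from datetime import date, datetime, time

def date_to_timestamp_ntz(d: date) -> datetime:
    # A timestamp without time zone is a naive datetime: the given
    # date at midnight, with tzinfo deliberately left unset, so the
    # value is not reinterpreted under a different session time zone.
    return datetime.combine(d, time.min)

ts = date_to_timestamp_ntz(date(2021, 6, 10))
print(ts.isoformat(), ts.tzinfo)  # 2021-06-10T00:00:00 None
```

The key property is that no time-zone conversion happens at all; the same date always produces the same wall-clock timestamp regardless of where it is evaluated.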
[jira] [Commented] (SPARK-35475) Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace.
[ https://issues.apache.org/jira/browse/SPARK-35475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361311#comment-17361311 ] Apache Spark commented on SPARK-35475: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/32871 > Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace. > - > > Key: SPARK-35475 > URL: https://issues.apache.org/jira/browse/SPARK-35475 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361310#comment-17361310 ] Jungtaek Lim commented on SPARK-24295: -- [~parshabani] Please follow up SPARK-27188. But there're lots of other problems in file stream source/sink, and they require major efforts to deal with. 3rd party data lake solutions (e.g. Apache Hudi, Apache Iceberg, Delta Lake sorted by alphabetially) are trying to deal with these problems (and they already resolved known issues) so you may want to look into these solutions. > Purge Structured streaming FileStreamSinkLog metadata compact file data. > > > Key: SPARK-24295 > URL: https://issues.apache.org/jira/browse/SPARK-24295 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Iqbal Singh >Priority: Major > Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz > > > FileStreamSinkLog metadata logs are concatenated to a single compact file > after defined compact interval. > For long running jobs, compact file size can grow up to 10's of GB's, Causing > slowness while reading the data from FileStreamSinkLog dir as spark is > defaulting to the "__spark__metadata" dir for the read. > We need a functionality to purge the compact file size. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35475) Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace.
[ https://issues.apache.org/jira/browse/SPARK-35475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361309#comment-17361309 ] Apache Spark commented on SPARK-35475: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/32871 > Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace. > - > > Key: SPARK-35475 > URL: https://issues.apache.org/jira/browse/SPARK-35475 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35475) Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace.
[ https://issues.apache.org/jira/browse/SPARK-35475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35475: Assignee: Apache Spark > Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace. > - > > Key: SPARK-35475 > URL: https://issues.apache.org/jira/browse/SPARK-35475 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35475) Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace.
[ https://issues.apache.org/jira/browse/SPARK-35475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35475: Assignee: (was: Apache Spark) > Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace. > - > > Key: SPARK-35475 > URL: https://issues.apache.org/jira/browse/SPARK-35475 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35709) Remove links to Nomad integration in the Documentation
[ https://issues.apache.org/jira/browse/SPARK-35709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-35709. -- Fix Version/s: 3.2.0 Assignee: Paweł Ptaszyński Resolution: Fixed Resolved by https://github.com/apache/spark/pull/32860 > Remove links to Nomad integration in the Documentation > -- > > Key: SPARK-35709 > URL: https://issues.apache.org/jira/browse/SPARK-35709 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2 >Reporter: Paweł Ptaszyński >Assignee: Paweł Ptaszyński >Priority: Trivial > Fix For: 3.2.0 > > > In Project documentation `docs/cluster-overview.md` there is still active > reference to the [Nomad integration project by > Hashicorp|https://github.com/hashicorp/nomad-spark]. > As per the repository project is no longer active. The task is to remove the > reference from the Documentation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361298#comment-17361298 ] Apache Spark commented on SPARK-35439: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/32870 > Children subexpr should come first than parent subexpr in subexpression > elimination > --- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > EquivalentExpressions maintains a map of equivalent expressions. It is > HashMap now so the insertion order is not guaranteed to be preserved later. > Subexpression elimination relies on retrieving subexpressions from the map. > If there is child-parent relationships among the subexpressions, we want the > child expressions come first than parent expressions, so we can replace child > expressions in parent expressions with subexpression evaluation. > For example, we have two different expressions Add(Literal(1), Literal(2)) > and Add(Literal(3), add). > Case 1: child subexpr comes first. Replacing HashMap with LinkedHashMap can > deal with it. > addExprTree(add) > addExprTree(Add(Literal(3), add)) > addExprTree(Add(Literal(3), add)) > Case 2: parent subexpr comes first. For this case, we need to sort equivalent > expressions. > addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the > map first, then add `add` into the map > addExprTree(add) > addExprTree(Add(Literal(3), add)) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35717) pandas_udf crashes in conjunction with .filter()
F. H. created SPARK-35717: - Summary: pandas_udf crashes in conjunction with .filter() Key: SPARK-35717 URL: https://issues.apache.org/jira/browse/SPARK-35717 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.2, 3.1.1, 3.0.0 Environment: Centos 8 with PySpark from conda Reporter: F. H. I wrote the following UDF that always returns some "byte"-type array: {code:python} from typing import Iterator @f.pandas_udf(returnType=t.ByteType()) def spark_gt_mapping_fn(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]: mapping = dict() mapping[(-1, -1)] = -1 mapping[(0, 0)] = 0 mapping[(0, 1)] = 1 mapping[(1, 0)] = 1 mapping[(1, 1)] = 2 def gt_mapping_fn(v): if len(v) != 2: return -3 else: a, b = v return mapping.get((a, b), -2) for x in batch_iter: yield x.apply(gt_mapping_fn).astype("int8") {code} However, every time I'd like to filter on the resulting column, I get the following error: {code:python} # works: ( df .select(spark_gt_mapping_fn(f.col("genotype.calls")).alias("GT")) .limit(10).toPandas() ) # fails: ( df .select(spark_gt_mapping_fn(f.col("genotype.calls")).alias("GT")) .filter("GT > 0") .limit(10).toPandas() ) {code} {code:java} Py4JJavaError: An error occurred while calling o672.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 125) (ouga05.cmm.in.tum.de executor driver): org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by query. 
Memory leaked: (16384) Allocator(stdin reader for python3) 0/16384/34816/9223372036854775807 (res/actual/peak/limit) at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:145) at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124) at org.apache.spark.scheduler.Task.run(Task.scala:147) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376) at 
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:472) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:425) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at
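The mapping logic at the core of the failing UDF is independent of Spark and Arrow, so it can be checked in isolation while the memory-leak interaction with `.filter()` is investigated. A plain-Python version of the reporter's function:

```python
def gt_mapping_fn(v):
    # Map a pair of allele calls to a genotype code; sentinel values
    # flag malformed input (-3) and unknown pairs (-2), as in the
    # reporter's UDF.
    mapping = {(-1, -1): -1, (0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 2}
    if len(v) != 2:
        return -3
    a, b = v
    return mapping.get((a, b), -2)

calls = [[0, 0], [0, 1], [1, 1], [2, 2], [0, 1, 1]]
print([gt_mapping_fn(v) for v in calls])  # [0, 1, 2, -2, -3]
```

Since this pure function behaves correctly on its own, the reported failure points at the iterator-style `pandas_udf` execution path under a filter, not at the mapping itself.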
[jira] [Commented] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361297#comment-17361297 ] Apache Spark commented on SPARK-35439: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/32870 > Children subexpr should come first than parent subexpr in subexpression > elimination > --- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > EquivalentExpressions maintains a map of equivalent expressions. It is > HashMap now so the insertion order is not guaranteed to be preserved later. > Subexpression elimination relies on retrieving subexpressions from the map. > If there is child-parent relationships among the subexpressions, we want the > child expressions come first than parent expressions, so we can replace child > expressions in parent expressions with subexpression evaluation. > For example, we have two different expressions Add(Literal(1), Literal(2)) > and Add(Literal(3), add). > Case 1: child subexpr comes first. Replacing HashMap with LinkedHashMap can > deal with it. > addExprTree(add) > addExprTree(Add(Literal(3), add)) > addExprTree(Add(Literal(3), add)) > Case 2: parent subexpr comes first. For this case, we need to sort equivalent > expressions. > addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the > map first, then add `add` into the map > addExprTree(add) > addExprTree(Add(Literal(3), add)) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
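The ordering problem in the description can be demonstrated without Spark: keep equivalent expressions in an insertion-ordered map, and when the parent happened to be added first ("Case 2"), sort by expression height so children are always emitted before the parents that contain them. This is a sketch of the idea, not Spark's `EquivalentExpressions`:

```python
# Minimal expression tree: a node knows its children, and a child's
# height is strictly smaller than that of any expression containing it.
class Expr:
    def __init__(self, name, *children):
        self.name, self.children = name, children
    def height(self):
        return 1 + max((c.height() for c in self.children), default=0)

add = Expr("Add(1,2)")
parent = Expr("Add(3,add)", add)

# Parent-first insertion order, as in "Case 2" of the description:
seen = {}          # a Python dict preserves insertion order (3.7+)
for e in (parent, add):
    seen.setdefault(id(e), e)

# Sorting by height restores child-before-parent emission order, so
# the child's result can be substituted inside the parent's codegen.
ordered = sorted(seen.values(), key=Expr.height)
print([e.name for e in ordered])  # ['Add(1,2)', 'Add(3,add)']
```

For "Case 1" (child added first) the insertion-ordered map alone already suffices; the height sort is what makes the result independent of insertion order.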
[jira] [Updated] (SPARK-35616) Make astype data-type-based
[ https://issues.apache.org/jira/browse/SPARK-35616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35616: - Description: There are many type checks in `astype` methods. Since `DataTypeOps` is introduced, we should refactor `astype` to make it data-type-based. > Make astype data-type-based > --- > > Key: SPARK-35616 > URL: https://issues.apache.org/jira/browse/SPARK-35616 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > There are many type checks in `astype` methods. Since `DataTypeOps` is > introduced, we should refactor `astype` to make it data-type-based. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
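The refactor described here replaces chains of type checks inside `astype` with per-data-type ops objects. The dispatch pattern can be sketched in plain Python (class names are hypothetical, not the `pyspark.pandas` `DataTypeOps` API):

```python
class DataTypeOps:
    """Hypothetical base: one ops object per source data type."""
    def astype(self, values, target):
        raise NotImplementedError

class NumericOps(DataTypeOps):
    def astype(self, values, target):
        return [target(v) for v in values]

class BooleanOps(DataTypeOps):
    def astype(self, values, target):
        # Booleans cast through int so string round-trips like
        # float('True') never happen.
        return [target(int(v)) for v in values]

# One registry lookup replaces a long if/elif chain of type checks.
OPS = {int: NumericOps(), float: NumericOps(), bool: BooleanOps()}

def astype(values, source, target):
    return OPS[source].astype(values, target)

print(astype([True, False], bool, float))  # [1.0, 0.0]
```

Adding support for a new data type then means registering one new ops class rather than threading another branch through every `astype` method.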
[jira] [Resolved] (SPARK-35593) Support shuffle data recovery on the reused PVCs
[ https://issues.apache.org/jira/browse/SPARK-35593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35593. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32730 [https://github.com/apache/spark/pull/32730] > Support shuffle data recovery on the reused PVCs > > > Key: SPARK-35593 > URL: https://issues.apache.org/jira/browse/SPARK-35593 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35593) Support shuffle data recovery on the reused PVCs
[ https://issues.apache.org/jira/browse/SPARK-35593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-35593: - Assignee: Dongjoon Hyun > Support shuffle data recovery on the reused PVCs > > > Key: SPARK-35593 > URL: https://issues.apache.org/jira/browse/SPARK-35593 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361245#comment-17361245 ] Parsa Shabani edited comment on SPARK-24295 at 6/10/21, 10:43 PM: -- Hello all, any updates on this? [~kabhwan] are you still planning to merge your PR? We are having severe issues with the growing size of the .compact files. The ability to purge or truncate records for older syncs is badly needed, in my opinion. was (Author: parshabani): Hello all, Any updates on this? [~kabhwan] are you still planning to merge your PR? We are having severe issues with the growing size of the .compact files. The functionality for the purging and truncating records for older syncs is badly needed in my opinion. > Purge Structured streaming FileStreamSinkLog metadata compact file data. > > > Key: SPARK-24295 > URL: https://issues.apache.org/jira/browse/SPARK-24295 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Iqbal Singh >Priority: Major > Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz > > > FileStreamSinkLog metadata logs are concatenated to a single compact file > after a defined compact interval. > For long-running jobs, the compact file can grow to tens of GBs, causing > slowness while reading the data from the FileStreamSinkLog dir, as Spark defaults > to the "_spark_metadata" dir for the read. > We need functionality to purge old data from the compact file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361245#comment-17361245 ] Parsa Shabani commented on SPARK-24295: --- Hello all, any updates on this? [~kabhwan] are you still planning to merge your PR? We are having severe issues with the growing size of the .compact files. The ability to purge and truncate records for older syncs is badly needed, in my opinion. > Purge Structured streaming FileStreamSinkLog metadata compact file data. > > > Key: SPARK-24295 > URL: https://issues.apache.org/jira/browse/SPARK-24295 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Iqbal Singh >Priority: Major > Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz > > > FileStreamSinkLog metadata logs are concatenated to a single compact file > after a defined compact interval. > For long-running jobs, the compact file can grow to tens of GBs, causing > slowness while reading the data from the FileStreamSinkLog dir, as Spark defaults > to the "_spark_metadata" dir for the read. > We need functionality to purge old data from the compact file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
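As a rough illustration of the requested behavior, a purge could drop compact-log entries older than a retention cutoff. Everything below ({{SinkLogEntry}}, {{purge}}) is a hypothetical sketch, not Spark's FileStreamSinkLog API:

```scala
// Hypothetical entry type standing in for a sink-log record.
case class SinkLogEntry(path: String, modTimeMs: Long)

// Keep only entries newer than the retention window; older ones are
// dropped when the compact file is rewritten. Illustrative only.
def purge(entries: Seq[SinkLogEntry], nowMs: Long, retentionMs: Long): Seq[SinkLogEntry] =
  entries.filter(e => nowMs - e.modTimeMs <= retentionMs)

val es = Seq(SinkLogEntry("a", 0L), SinkLogEntry("b", 9000L))
// With now = 10s and a 2s retention, only "b" survives.
assert(purge(es, 10000L, 2000L).map(_.path) == Seq("b"))
```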
[jira] [Assigned] (SPARK-33350) Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data
[ https://issues.apache.org/jira/browse/SPARK-33350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-33350: --- Assignee: Ye Zhou > Add support to DiskBlockManager to create merge directory and to get the > local shuffle merged data > -- > > Key: SPARK-33350 > URL: https://issues.apache.org/jira/browse/SPARK-33350 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Assignee: Ye Zhou >Priority: Major > > DiskBlockManager should be able to create the {{merge_manager}} directory, > where the push-based merged shuffle files are written and also create > sub-dirs under it. > It should also be able to serve the local merged shuffle data/index/meta > files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33350) Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data
[ https://issues.apache.org/jira/browse/SPARK-33350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-33350. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32007 [https://github.com/apache/spark/pull/32007] > Add support to DiskBlockManager to create merge directory and to get the > local shuffle merged data > -- > > Key: SPARK-33350 > URL: https://issues.apache.org/jira/browse/SPARK-33350 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Assignee: Ye Zhou >Priority: Major > Fix For: 3.2.0 > > > DiskBlockManager should be able to create the {{merge_manager}} directory, > where the push-based merged shuffle files are written and also create > sub-dirs under it. > It should also be able to serve the local merged shuffle data/index/meta > files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35692) Use int to replace long for EXECUTOR_ID_COUNTER in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35692. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32837 [https://github.com/apache/spark/pull/32837] > Use int to replace long for EXECUTOR_ID_COUNTER in Kubernetes > - > > Key: SPARK-35692 > URL: https://issues.apache.org/jira/browse/SPARK-35692 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.2.0 > > > In Kubernetes deployments, EXECUTOR_ID_COUNTER can be an int, as in other cluster > managers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
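Conceptually, the change amounts to backing the counter with an Int rather than a Long; the sketch below is illustrative only and is not the actual Spark field:

```scala
import java.util.concurrent.atomic.AtomicInteger

// An Int-backed counter comfortably covers realistic executor counts,
// matching what other cluster managers use. Hypothetical sketch only.
val executorIdCounter = new AtomicInteger(0)

def nextExecutorId(): Int = executorIdCounter.incrementAndGet()

assert(nextExecutorId() == 1)
assert(nextExecutorId() == 2)
```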
[jira] [Assigned] (SPARK-35692) Use int to replace long for EXECUTOR_ID_COUNTER in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-35692: - Assignee: Kent Yao > Use int to replace long for EXECUTOR_ID_COUNTER in Kubernetes > - > > Key: SPARK-35692 > URL: https://issues.apache.org/jira/browse/SPARK-35692 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > > In Kubernetes deployments, EXECUTOR_ID_COUNTER can be an int, as in other cluster > managers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35716) Support casting of timestamp without time zone to date type
[ https://issues.apache.org/jira/browse/SPARK-35716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-35716. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32869 [https://github.com/apache/spark/pull/32869] > Support casting of timestamp without time zone to date type > --- > > Key: SPARK-35716 > URL: https://issues.apache.org/jira/browse/SPARK-35716 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0 > > > Extend the Cast expression and support TimestampWithoutTZType in casting to > DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
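Semantically, casting a timestamp without time zone to a date just discards the time-of-day fields, with no time-zone conversion involved; java.time expresses this directly. The sketch below illustrates the semantics only and is not Spark's Cast implementation:

```scala
import java.time.{LocalDate, LocalDateTime}

// A timestamp without time zone casts to a date by dropping the
// time-of-day part; no zone offset is applied.
def castToDate(ts: LocalDateTime): LocalDate = ts.toLocalDate

assert(castToDate(LocalDateTime.of(2021, 6, 10, 23, 59, 59)) == LocalDate.of(2021, 6, 10))
```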
[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361175#comment-17361175 ] binwei yang commented on SPARK-24579: - We implemented a native Apache Arrow data source and native Arrow-to-DMatrix conversion. The solution can decrease the data preparation time to a negligible level. This is actually more important for inference than for training. The link has the details. [https://medium.com/intel-analytics-software/optimizing-the-end-to-end-training-pipeline-on-apache-spark-clusters-80261d6a7b8c] During the implementation, we found three points that could be improved on the Spark side: * There are different ColumnarBatch definitions in Spark. Arrow is now mature and its data format is stable. Why don't we use Arrow's ColumnarBatch directly, e.g. for the RDD cache? * We still don't have an easy way to get RDD[ColumnarBatch]. Here is what we use:
{code:java}
val qe = new QueryExecution(df.sparkSession, df.queryExecution.logical) {
  override protected def preparations: Seq[Rule[SparkPlan]] = {
    Seq(
      PlanSubqueries(sparkSession),
      EnsureRequirements(sparkSession.sessionState.conf),
      CollapseCodegenStages(sparkSession.sessionState.conf))
  }
}
val plan = qe.executedPlan
val rdd: RDD[ColumnarBatch] = plan.executeColumnar()
rdd.map { ... }
{code}
Something like this would be much easier to use:
{code:java}
val columnarDf: RDD[ColumnarBatch] = df.getColumnarDF() // or df.mapColumnarBatch()
{code}
* (Not related to this JIRA.) XGBoost uses TBB mode, so we need to change task.cpus via the stage-level resource management APIs, which is good. But we still need an easy way to collect data from each task thread in the same executor. Ideally we should avoid memory copies during the process. What we do now is like below: we created a customized CoalescePartitioner which combines the DMatrix pointers in the same executor, then concatenates them. I'm thinking about a more general way to do this. 
{code:java}
dmatrixpointerrdd.cache()
dmatrixpointerrdd.foreachPartition(_ => ()) // force materialization of the cache
val coalescedrdd = dmatrixpointerrdd.coalesce(1,
  partitionCoalescer = Some(new ExecutorInProcessCoalescePartitioner()))
coalescedrdd.mapPartitions(...)
{code}
> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks > > > Key: SPARK-24579 > URL: https://issues.apache.org/jira/browse/SPARK-24579 > Project: Spark > Issue Type: Epic > Components: ML, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen > Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange > between Apache Spark and DL%2FAI Frameworks .pdf > > > (see attached SPIP pdf for more details) > At the crossroads of big data and AI, we see both the success of Apache Spark > as a unified > analytics engine and the rise of AI frameworks like TensorFlow and Apache > MXNet (incubating). > Both big data and AI are indispensable components to drive business > innovation and there have > been multiple attempts from both communities to bring them together. > We saw efforts from the AI community to implement data solutions for AI > frameworks like tf.data and tf.Transform. However, with 50+ data sources and > built-in SQL, DataFrames, and Streaming features, Spark remains the community > choice for big data. This is why we saw many efforts to integrate DL/AI > frameworks with Spark to leverage its power, for example, TFRecords data > source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project > Hydrogen, this SPIP takes a different angle at Spark + AI unification. > None of the integrations are possible without exchanging data between Spark > and external DL/AI frameworks. And the performance matters. However, there > doesn’t exist a standard way to exchange data and hence implementation and > performance optimization fall into pieces. 
For example, TensorFlowOnSpark > uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and > save data and pass the RDD records to TensorFlow in Python. And TensorFrames > converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s > Java API. How can we reduce the complexity? > The proposal here is to standardize the data exchange interface (or format) > between Spark and DL/AI frameworks and optimize data conversion from/to this > interface. So DL/AI frameworks can leverage Spark to load data virtually > from anywhere without spending extra effort building complex data solutions, > like reading features from a production data warehouse or streaming model > inference. Spark users can
[jira] [Resolved] (SPARK-32920) Add support in Spark driver to coordinate the finalization of the push/merge phase in push-based shuffle for a given shuffle and the initiation of the reduce stage
[ https://issues.apache.org/jira/browse/SPARK-32920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-32920. - Resolution: Fixed > Add support in Spark driver to coordinate the finalization of the push/merge > phase in push-based shuffle for a given shuffle and the initiation of the > reduce stage > --- > > Key: SPARK-32920 > URL: https://issues.apache.org/jira/browse/SPARK-32920 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Venkata krishnan Sowrirajan >Priority: Major > Fix For: 3.2.0 > > > With push-based shuffle, we are currently decoupling map task executions from > the shuffle block push process. Thus, when all map tasks finish, we might > want to wait for some small extra time to allow more shuffle blocks to get > pushed and merged. This requires some extra coordination in the Spark driver > when it transitions from a shuffle map stage to the corresponding reduce > stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
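The coordination described above boils down to a bounded wait between the map stage finishing and the reduce stage starting. The sketch below illustrates that idea with a plain latch; it is hypothetical and is not Spark's DAGScheduler code:

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Hypothetical: the driver waits a small bounded interval for shuffle
// block pushes to finish merging before scheduling the reduce stage.
val pushesFinalized = new CountDownLatch(1)

// In the real flow, merge finalization would call countDown(); here
// nothing does, so the wait times out and scheduling proceeds anyway.
val finishedInTime = pushesFinalized.await(10, TimeUnit.MILLISECONDS)

assert(!finishedInTime) // timed out; the reduce stage starts regardless
```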
[jira] [Updated] (SPARK-32920) Add support in Spark driver to coordinate the finalization of the push/merge phase in push-based shuffle for a given shuffle and the initiation of the reduce stage
[ https://issues.apache.org/jira/browse/SPARK-32920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-32920: Fix Version/s: 3.2.0 > Add support in Spark driver to coordinate the finalization of the push/merge > phase in push-based shuffle for a given shuffle and the initiation of the > reduce stage > --- > > Key: SPARK-32920 > URL: https://issues.apache.org/jira/browse/SPARK-32920 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Venkata krishnan Sowrirajan >Priority: Major > Fix For: 3.2.0 > > > With push-based shuffle, we are currently decoupling map task executions from > the shuffle block push process. Thus, when all map tasks finish, we might > want to wait for some small extra time to allow more shuffle blocks to get > pushed and merged. This requires some extra coordination in the Spark driver > when it transitions from a shuffle map stage to the corresponding reduce > stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32920) Add support in Spark driver to coordinate the finalization of the push/merge phase in push-based shuffle for a given shuffle and the initiation of the reduce stage
[ https://issues.apache.org/jira/browse/SPARK-32920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-32920: --- Assignee: Venkata krishnan Sowrirajan > Add support in Spark driver to coordinate the finalization of the push/merge > phase in push-based shuffle for a given shuffle and the initiation of the > reduce stage > --- > > Key: SPARK-32920 > URL: https://issues.apache.org/jira/browse/SPARK-32920 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Venkata krishnan Sowrirajan >Priority: Major > > With push-based shuffle, we are currently decoupling map task executions from > the shuffle block push process. Thus, when all map tasks finish, we might > want to wait for some small extra time to allow more shuffle blocks to get > pushed and merged. This requires some extra coordination in the Spark driver > when it transitions from a shuffle map stage to the corresponding reduce > stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32709) Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)
[ https://issues.apache.org/jira/browse/SPARK-32709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361119#comment-17361119 ] Shashank Pedamallu commented on SPARK-32709: [~chengsu], thank you so much for the response. From my understanding, the only pending PR to get writing to Hive bucketed tables working is [https://github.com/apache/spark/pull/30003] However, it does not seem to lift the guard check applied here: [https://github.com/apache/spark/blob/44b695fbb06b0d89783b4838941c68543c5a5c8b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L182-L184] What else is pending? > Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2) > -- > > Key: SPARK-32709 > URL: https://issues.apache.org/jira/browse/SPARK-32709 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > > The Hive ORC/Parquet write code path is the same as the data source v1 code path > (FileFormatWriter). This JIRA is to add support for writing Hive ORC/Parquet > bucketed tables with hivehash. The change is to customize `bucketIdExpression` to > use hivehash when the table is a Hive bucketed table and the Hive version is > 1.x.y or 2.x.y. > > This will allow us to write Hive/Presto-compatible bucketed tables for Hive 1 and > 2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
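Once a hivehash value is computed for a row's bucket columns, the bucket id boils down to a non-negative modulus over the bucket count. The helper below is a hypothetical illustration of that last step only, not Spark's `bucketIdExpression` or the hivehash function itself:

```scala
// Hypothetical illustration of deriving a bucket id from a hash value.
// Hive expects a non-negative bucket id even when the hash is negative,
// so the Scala/Java remainder is shifted into [0, numBuckets).
def bucketId(hash: Int, numBuckets: Int): Int = {
  val r = hash % numBuckets
  if (r < 0) r + numBuckets else r
}

assert(bucketId(10, 4) == 2)
assert(bucketId(-7, 4) == 1) // -7 % 4 == -3 in Scala; shifted into range
```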
[jira] [Commented] (SPARK-35716) Support casting of timestamp without time zone to date type
[ https://issues.apache.org/jira/browse/SPARK-35716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361117#comment-17361117 ] Apache Spark commented on SPARK-35716: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32869 > Support casting of timestamp without time zone to date type > --- > > Key: SPARK-35716 > URL: https://issues.apache.org/jira/browse/SPARK-35716 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35716) Support casting of timestamp without time zone to date type
[ https://issues.apache.org/jira/browse/SPARK-35716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35716: Assignee: Gengliang Wang (was: Apache Spark) > Support casting of timestamp without time zone to date type > --- > > Key: SPARK-35716 > URL: https://issues.apache.org/jira/browse/SPARK-35716 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35716) Support casting of timestamp without time zone to date type
[ https://issues.apache.org/jira/browse/SPARK-35716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35716: Assignee: Apache Spark (was: Gengliang Wang) > Support casting of timestamp without time zone to date type > --- > > Key: SPARK-35716 > URL: https://issues.apache.org/jira/browse/SPARK-35716 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35716) Support casting of timestamp without time zone to date type
[ https://issues.apache.org/jira/browse/SPARK-35716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361116#comment-17361116 ] Apache Spark commented on SPARK-35716: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32869 > Support casting of timestamp without time zone to date type > --- > > Key: SPARK-35716 > URL: https://issues.apache.org/jira/browse/SPARK-35716 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35716) Support casting of timestamp without time zone to date type
Gengliang Wang created SPARK-35716: -- Summary: Support casting of timestamp without time zone to date type Key: SPARK-35716 URL: https://issues.apache.org/jira/browse/SPARK-35716 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Extend the Cast expression and support TimestampWithoutTZType in casting to DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35296) Dataset.observe fails with an assertion
[ https://issues.apache.org/jira/browse/SPARK-35296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35296: --- Assignee: Kousuke Saruta > Dataset.observe fails with an assertion > --- > > Key: SPARK-35296 > URL: https://issues.apache.org/jira/browse/SPARK-35296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Tanel Kiis >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.0.3, 3.2.0, 3.1.3 > > Attachments: 2021-05-03_18-34.png > > > I hit this assertion error when using dataset.observe: > {code} > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:208) ~[scala-library-2.12.10.jar:?] > at > org.apache.spark.sql.execution.AggregatingAccumulator.setState(AggregatingAccumulator.scala:204) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.sql.execution.CollectMetricsExec.$anonfun$doExecute$2(CollectMetricsExec.scala:72) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.sql.execution.CollectMetricsExec.$anonfun$doExecute$2$adapted(CollectMetricsExec.scala:71) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:125) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1$adapted(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1(TaskContextImpl.scala:137) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1$adapted(TaskContextImpl.scala:135) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > ~[scala-library-2.12.10.jar:?] 
> at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > ~[scala-library-2.12.10.jar:?] > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > ~[scala-library-2.12.10.jar:?] > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:135) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at org.apache.spark.scheduler.Task.run(Task.scala:147) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) > [spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > [spark-core_2.12-3.1.1.jar:3.1.1] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [?:1.8.0_282] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [?:1.8.0_282] > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282] > {code} > A workaround that I used was to add .coalesce(1) before calling this method. > It happens in a quite complex query and I have not been able to reproduce > this with a simpler query. > Added a screenshot of the debugger at the moment of the exception: > !2021-05-03_18-34.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35296) Dataset.observe fails with an assertion
[ https://issues.apache.org/jira/browse/SPARK-35296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35296. - Fix Version/s: 3.0.3 3.1.3 3.2.0 Resolution: Fixed Issue resolved by pull request 32786 [https://github.com/apache/spark/pull/32786] > Dataset.observe fails with an assertion > --- > > Key: SPARK-35296 > URL: https://issues.apache.org/jira/browse/SPARK-35296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Tanel Kiis >Priority: Major > Fix For: 3.2.0, 3.1.3, 3.0.3 > > Attachments: 2021-05-03_18-34.png > > > I hit this assertion error when using dataset.observe: > {code} > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:208) ~[scala-library-2.12.10.jar:?] > at > org.apache.spark.sql.execution.AggregatingAccumulator.setState(AggregatingAccumulator.scala:204) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.sql.execution.CollectMetricsExec.$anonfun$doExecute$2(CollectMetricsExec.scala:72) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.sql.execution.CollectMetricsExec.$anonfun$doExecute$2$adapted(CollectMetricsExec.scala:71) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:125) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1$adapted(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1(TaskContextImpl.scala:137) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1$adapted(TaskContextImpl.scala:135) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > 
~[scala-library-2.12.10.jar:?] > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > ~[scala-library-2.12.10.jar:?] > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > ~[scala-library-2.12.10.jar:?] > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:135) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at org.apache.spark.scheduler.Task.run(Task.scala:147) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) > [spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > [spark-core_2.12-3.1.1.jar:3.1.1] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [?:1.8.0_282] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [?:1.8.0_282] > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282] > {code} > A workaround that I used was to add .coalesce(1) before calling this method. > It happens in a quite complex query and I have not been able to reproduce > this with a simpler query. > Added a screenshot of the debugger at the moment of the exception: > !2021-05-03_18-34.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34861) Support nested column in Spark vectorized readers
[ https://issues.apache.org/jira/browse/SPARK-34861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361090#comment-17361090 ] Chao Sun commented on SPARK-34861: -- Synced with [~chengsu] offline and I will take over this JIRA. > Support nested column in Spark vectorized readers > - > > Key: SPARK-34861 > URL: https://issues.apache.org/jira/browse/SPARK-34861 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Minor > > This is the umbrella task to track the overall progress. The task is to > support nested column type in Spark vectorized reader, namely Parquet and > ORC. Currently both Parquet and ORC vectorized readers do not support nested > column type (struct, array and map). We implemented nested column vectorized > reader for FB-ORC in our internal fork of Spark. We are seeing performance > improvement compared to non-vectorized reader when reading nested columns. In > addition, this can also help improve the non-nested column performance when > reading non-nested and nested columns together in one query. > > Parquet: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173] > > ORC: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35715) Option "--files" with local:// prefix is not honoured for Spark on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361086#comment-17361086 ] Erik Krogen commented on SPARK-35715: - Not sure about k8s, but at least for YARN this is expected -- the {{local:}} prefix indicates that the file(s) shouldn't be copied because they're already present on the local filesystems. The {{file:}} scheme is supposed to be used to indicate something that's currently on your local FS but should be distributed for you. Actually, from my understanding, it looks like the bug is when you use {{jars}}, which shouldn't be copying anything since you've used the {{local}} scheme .. But again, not sure if k8s is supposed to work differently. Hopefully someone with more experience there can chime in. > Option "--files" with local:// prefix is not honoured for Spark on kubernetes > - > > Key: SPARK-35715 > URL: https://issues.apache.org/jira/browse/SPARK-35715 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.2, 3.1.2 >Reporter: Pardhu Madipalli >Priority: Major > > When we provide a local file as a dependency using "--files" option, the file > is not getting copied to work directories of executors. > h5. Example 1: > > {code:java} > $SPARK_HOME/bin/spark-submit --master k8s://https:// \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ > --files local:///etc/xattr.conf \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 > {code} > > h6. 
Content of Spark Executor work-dir: > > {code:java} > ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls > /opt/spark/work-dir/ > spark-examples_2.12-3.1.2.jar > {code} > > We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to > _/opt/spark/work-dir/ ._ > > > > Instead of using "--files", if we use the "--jars" option the file is getting > copied as expected. > h5. Example 2: > {code:java} > $SPARK_HOME/bin/spark-submit --master k8s://https:// \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ > --jars local:///etc/xattr.conf \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 > {code} > h6. Content of Spark Executor work-dir: > > {code:java} > ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls > /opt/spark/work-dir/ > spark-examples_2.12-3.1.2.jar > xattr.conf > {code} > We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to > _/opt/spark/work-dir/ ._ > > I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way > in both cases.
[jira] [Updated] (SPARK-35715) Option "--files" with local:// prefix is not honoured for Spark on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pardhu Madipalli updated SPARK-35715: - Description: When we provide a local file as a dependency using "--files" option, the file is not getting copied to work directories of executors. h5. Example 1: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --files local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar {code} We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to _/opt/spark/work-dir/ ._ {{Instead of using "–files", if we use "--jars" option the file is getting copied as expected.}} h5. Example 2: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --jars local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar xattr.conf {code} We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to _/opt/spark/work-dir/ ._ I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way in both cases. 
was: When we provide a local file as a dependency using "--files" option, the file is not getting copied to work directories of executors. h5. Example 1: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --files local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar {code} We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to _/opt/spark/work-dir/ ._ Instead of using "--files", if we use "--jars" option the file is getting copied as expected. h5. Example 2: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --jars local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar xattr.conf {code} We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to _/opt/spark/work-dir/ ._ I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way in both cases. 
> Option "--files" with local:// prefix is not honoured for Spark on kubernetes > - > > Key: SPARK-35715 > URL: https://issues.apache.org/jira/browse/SPARK-35715 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.2, 3.1.2 >Reporter: Pardhu Madipalli >Priority: Major > > When we provide a local file as a dependency using "--files" option, the file > is not getting copied to work directories of executors. > h5. Example 1: > > {code:java} > $SPARK_HOME/bin/spark-submit --master k8s://https:// \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ > --files local:///etc/xattr.conf \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 > {code} > > h6. Content of Spark Executor work-dir: > > {code:java} > ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls > /opt/spark/work-dir/ > spark-examples_2.12-3.1.2.jar > {code} > >
[jira] [Updated] (SPARK-35715) Option "--files" with local:// prefix is not honoured for Spark on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pardhu Madipalli updated SPARK-35715: - Description: When we provide a local file as a dependency using "--files" option, the file is not getting copied to work directories of executors. h5. Example 1: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --files local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar {code} We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to _/opt/spark/work-dir/ ._ Instead of using "--files", if we use "--jars" option the file is getting copied as expected. h5. Example 2: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --jars local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar xattr.conf {code} We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to _/opt/spark/work-dir/ ._ I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way in both cases. was: When we provide a local file as a dependency using "--files" option, the file is not getting copied to work directories of executors. 
h5. Example 1: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --files local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar {code} We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to _/opt/spark/work-dir/ ._ Instead of using "--files", if we use "--jars" option the file is getting copied as expected. h5. Example 2: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --jars local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar xattr.conf {code} We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to _/opt/spark/work-dir/ ._ I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way in both cases. > Option "--files" with local:// prefix is not honoured for Spark on kubernetes > - > > Key: SPARK-35715 > URL: https://issues.apache.org/jira/browse/SPARK-35715 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.2, 3.1.2 >Reporter: Pardhu Madipalli >Priority: Major > > When we provide a local file as a dependency using "--files" option, the file > is not getting copied to work directories of executors. > h5. 
Example 1: > > {code:java} > $SPARK_HOME/bin/spark-submit --master k8s://https:// \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ > --files local:///etc/xattr.conf \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 > {code} > > h6. Content of Spark Executor work-dir: > > {code:java} > ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls > /opt/spark/work-dir/ > spark-examples_2.12-3.1.2.jar > {code} > > We
[jira] [Created] (SPARK-35715) Option "--files" with local:// prefix is not honoured for Spark on kubernetes
Pardhu Madipalli created SPARK-35715: Summary: Option "--files" with local:// prefix is not honoured for Spark on kubernetes Key: SPARK-35715 URL: https://issues.apache.org/jira/browse/SPARK-35715 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.1.2, 3.0.2 Reporter: Pardhu Madipalli When we provide a local file as a dependency using "--files" option, the file is not getting copied to work directories of executors. h5. Example 1: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --files local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar {code} We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to _/opt/spark/work-dir/ ._ Instead of using "--files", if we use "--jars" option the file is getting copied as expected. h5. Example 2: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --jars local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar xattr.conf {code} We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to _/opt/spark/work-dir/ ._ I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way in both cases. 
[jira] [Resolved] (SPARK-35653) [SQL] CatalystToExternalMap interpreted path fails for Map with case classes as keys or values
[ https://issues.apache.org/jira/browse/SPARK-35653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-35653. - Fix Version/s: 3.0.3 3.1.3 3.2.0 Resolution: Fixed Issue resolved by pull request 32783 [https://github.com/apache/spark/pull/32783] > [SQL] CatalystToExternalMap interpreted path fails for Map with case classes > as keys or values > -- > > Key: SPARK-35653 > URL: https://issues.apache.org/jira/browse/SPARK-35653 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.2, 3.2.0 >Reporter: Emil Ejbyfeldt >Assignee: Emil Ejbyfeldt >Priority: Major > Fix For: 3.2.0, 3.1.3, 3.0.3 > > > Interpreted path deserialization fails for Map with case classes as keys or > values while the codegen path works correctly. > To reproduce the issue one can add test cases to the ExpressionEncoderSuite. > For example adding the following > {noformat} > case class IntAndString(i: Int, s: String) > encodeDecodeTest(Map(1 -> IntAndString(1, "a")), "map with case class as > value") > {noformat} > It will succeed for the code gen path while the interpreted path will fail > with > {noformat} > [info] - encode/decode for map with case class as value: Map(1 -> > IntAndString(1,a)) (interpreted path) *** FAILED *** (64 milliseconds) > [info] Encoded/Decoded data does not match input data > [info] > [info] in: Map(1 -> IntAndString(1,a)) > [info] out: Map(1 -> [1,a]) > [info] types: scala.collection.immutable.Map$Map1 [info] > [info] Encoded Data: > [org.apache.spark.sql.catalyst.expressions.UnsafeMapData@5ecf5d9e] > [info] Schema: value#823 > [info] root > [info] -- value: map (nullable = true) > [info] |-- key: integer > [info] |-- value: struct (valueContainsNull = true) > [info] | |-- i: integer (nullable = false) > [info] | |-- s: string (nullable = true) > [info] > [info] > [info] fromRow Expressions: > [info] catalysttoexternalmap(lambdavariable(CatalystToExternalMap_key, > IntegerType, false, 178), 
lambdavariable(CatalystToExternalMap_key, > IntegerType, false, 178), lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179), > if (isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179))) null else newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString), input[0, > map>, true], interface > scala.collection.immutable.Map > [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) > [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) > [info] :- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] :- if (isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179))) null else newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString) > [info] : :- isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)) > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] : :- null > [info] : +- newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString) > [info] : :- assertnotnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179).i) > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179).s.toString > [info] : 
+- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s > [info] : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] +- input[0, map>, true] > (ExpressionEncoderSuite.scala:627) > {noformat} > So the value was not correctly deserialized in the interpreted path. > I have prepared a PR that I will submit for fixing this issue.
[jira] [Assigned] (SPARK-35653) [SQL] CatalystToExternalMap interpreted path fails for Map with case classes as keys or values
[ https://issues.apache.org/jira/browse/SPARK-35653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-35653: --- Assignee: Emil Ejbyfeldt > [SQL] CatalystToExternalMap interpreted path fails for Map with case classes > as keys or values > -- > > Key: SPARK-35653 > URL: https://issues.apache.org/jira/browse/SPARK-35653 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.2, 3.2.0 >Reporter: Emil Ejbyfeldt >Assignee: Emil Ejbyfeldt >Priority: Major > > Interpreted path deserialization fails for Map with case classes as keys or > values while the codegen path works correctly. > To reproduce the issue one can add test cases to the ExpressionEncoderSuite. > For example adding the following > {noformat} > case class IntAndString(i: Int, s: String) > encodeDecodeTest(Map(1 -> IntAndString(1, "a")), "map with case class as > value") > {noformat} > It will succeed for the code gen path while the interpreted path will fail > with > {noformat} > [info] - encode/decode for map with case class as value: Map(1 -> > IntAndString(1,a)) (interpreted path) *** FAILED *** (64 milliseconds) > [info] Encoded/Decoded data does not match input data > [info] > [info] in: Map(1 -> IntAndString(1,a)) > [info] out: Map(1 -> [1,a]) > [info] types: scala.collection.immutable.Map$Map1 [info] > [info] Encoded Data: > [org.apache.spark.sql.catalyst.expressions.UnsafeMapData@5ecf5d9e] > [info] Schema: value#823 > [info] root > [info] -- value: map (nullable = true) > [info] |-- key: integer > [info] |-- value: struct (valueContainsNull = true) > [info] | |-- i: integer (nullable = false) > [info] | |-- s: string (nullable = true) > [info] > [info] > [info] fromRow Expressions: > [info] catalysttoexternalmap(lambdavariable(CatalystToExternalMap_key, > IntegerType, false, 178), lambdavariable(CatalystToExternalMap_key, > IntegerType, false, 178), lambdavariable(CatalystToExternalMap_value, > 
StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179), > if (isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179))) null else newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString), input[0, > map>, true], interface > scala.collection.immutable.Map > [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) > [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) > [info] :- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] :- if (isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179))) null else newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString) > [info] : :- isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)) > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] : :- null > [info] : +- newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString) > [info] : :- assertnotnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179).i) > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179).s.toString > [info] : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), 
true, 179).s > [info] : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] +- input[0, map>, true] > (ExpressionEncoderSuite.scala:627) > {noformat} > So the value was not correctly deserialized in the interpreted path. > I have prepared a PR that I will submit for fixing this issue.
[jira] [Resolved] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
[ https://issues.apache.org/jira/browse/SPARK-35711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-35711. Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32864 [https://github.com/apache/spark/pull/32864] > Support casting of timestamp without time zone to timestamp type > > > Key: SPARK-35711 > URL: https://issues.apache.org/jira/browse/SPARK-35711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0 > > > Extend the Cast expression and support TimestampWithoutTZType in casting to > TimestampType.
[jira] [Commented] (SPARK-35714) Bug fix for deadlock during the executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-35714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360981#comment-17360981 ] Apache Spark commented on SPARK-35714: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/32868 > Bug fix for deadlock during the executor shutdown > - > > Key: SPARK-35714 > URL: https://issues.apache.org/jira/browse/SPARK-35714 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When an executor receives a TERM signal, the (second) TERM signal will lock > the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. > Shutdown will call SparkShutdownHook to shut down the executor. > During the executor shutdown phase, a RemoteProcessDisconnected event will be > sent to the RPC inbox, and then WorkerWatcher will try to call > System.exit(-1) again. > Because java.lang.Shutdown is already locked, a deadlock occurs.
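The deadlock described above is a general JVM shutdown pattern, not something Spark-specific: the JVM holds the java.lang.Shutdown class monitor while shutdown hooks run, so any second call to System.exit() blocks forever on that same monitor. The following sketch stands in an explicit `ReentrantLock` for the Shutdown class lock so the shape of the problem can be observed without actually hanging; the class and method names are illustrative, not taken from the Spark code base.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of the SPARK-35714 deadlock shape. The "shutdown" lock
// stands in for the java.lang.Shutdown class monitor, which the JVM holds
// while shutdown hooks run. A second exit attempt from another thread (here,
// a stand-in for WorkerWatcher calling System.exit(-1)) then blocks on the
// same lock. tryLock() with a timeout lets the demo observe the contention
// instead of deadlocking for real.
class ShutdownDeadlockDemo {
    private static final ReentrantLock SHUTDOWN_LOCK = new ReentrantLock();

    /** Returns true if a second exit attempt could not take the shutdown lock. */
    static boolean secondExitBlocks() {
        SHUTDOWN_LOCK.lock();  // first TERM signal: shutdown hooks are running
        AtomicBoolean blocked = new AtomicBoolean(false);
        Thread watcher = new Thread(() -> {
            // Stand-in for WorkerWatcher reacting to RemoteProcessDisconnected
            // and calling System.exit(-1): it needs the same shutdown lock.
            try {
                if (!SHUTDOWN_LOCK.tryLock(100, TimeUnit.MILLISECONDS)) {
                    blocked.set(true);  // the real code would wait here forever
                } else {
                    SHUTDOWN_LOCK.unlock();
                }
            } catch (InterruptedException ignored) { }
        });
        watcher.start();
        try {
            watcher.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        SHUTDOWN_LOCK.unlock();
        return blocked.get();
    }
}
```

Because the main thread holds the lock for the watcher's entire lifetime, `secondExitBlocks()` deterministically returns true, mirroring why the executor's second exit attempt never completes.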
[jira] [Assigned] (SPARK-35713) Bug fix for thread leak in JobCancellationSuite
[ https://issues.apache.org/jira/browse/SPARK-35713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35713: Assignee: (was: Apache Spark) > Bug fix for thread leak in JobCancellationSuite > --- > > Key: SPARK-35713 > URL: https://issues.apache.org/jira/browse/SPARK-35713 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When we call the Thread.interrupt() method, that thread's interrupt status will > be set, but the thread may not actually be interrupted. > So when a Spark task runs in an infinite loop, the Spark context may fail to > interrupt the task thread, resulting in a thread leak. >
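The interrupt behaviour described above is plain JDK semantics: Thread.interrupt() only sets a status flag, and a thread that never checks the flag and never calls a blocking (interruptible) API keeps running. A minimal standalone sketch (the class and method names are illustrative, not from Spark):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch: Thread.interrupt() only sets the thread's interrupt
// status. A busy loop that never checks Thread.interrupted() and never calls
// a blocking API -- like a task stuck in an infinite loop in user code -- is
// not stopped by it, which is how the thread leaks.
class InterruptDemo {
    /** Returns true if the spinning thread is still alive after interrupt(). */
    static boolean survivesInterrupt() {
        AtomicBoolean stop = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            while (!stop.get()) { /* spin, ignoring the interrupt status */ }
        });
        worker.start();
        worker.interrupt();          // only sets the thread's interrupt status
        try {
            worker.join(200);        // give the thread a chance to (not) exit
            boolean alive = worker.isAlive();
            stop.set(true);          // a cooperative flag is what really stops it
            worker.join();
            return alive;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The worker survives the interrupt and only terminates once the cooperative `stop` flag is set, which is why a cancellation path that relies solely on interrupt() can leave test threads behind.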
[jira] [Assigned] (SPARK-35714) Bug fix for deadlock during the executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-35714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35714: Assignee: Apache Spark > Bug fix for deadlock during the executor shutdown > - > > Key: SPARK-35714 > URL: https://issues.apache.org/jira/browse/SPARK-35714 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Minor > > When an executor receives a TERM signal, the (second) TERM signal will lock > the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. > Shutdown will call SparkShutdownHook to shut down the executor. > During the executor shutdown phase, a RemoteProcessDisconnected event will be > sent to the RPC inbox, and then WorkerWatcher will try to call > System.exit(-1) again. > Because java.lang.Shutdown is already locked, a deadlock occurs.
[jira] [Commented] (SPARK-35714) Bug fix for deadlock during the executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-35714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360977#comment-17360977 ] Apache Spark commented on SPARK-35714: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/32868 > Bug fix for deadlock during the executor shutdown > - > > Key: SPARK-35714 > URL: https://issues.apache.org/jira/browse/SPARK-35714 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When an executor receives a TERM signal, the (second) TERM signal will lock > the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. > Shutdown will call SparkShutdownHook to shut down the executor. > During the executor shutdown phase, a RemoteProcessDisconnected event will be > sent to the RPC inbox, and then WorkerWatcher will try to call > System.exit(-1) again. > Because java.lang.Shutdown is already locked, a deadlock occurs.
[jira] [Assigned] (SPARK-35713) Bug fix for thread leak in JobCancellationSuite
[ https://issues.apache.org/jira/browse/SPARK-35713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35713: Assignee: Apache Spark > Bug fix for thread leak in JobCancellationSuite > --- > > Key: SPARK-35713 > URL: https://issues.apache.org/jira/browse/SPARK-35713 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Minor > > When we call the Thread.interrupt() method, that thread's interrupt status will > be set, but the thread may not actually be interrupted. > So when a Spark task runs in an infinite loop, the Spark context may fail to > interrupt the task thread, resulting in a thread leak. >
[jira] [Assigned] (SPARK-35714) Bug fix for deadlock during the executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-35714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35714: Assignee: (was: Apache Spark) > Bug fix for deadlock during the executor shutdown > - > > Key: SPARK-35714 > URL: https://issues.apache.org/jira/browse/SPARK-35714 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When an executor receives a TERM signal, the (second) TERM signal will lock > the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. > Shutdown will call SparkShutdownHook to shut down the executor. > During the executor shutdown phase, a RemoteProcessDisconnected event will be > sent to the RPC inbox, and then WorkerWatcher will try to call > System.exit(-1) again. > Because java.lang.Shutdown is already locked, a deadlock occurs.
[jira] [Commented] (SPARK-35713) Bug fix for thread leak in JobCancellationSuite
[ https://issues.apache.org/jira/browse/SPARK-35713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360976#comment-17360976 ] Apache Spark commented on SPARK-35713: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/32866 > Bug fix for thread leak in JobCancellationSuite > --- > > Key: SPARK-35713 > URL: https://issues.apache.org/jira/browse/SPARK-35713 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When we call the Thread.interrupt() method, that thread's interrupt status will > be set, but the thread may not actually be interrupted. > So when a Spark task runs in an infinite loop, the Spark context may fail to > interrupt the task thread, resulting in a thread leak. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35714) Bug fix for deadlock during the executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-35714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-35714: Description: When an executor receives a TERM signal, it (the second TERM signal) will lock the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. Shutdown will call SparkShutdownHook to shut down the executor. During the executor shutdown phase, a RemoteProcessDisconnected event will be sent to the RPC inbox, and then WorkerWatcher will try to call System.exit(-1) again. Because java.lang.Shutdown is already locked, a deadlock occurs. was: WHen a executor recived a TERM signal, it (the second TERM signal) will lock java.lang.Shutdown class and then call Shutdown.exit() method to exit the JVM. Shutdown will call SparkShutdownHook to shutdown the executor. During the executor shutdown phase, RemoteProcessDisconnected event will be send to the RPC inbox, and then WorkerWatcher will try to call System.exit(-1) again. Because java.lang.Shutdown has already locked, a deadlock has occurred. > Bug fix for deadlock during the executor shutdown > - > > Key: SPARK-35714 > URL: https://issues.apache.org/jira/browse/SPARK-35714 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When an executor receives a TERM signal, it (the second TERM signal) will lock > the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. > Shutdown will call SparkShutdownHook to shut down the executor. > During the executor shutdown phase, a RemoteProcessDisconnected event will be > sent to the RPC inbox, and then WorkerWatcher will try to call > System.exit(-1) again. > Because java.lang.Shutdown is already locked, a deadlock occurs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35714) Bug fix for deadlock during the executor shutdown
Wan Kun created SPARK-35714: --- Summary: Bug fix for deadlock during the executor shutdown Key: SPARK-35714 URL: https://issues.apache.org/jira/browse/SPARK-35714 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.2 Reporter: Wan Kun When an executor receives a TERM signal, it (the second TERM signal) will lock the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. Shutdown will call SparkShutdownHook to shut down the executor. During the executor shutdown phase, a RemoteProcessDisconnected event will be sent to the RPC inbox, and then WorkerWatcher will try to call System.exit(-1) again. Because java.lang.Shutdown is already locked, a deadlock occurs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
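The deadlock described in SPARK-35714 can be illustrated with a minimal standalone sketch. Note this is a hypothetical demo class, not the Spark/WorkerWatcher code: System.exit() synchronizes on the internal java.lang.Shutdown class while running shutdown hooks, so a second System.exit() issued from inside a hook blocks forever on the lock the in-progress exit already holds. Runtime.halt() terminates the JVM without running hooks or taking that lock, which is one way such code can avoid the hang.

```java
// Hypothetical demo (not Spark code): why calling System.exit() from a
// shutdown hook deadlocks, and how Runtime.halt() sidesteps it.
public class ShutdownDeadlockSketch {
    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.out.println("hook running");
            // System.exit(-1) here would never return: Shutdown.exit() is
            // synchronized on the Shutdown class, and the exit in progress
            // already holds that lock -> deadlock.
            Runtime.getRuntime().halt(0); // terminates immediately, no lock taken
        }));
        System.out.println("first exit");
        System.exit(0); // acquires the Shutdown lock and runs the hooks
    }
}
```

Running the sketch prints "first exit" then "hook running" and terminates; replacing halt(0) with System.exit(-1) makes the process hang, which matches the reported behavior.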
[jira] [Updated] (SPARK-35713) Bug fix for thread leak in JobCancellationSuite
[ https://issues.apache.org/jira/browse/SPARK-35713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-35713: Summary: Bug fix for thread leak in JobCancellationSuite (was: Bugfix for thread leak in JobCancellationSuite) > Bug fix for thread leak in JobCancellationSuite > --- > > Key: SPARK-35713 > URL: https://issues.apache.org/jira/browse/SPARK-35713 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When we call the Thread.interrupt() method, that thread's interrupt status will > be set, but the thread may not actually be interrupted. > So when a Spark task runs in an infinite loop, the Spark context may fail to > interrupt the task thread, resulting in a thread leak. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35713) Bugfix for thread leak in JobCancellationSuite
[ https://issues.apache.org/jira/browse/SPARK-35713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-35713: Description: When we call the Thread.interrupt() method, that thread's interrupt status will be set, but the thread may not actually be interrupted. So when a Spark task runs in an infinite loop, the Spark context may fail to interrupt the task thread, resulting in a thread leak. was: When the task runs in an infinite loop, spark context may fail to interrupt the task thread, resulting in thread leak. > Bugfix for thread leak in JobCancellationSuite > -- > > Key: SPARK-35713 > URL: https://issues.apache.org/jira/browse/SPARK-35713 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When we call the Thread.interrupt() method, that thread's interrupt status will > be set, but the thread may not actually be interrupted. > So when a Spark task runs in an infinite loop, the Spark context may fail to > interrupt the task thread, resulting in a thread leak. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35713) Bugfix for thread leak in JobCancellationSuite
Wan Kun created SPARK-35713: --- Summary: Bugfix for thread leak in JobCancellationSuite Key: SPARK-35713 URL: https://issues.apache.org/jira/browse/SPARK-35713 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.2 Reporter: Wan Kun When the task runs in an infinite loop, the Spark context may fail to interrupt the task thread, resulting in a thread leak. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
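The interrupt behavior behind SPARK-35713 can be reproduced with a minimal standalone JVM sketch (a hypothetical demo, not Spark or JobCancellationSuite code): Thread.interrupt() only sets the target thread's interrupt status flag, so a busy loop that never checks that flag keeps running after interrupt() and the thread leaks.

```java
// Hypothetical demo (not Spark code): interrupt() sets a flag but does not
// stop a busy loop that never checks it.
public class InterruptLeakDemo {
    static volatile boolean stop = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            long n = 0;
            while (!stop) {      // never consults Thread.interrupted()
                n++;
            }
        });
        worker.start();
        worker.interrupt();      // sets the interrupt status, nothing more
        worker.join(200);        // times out: the loop ignores the flag
        System.out.println(worker.isInterrupted()); // true: flag is set
        System.out.println(worker.isAlive());       // true: still looping -> leak
        stop = true;             // cooperative shutdown actually stops it
        worker.join();
    }
}
```

Interruption only works for code that cooperates, either by blocking on interruptible calls or by polling the flag, which is why a task stuck in a tight loop can survive cancellation.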
[jira] [Assigned] (SPARK-35712) Simplify ResolveAggregateFunctions
[ https://issues.apache.org/jira/browse/SPARK-35712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35712: Assignee: Apache Spark > Simplify ResolveAggregateFunctions > -- > > Key: SPARK-35712 > URL: https://issues.apache.org/jira/browse/SPARK-35712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35712) Simplify ResolveAggregateFunctions
[ https://issues.apache.org/jira/browse/SPARK-35712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35712: Assignee: (was: Apache Spark) > Simplify ResolveAggregateFunctions > -- > > Key: SPARK-35712 > URL: https://issues.apache.org/jira/browse/SPARK-35712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35712) Simplify ResolveAggregateFunctions
[ https://issues.apache.org/jira/browse/SPARK-35712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360892#comment-17360892 ] Apache Spark commented on SPARK-35712: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/32470 > Simplify ResolveAggregateFunctions > -- > > Key: SPARK-35712 > URL: https://issues.apache.org/jira/browse/SPARK-35712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35712) Simplify ResolveAggregateFunctions
Wenchen Fan created SPARK-35712: --- Summary: Simplify ResolveAggregateFunctions Key: SPARK-35712 URL: https://issues.apache.org/jira/browse/SPARK-35712 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35701) Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys
[ https://issues.apache.org/jira/browse/SPARK-35701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35701: Assignee: Haiyang Sun (was: Apache Spark) > Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys > --- > > Key: SPARK-35701 > URL: https://issues.apache.org/jira/browse/SPARK-35701 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Haiyang Sun >Assignee: Haiyang Sun >Priority: Major > > The global locks used to protect {{SQLConf.sqlConfEntries}} map and the > {{SQLConf.staticConfKeys}} set can cause significant contention on the > SQLConf instance in a concurrent setting. > Using copy-on-write versions should reduce the contention given that > modifications to the configs are relatively rare. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35701) Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys
[ https://issues.apache.org/jira/browse/SPARK-35701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35701: Assignee: Apache Spark (was: Haiyang Sun) > Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys > --- > > Key: SPARK-35701 > URL: https://issues.apache.org/jira/browse/SPARK-35701 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Haiyang Sun >Assignee: Apache Spark >Priority: Major > > The global locks used to protect {{SQLConf.sqlConfEntries}} map and the > {{SQLConf.staticConfKeys}} set can cause significant contention on the > SQLConf instance in a concurrent setting. > Using copy-on-write versions should reduce the contention given that > modifications to the configs are relatively rare. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35701) Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys
[ https://issues.apache.org/jira/browse/SPARK-35701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360881#comment-17360881 ] Apache Spark commented on SPARK-35701: -- User 'haiyangsun-db' has created a pull request for this issue: https://github.com/apache/spark/pull/32865 > Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys > --- > > Key: SPARK-35701 > URL: https://issues.apache.org/jira/browse/SPARK-35701 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Haiyang Sun >Assignee: Haiyang Sun >Priority: Major > > The global locks used to protect {{SQLConf.sqlConfEntries}} map and the > {{SQLConf.staticConfKeys}} set can cause significant contention on the > SQLConf instance in a concurrent setting. > Using copy-on-write versions should reduce the contention given that > modifications to the configs are relatively rare. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35701) Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys
[ https://issues.apache.org/jira/browse/SPARK-35701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35701: Assignee: Apache Spark (was: Haiyang Sun) > Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys > --- > > Key: SPARK-35701 > URL: https://issues.apache.org/jira/browse/SPARK-35701 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Haiyang Sun >Assignee: Apache Spark >Priority: Major > > The global locks used to protect {{SQLConf.sqlConfEntries}} map and the > {{SQLConf.staticConfKeys}} set can cause significant contention on the > SQLConf instance in a concurrent setting. > Using copy-on-write versions should reduce the contention given that > modifications to the configs are relatively rare. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
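The copy-on-write approach proposed for SPARK-35701 can be sketched with a small hypothetical registry (not Spark's actual SQLConf code; the class and key names are illustrative): reads go through a volatile reference to a snapshot map and take no lock, while rare writes copy the map under a lock and publish the new snapshot.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Spark's SQLConf) of the copy-on-write registry
// pattern: lock-free reads of an effectively-immutable snapshot, with
// rare writes that copy, mutate, and republish under a lock.
public class CowRegistryDemo {
    private volatile Map<String, String> entries = new HashMap<>();

    String get(String key) {
        return entries.get(key);             // lock-free read path
    }

    synchronized void put(String key, String value) {
        Map<String, String> copy = new HashMap<>(entries); // copy on write
        copy.put(key, value);
        entries = copy;                      // publish the new snapshot
    }

    public static void main(String[] args) {
        CowRegistryDemo registry = new CowRegistryDemo();
        registry.put("example.conf.key", "42"); // illustrative key, not a real config
        System.out.println(registry.get("example.conf.key"));
    }
}
```

The trade-off is that each write copies the whole map, which is acceptable exactly when, as the issue notes, modifications are rare relative to reads.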
[jira] [Commented] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
[ https://issues.apache.org/jira/browse/SPARK-35711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360759#comment-17360759 ] Apache Spark commented on SPARK-35711: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32864 > Support casting of timestamp without time zone to timestamp type > > > Key: SPARK-35711 > URL: https://issues.apache.org/jira/browse/SPARK-35711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > TimestampType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
[ https://issues.apache.org/jira/browse/SPARK-35711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360758#comment-17360758 ] Apache Spark commented on SPARK-35711: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32864 > Support casting of timestamp without time zone to timestamp type > > > Key: SPARK-35711 > URL: https://issues.apache.org/jira/browse/SPARK-35711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > TimestampType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
[ https://issues.apache.org/jira/browse/SPARK-35711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35711: Assignee: Apache Spark (was: Gengliang Wang) > Support casting of timestamp without time zone to timestamp type > > > Key: SPARK-35711 > URL: https://issues.apache.org/jira/browse/SPARK-35711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > TimestampType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
[ https://issues.apache.org/jira/browse/SPARK-35711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35711: Assignee: Gengliang Wang (was: Apache Spark) > Support casting of timestamp without time zone to timestamp type > > > Key: SPARK-35711 > URL: https://issues.apache.org/jira/browse/SPARK-35711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > TimestampType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
Gengliang Wang created SPARK-35711: -- Summary: Support casting of timestamp without time zone to timestamp type Key: SPARK-35711 URL: https://issues.apache.org/jira/browse/SPARK-35711 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Extend the Cast expression and support TimestampWithoutTZType in casting to TimestampType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35652) Different Behaviour join vs joinWith in self joining
[ https://issues.apache.org/jira/browse/SPARK-35652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35652: Assignee: (was: Apache Spark) > Different Behaviour join vs joinWith in self joining > > > Key: SPARK-35652 > URL: https://issues.apache.org/jira/browse/SPARK-35652 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: {color:#172b4d}Spark 3.1.2{color} > Scala 2.12 >Reporter: Wassim Almaaoui >Priority: Critical > > It seems that a Spark inner self join performs a cartesian join when using > `joinWith`, but an inner join when using `join`. > Snippet: > > {code:java} > scala> val df = spark.range(0,5) > df: org.apache.spark.sql.Dataset[Long] = [id: bigint] > scala> df.show > +---+ > | id| > +---+ > | 0| > | 1| > | 2| > | 3| > | 4| > +---+ > scala> df.join(df, df("id") === df("id")).count > 21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res21: Long = 5 > scala> df.joinWith(df, df("id") === df("id")).count > 21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res22: Long = 25 > {code} > According to the comment in the source code, joinWith is expected to handle this > case, right? > {code:java} > def joinWith[U](other: Dataset[U], condition: Column, joinType: String): > Dataset[(T, U)] = { > // Creates a Join node and resolve it first, to get join condition > resolved, self-join resolved, > // etc. > {code} > I find it odd that join and joinWith don't have the same behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35652) Different Behaviour join vs joinWith in self joining
[ https://issues.apache.org/jira/browse/SPARK-35652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360747#comment-17360747 ] Apache Spark commented on SPARK-35652: -- User 'dgd-contributor' has created a pull request for this issue: https://github.com/apache/spark/pull/32863 > Different Behaviour join vs joinWith in self joining > > > Key: SPARK-35652 > URL: https://issues.apache.org/jira/browse/SPARK-35652 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: {color:#172b4d}Spark 3.1.2{color} > Scala 2.12 >Reporter: Wassim Almaaoui >Priority: Critical > > It seems that a Spark inner self join performs a cartesian join when using > `joinWith`, but an inner join when using `join`. > Snippet: > > {code:java} > scala> val df = spark.range(0,5) > df: org.apache.spark.sql.Dataset[Long] = [id: bigint] > scala> df.show > +---+ > | id| > +---+ > | 0| > | 1| > | 2| > | 3| > | 4| > +---+ > scala> df.join(df, df("id") === df("id")).count > 21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res21: Long = 5 > scala> df.joinWith(df, df("id") === df("id")).count > 21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res22: Long = 25 > {code} > According to the comment in the source code, joinWith is expected to handle this > case, right? > {code:java} > def joinWith[U](other: Dataset[U], condition: Column, joinType: String): > Dataset[(T, U)] = { > // Creates a Join node and resolve it first, to get join condition > resolved, self-join resolved, > // etc. > {code} > I find it odd that join and joinWith don't have the same behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35652) Different Behaviour join vs joinWith in self joining
[ https://issues.apache.org/jira/browse/SPARK-35652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35652: Assignee: Apache Spark > Different Behaviour join vs joinWith in self joining > > > Key: SPARK-35652 > URL: https://issues.apache.org/jira/browse/SPARK-35652 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: {color:#172b4d}Spark 3.1.2{color} > Scala 2.12 >Reporter: Wassim Almaaoui >Assignee: Apache Spark >Priority: Critical > > It seems that a Spark inner self join performs a cartesian join when using > `joinWith`, but an inner join when using `join`. > Snippet: > > {code:java} > scala> val df = spark.range(0,5) > df: org.apache.spark.sql.Dataset[Long] = [id: bigint] > scala> df.show > +---+ > | id| > +---+ > | 0| > | 1| > | 2| > | 3| > | 4| > +---+ > scala> df.join(df, df("id") === df("id")).count > 21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res21: Long = 5 > scala> df.joinWith(df, df("id") === df("id")).count > 21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res22: Long = 25 > {code} > According to the comment in the source code, joinWith is expected to handle this > case, right? > {code:java} > def joinWith[U](other: Dataset[U], condition: Column, joinType: String): > Dataset[(T, U)] = { > // Creates a Join node and resolve it first, to get join condition > resolved, self-join resolved, > // etc. > {code} > I find it odd that join and joinWith don't have the same behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35695) QueryExecutionListener does not see any observed metrics fired before persist/cache
[ https://issues.apache.org/jira/browse/SPARK-35695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35695: Assignee: (was: Apache Spark) > QueryExecutionListener does not see any observed metrics fired before > persist/cache > --- > > Key: SPARK-35695 > URL: https://issues.apache.org/jira/browse/SPARK-35695 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Major > > This example properly fires the event > {code} > spark.range(100) > .observe( > name = "other_event", > avg($"id").cast("int").as("avg_val")) > .collect() > {code} > But when I add persist, then no event is fired or seen (not sure which): > {code} > spark.range(100) > .observe( > name = "my_event", > avg($"id").cast("int").as("avg_val")) > .persist() > .collect() > {code} > The listener: > {code} > val metricMaps = ArrayBuffer.empty[Map[String, Row]] > val listener = new QueryExecutionListener { > override def onSuccess(funcName: String, qe: QueryExecution, duration: > Long): Unit = { > metricMaps += qe.observedMetrics > } > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = { > // No-op > } > } > spark.listenerManager.register(listener) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35695) QueryExecutionListener does not see any observed metrics fired before persist/cache
[ https://issues.apache.org/jira/browse/SPARK-35695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35695: Assignee: Apache Spark > QueryExecutionListener does not see any observed metrics fired before > persist/cache > --- > > Key: SPARK-35695 > URL: https://issues.apache.org/jira/browse/SPARK-35695 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Major > > This example properly fires the event > {code} > spark.range(100) > .observe( > name = "other_event", > avg($"id").cast("int").as("avg_val")) > .collect() > {code} > But when I add persist, then no event is fired or seen (not sure which): > {code} > spark.range(100) > .observe( > name = "my_event", > avg($"id").cast("int").as("avg_val")) > .persist() > .collect() > {code} > The listener: > {code} > val metricMaps = ArrayBuffer.empty[Map[String, Row]] > val listener = new QueryExecutionListener { > override def onSuccess(funcName: String, qe: QueryExecution, duration: > Long): Unit = { > metricMaps += qe.observedMetrics > } > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = { > // No-op > } > } > spark.listenerManager.register(listener) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org