[jira] [Assigned] (SPARK-35722) wait until something does get queued
    [ https://issues.apache.org/jira/browse/SPARK-35722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35722:
------------------------------------

    Assignee:     (was: Apache Spark)

> wait until something does get queued
> ------------------------------------
>
>                 Key: SPARK-35722
>                 URL: https://issues.apache.org/jira/browse/SPARK-35722
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: yikf
>            Priority: Minor
>             Fix For: 3.2.0
>
> If nothing has been added to the queue, we should wait until something does
> get queued instead of looping after a timeout. This prevents an ineffective
> cycle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35722) wait until something does get queued
    [ https://issues.apache.org/jira/browse/SPARK-35722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35722:
------------------------------------

    Assignee: Apache Spark

> wait until something does get queued
> ------------------------------------
>
>                 Key: SPARK-35722
>                 URL: https://issues.apache.org/jira/browse/SPARK-35722
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: yikf
>            Assignee: Apache Spark
>            Priority: Minor
>             Fix For: 3.2.0
>
> If nothing has been added to the queue, we should wait until something does
> get queued instead of looping after a timeout. This prevents an ineffective
> cycle.
[jira] [Commented] (SPARK-35722) wait until something does get queued
    [ https://issues.apache.org/jira/browse/SPARK-35722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361416#comment-17361416 ]

Apache Spark commented on SPARK-35722:
--------------------------------------

User 'yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/32876

> wait until something does get queued
> ------------------------------------
>
>                 Key: SPARK-35722
>                 URL: https://issues.apache.org/jira/browse/SPARK-35722
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: yikf
>            Priority: Minor
>             Fix For: 3.2.0
>
> If nothing has been added to the queue, we should wait until something does
> get queued instead of looping after a timeout. This prevents an ineffective
> cycle.
[jira] [Resolved] (SPARK-35718) Support casting of Date to timestamp without time zone type
    [ https://issues.apache.org/jira/browse/SPARK-35718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-35718.
---------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 32873
[https://github.com/apache/spark/pull/32873]

> Support casting of Date to timestamp without time zone type
> -----------------------------------------------------------
>
>                 Key: SPARK-35718
>                 URL: https://issues.apache.org/jira/browse/SPARK-35718
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 3.2.0
>
> Extend the Cast expression and support DateType in casting to
> TimestampWithoutTZType
[jira] [Resolved] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths
    [ https://issues.apache.org/jira/browse/SPARK-35640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai resolved SPARK-35640.
-----------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 32777
[https://github.com/apache/spark/pull/32777]

> Refactor Parquet vectorized reader to remove duplicated code paths
> ------------------------------------------------------------------
>
>                 Key: SPARK-35640
>                 URL: https://issues.apache.org/jira/browse/SPARK-35640
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Priority: Major
>             Fix For: 3.2.0
>
> Currently in the Parquet vectorized code path there are many code
> duplications, such as the following:
> {code:java}
> public void readIntegers(
>     int total,
>     WritableColumnVector c,
>     int rowId,
>     int level,
>     VectorizedValuesReader data) throws IOException {
>   int left = total;
>   while (left > 0) {
>     if (this.currentCount == 0) this.readNextGroup();
>     int n = Math.min(left, this.currentCount);
>     switch (mode) {
>       case RLE:
>         if (currentValue == level) {
>           data.readIntegers(n, c, rowId);
>         } else {
>           c.putNulls(rowId, n);
>         }
>         break;
>       case PACKED:
>         for (int i = 0; i < n; ++i) {
>           if (currentBuffer[currentBufferIdx++] == level) {
>             c.putInt(rowId + i, data.readInteger());
>           } else {
>             c.putNull(rowId + i);
>           }
>         }
>         break;
>     }
>     rowId += n;
>     left -= n;
>     currentCount -= n;
>   }
> }
> {code}
> This makes the code hard to maintain, as any change here needs to be
> replicated in 20+ places. The issue becomes more serious when we implement
> column index and complex type support for the vectorized path.
> The original intention was performance. However, nowadays JIT compilers tend
> to be smart about this and will inline virtual calls as much as possible.
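To make the deduplication idea concrete: the RLE/PACKED control flow from the snippet quoted in this message can be written once and parameterized by small callbacks for the type-specific reads and null writes. This is only a sketch under assumed names (`RunReader`, `ValueReader`, `NullWriter`, and the `-1` null marker are all illustrative; the actual refactor in the linked PR may look different):

```java
import java.util.Arrays;

public class GenericRleLoop {
    enum Mode { RLE, PACKED }

    interface RunReader  { void read(int n, int rowId); } // bulk read, e.g. readIntegers(n, c, rowId)
    interface ValueReader { void read(int rowId); }       // single read, e.g. c.putInt(rowId, data.readInteger())
    interface NullWriter  { void write(int rowId); }      // e.g. c.putNull(rowId)

    // The skeleton shared by all the duplicated readXxx methods: process one
    // run of `total` values starting at `rowId`, either a whole RLE run or a
    // packed group with per-value definition levels.
    static void readBatch(int total, int rowId, Mode mode, boolean runIsDefined,
                          boolean[] packedDefined, RunReader readRun,
                          ValueReader readOne, NullWriter putNull) {
        switch (mode) {
            case RLE:
                // A whole run either matches the definition level or is null.
                if (runIsDefined) readRun.read(total, rowId);
                else for (int i = 0; i < total; i++) putNull.write(rowId + i);
                break;
            case PACKED:
                // Per-value decision, exactly as in the duplicated loops.
                for (int i = 0; i < total; i++) {
                    if (packedDefined[i]) readOne.read(rowId + i);
                    else putNull.write(rowId + i);
                }
                break;
        }
    }

    public static void main(String[] args) {
        int[] column = new int[4];
        int[] values = {7, 8};   // stand-in for the encoded value stream
        int[] cursor = {0};
        readBatch(4, 0, Mode.PACKED, false, new boolean[]{true, false, true, false},
            (n, r) -> { for (int i = 0; i < n; i++) column[r + i] = values[cursor[0]++]; },
            r -> column[r] = values[cursor[0]++],
            r -> column[r] = -1);  // -1 stands in for a null marker
        System.out.println(Arrays.toString(column));
    }
}
```

With this shape, adding a new type means supplying three lambdas instead of copying the whole loop, which is the maintenance win the issue describes.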
[jira] [Updated] (SPARK-35722) wait until something does get queued
    [ https://issues.apache.org/jira/browse/SPARK-35722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yikf updated SPARK-35722:
-------------------------
    Summary: wait until something does get queued  (was: wait until until something does get queued)

> wait until something does get queued
> ------------------------------------
>
>                 Key: SPARK-35722
>                 URL: https://issues.apache.org/jira/browse/SPARK-35722
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: yikf
>            Priority: Minor
>             Fix For: 3.2.0
>
> If nothing has been added to the queue, we should wait until something does
> get queued instead of looping after a timeout. This prevents an ineffective
> cycle.
[jira] [Assigned] (SPARK-35721) Path level discover for python unittests
    [ https://issues.apache.org/jira/browse/SPARK-35721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35721:
------------------------------------

    Assignee: Apache Spark

> Path level discover for python unittests
> ----------------------------------------
>
>                 Key: SPARK-35721
>                 URL: https://issues.apache.org/jira/browse/SPARK-35721
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 3.2.0
>            Reporter: Yikun Jiang
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently we need to specify Python test cases manually when we add a new
> test case. Sometimes we forget to add the test case to the module list, and
> then the test case is never executed.
> For example:
> * pyspark-core pyspark.tests.test_pin_thread
> Thus we need an auto-discovery mechanism to find all test cases rather than
> specifying every case manually.
[jira] [Created] (SPARK-35722) wait until until something does get queued
yikf created SPARK-35722:
----------------------------
             Summary: wait until until something does get queued
                 Key: SPARK-35722
                 URL: https://issues.apache.org/jira/browse/SPARK-35722
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: yikf
             Fix For: 3.2.0


If nothing has been added to the queue, we should wait until something does get queued instead of looping after a timeout. This prevents an ineffective cycle.
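The improvement proposed here swaps a timed polling loop for a blocking wait. A minimal JVM sketch of the two patterns, using `java.util.concurrent.LinkedBlockingQueue` (the class and method names below are illustrative, not Spark's actual listener-bus code):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueWait {
    // Before: poll with a timeout and loop while nothing is queued. Each
    // expired timeout is a wasted wakeup -- the "ineffective cycle" from the
    // issue description.
    static <E> E pollUntilAvailable(BlockingQueue<E> queue, long timeoutMs)
            throws InterruptedException {
        E element;
        while ((element = queue.poll(timeoutMs, TimeUnit.MILLISECONDS)) == null) {
            // nothing was enqueued during the timeout; loop and wait again
        }
        return element;
    }

    // After: block until something does get queued. The thread parks and is
    // woken only when an element actually arrives, so there are no spurious
    // timeout-driven iterations.
    static <E> E waitUntilQueued(BlockingQueue<E> queue) throws InterruptedException {
        return queue.take();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("event");
        System.out.println(waitUntilQueued(queue));
    }
}
```

Both methods return the same element when the queue is non-empty; the difference only shows up while the queue is empty, where `take()` parks the consumer instead of repeatedly timing out.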
[jira] [Assigned] (SPARK-35721) Path level discover for python unittests
    [ https://issues.apache.org/jira/browse/SPARK-35721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35721:
------------------------------------

    Assignee:     (was: Apache Spark)

> Path level discover for python unittests
> ----------------------------------------
>
>                 Key: SPARK-35721
>                 URL: https://issues.apache.org/jira/browse/SPARK-35721
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 3.2.0
>            Reporter: Yikun Jiang
>            Priority: Major
>
> Currently we need to specify Python test cases manually when we add a new
> test case. Sometimes we forget to add the test case to the module list, and
> then the test case is never executed.
> For example:
> * pyspark-core pyspark.tests.test_pin_thread
> Thus we need an auto-discovery mechanism to find all test cases rather than
> specifying every case manually.
[jira] [Commented] (SPARK-35721) Path level discover for python unittests
    [ https://issues.apache.org/jira/browse/SPARK-35721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361397#comment-17361397 ]

Apache Spark commented on SPARK-35721:
--------------------------------------

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32867

> Path level discover for python unittests
> ----------------------------------------
>
>                 Key: SPARK-35721
>                 URL: https://issues.apache.org/jira/browse/SPARK-35721
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 3.2.0
>            Reporter: Yikun Jiang
>            Priority: Major
>
> Currently we need to specify Python test cases manually when we add a new
> test case. Sometimes we forget to add the test case to the module list, and
> then the test case is never executed.
> For example:
> * pyspark-core pyspark.tests.test_pin_thread
> Thus we need an auto-discovery mechanism to find all test cases rather than
> specifying every case manually.
[jira] [Created] (SPARK-35721) Path level discover for python unittests
Yikun Jiang created SPARK-35721:
-----------------------------------
             Summary: Path level discover for python unittests
                 Key: SPARK-35721
                 URL: https://issues.apache.org/jira/browse/SPARK-35721
             Project: Spark
          Issue Type: Bug
          Components: Tests
    Affects Versions: 3.2.0
            Reporter: Yikun Jiang


Currently we need to specify Python test cases manually when we add a new test case. Sometimes we forget to add the test case to the module list, and then the test case is never executed.

For example:
* pyspark-core pyspark.tests.test_pin_thread

Thus we need an auto-discovery mechanism to find all test cases rather than specifying every case manually.
[jira] [Resolved] (SPARK-34512) Disable validate default values when parsing Avro schemas
    [ https://issues.apache.org/jira/browse/SPARK-34512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-34512.
-----------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 32750
[https://github.com/apache/spark/pull/32750]

> Disable validate default values when parsing Avro schemas
> ---------------------------------------------------------
>
>                 Key: SPARK-34512
>                 URL: https://issues.apache.org/jira/browse/SPARK-34512
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.2.0
>
> This is a regression. How to reproduce this issue:
> {code:scala}
> // Add this test to HiveSerDeReadWriteSuite
> test("SPARK-34512") {
>   withTable("t1") {
>     hiveClient.runSqlHive(
>       """
>         |CREATE TABLE t1
>         |  ROW FORMAT SERDE
>         |  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>         |  STORED AS INPUTFORMAT
>         |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>         |  OUTPUTFORMAT
>         |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>         |  TBLPROPERTIES (
>         |    'avro.schema.literal'='{
>         |      "namespace": "org.apache.spark.sql.hive.test",
>         |      "name": "schema_with_default_value",
>         |      "type": "record",
>         |      "fields": [
>         |        {
>         |          "name": "ARRAY_WITH_DEFAULT",
>         |          "type": {"type": "array", "items": "string"},
>         |          "default": null
>         |        }
>         |      ]
>         |    }')
>         |""".stripMargin)
>     spark.sql("select * from t1").show
>   }
> }
> {code}
> {noformat}
> org.apache.avro.AvroTypeException: Invalid default for field
> ARRAY_WITH_DEFAULT: null not a {"type":"array","items":"string"}
>         at org.apache.avro.Schema.validateDefault(Schema.java:1571)
>         at org.apache.avro.Schema.access$500(Schema.java:87)
>         at org.apache.avro.Schema$Field.<init>(Schema.java:544)
>         at org.apache.avro.Schema.parse(Schema.java:1678)
>         at org.apache.avro.Schema$Parser.parse(Schema.java:1425)
>         at org.apache.avro.Schema$Parser.parse(Schema.java:1413)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFor(AvroSerdeUtils.java:268)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:111)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:187)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:107)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83)
>         at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)
>         at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450)
>         at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437)
>         at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)
>         at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263)
>         at org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:641)
>         at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:624)
>         at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:831)
>         at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:867)
>         at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
>         at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
>         at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>         at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>         at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>         at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:820)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
>         at
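For reference, the knob involved is on Avro's own schema parser: since Avro 1.9, `org.apache.avro.Schema.Parser` exposes `setValidateDefaults(boolean)`, and parsing the same schema literal with default validation off does not throw. A minimal sketch of that Avro API (this illustrates the library behavior, not the exact change Spark made in the linked PR; it requires avro on the classpath):

```java
import org.apache.avro.Schema;

public class ParseWithoutDefaultValidation {
    public static void main(String[] args) {
        // The same shape of schema as in the reproducer: an array field with
        // "default": null, which strict validation rejects.
        String literal =
            "{\"namespace\": \"org.apache.spark.sql.hive.test\"," +
            " \"name\": \"schema_with_default_value\"," +
            " \"type\": \"record\"," +
            " \"fields\": [{" +
            "   \"name\": \"ARRAY_WITH_DEFAULT\"," +
            "   \"type\": {\"type\": \"array\", \"items\": \"string\"}," +
            "   \"default\": null}]}";
        // With validation on (the default in Avro >= 1.9) this parse throws
        // AvroTypeException, as in the stack trace above. Turning default
        // validation off lets the legacy schema parse.
        Schema schema = new Schema.Parser().setValidateDefaults(false).parse(literal);
        System.out.println(schema.getField("ARRAY_WITH_DEFAULT").schema().getType());
    }
}
```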
[jira] [Assigned] (SPARK-35703) Remove HashClusteredDistribution
    [ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35703:
------------------------------------

    Assignee: Apache Spark

> Remove HashClusteredDistribution
> --------------------------------
>
>                 Key: SPARK-35703
>                 URL: https://issues.apache.org/jira/browse/SPARK-35703
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and
> {{ClusteredDistribution}}. The only difference between the two is that the
> former is stricter when deciding whether a bucket join is allowed to avoid a
> shuffle: compared to the latter, it requires an *exact* match between the
> clustering keys from the output partitioning (i.e., {{HashPartitioning}})
> and the join keys. However, this is unnecessary, as we should be able to
> avoid the shuffle when the set of clustering keys is a subset of the join
> keys, just like {{ClusteredDistribution}}.
[jira] [Commented] (SPARK-35703) Remove HashClusteredDistribution
    [ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361369#comment-17361369 ]

Apache Spark commented on SPARK-35703:
--------------------------------------

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32875

> Remove HashClusteredDistribution
> --------------------------------
>
>                 Key: SPARK-35703
>                 URL: https://issues.apache.org/jira/browse/SPARK-35703
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and
> {{ClusteredDistribution}}. The only difference between the two is that the
> former is stricter when deciding whether a bucket join is allowed to avoid a
> shuffle: compared to the latter, it requires an *exact* match between the
> clustering keys from the output partitioning (i.e., {{HashPartitioning}})
> and the join keys. However, this is unnecessary, as we should be able to
> avoid the shuffle when the set of clustering keys is a subset of the join
> keys, just like {{ClusteredDistribution}}.
[jira] [Assigned] (SPARK-35703) Remove HashClusteredDistribution
    [ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35703:
------------------------------------

    Assignee:     (was: Apache Spark)

> Remove HashClusteredDistribution
> --------------------------------
>
>                 Key: SPARK-35703
>                 URL: https://issues.apache.org/jira/browse/SPARK-35703
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and
> {{ClusteredDistribution}}. The only difference between the two is that the
> former is stricter when deciding whether a bucket join is allowed to avoid a
> shuffle: compared to the latter, it requires an *exact* match between the
> clustering keys from the output partitioning (i.e., {{HashPartitioning}})
> and the join keys. However, this is unnecessary, as we should be able to
> avoid the shuffle when the set of clustering keys is a subset of the join
> keys, just like {{ClusteredDistribution}}.
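The relaxation proposed for SPARK-35703 can be stated as a one-line predicate change. A hypothetical illustration with plain Java sets standing in for Catalyst expressions (the class and method names are invented for this sketch):

```java
import java.util.HashSet;
import java.util.List;

public class ClusteringCompatibility {
    // HashClusteredDistribution-style check: the clustering keys must equal
    // the join keys exactly, so a table hash-partitioned on (a) cannot serve
    // a join on (a, b) without a shuffle.
    static boolean exactMatch(List<String> clusteringKeys, List<String> joinKeys) {
        return clusteringKeys.equals(joinKeys);
    }

    // ClusteredDistribution-style check: a subset of the join keys suffices,
    // since two rows that agree on (a, b) necessarily agree on a alone and
    // therefore land in the same hash partition.
    static boolean subsetMatch(List<String> clusteringKeys, List<String> joinKeys) {
        return !clusteringKeys.isEmpty()
            && new HashSet<>(joinKeys).containsAll(clusteringKeys);
    }

    public static void main(String[] args) {
        List<String> clustering = List.of("a");
        List<String> join = List.of("a", "b");
        System.out.println(exactMatch(clustering, join));   // false: shuffle forced
        System.out.println(subsetMatch(clustering, join));  // true: shuffle avoided
    }
}
```

The subset rule is safe for co-partitioned joins for the reason in the comment: equality on the full join key implies equality on any subset of it.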
[jira] [Assigned] (SPARK-34512) Disable validate default values when parsing Avro schemas
    [ https://issues.apache.org/jira/browse/SPARK-34512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-34512:
-------------------------------------

    Assignee: Yuming Wang

> Disable validate default values when parsing Avro schemas
> ---------------------------------------------------------
>
>                 Key: SPARK-34512
>                 URL: https://issues.apache.org/jira/browse/SPARK-34512
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>
> This is a regression. How to reproduce this issue:
> {code:scala}
> // Add this test to HiveSerDeReadWriteSuite
> test("SPARK-34512") {
>   withTable("t1") {
>     hiveClient.runSqlHive(
>       """
>         |CREATE TABLE t1
>         |  ROW FORMAT SERDE
>         |  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>         |  STORED AS INPUTFORMAT
>         |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>         |  OUTPUTFORMAT
>         |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>         |  TBLPROPERTIES (
>         |    'avro.schema.literal'='{
>         |      "namespace": "org.apache.spark.sql.hive.test",
>         |      "name": "schema_with_default_value",
>         |      "type": "record",
>         |      "fields": [
>         |        {
>         |          "name": "ARRAY_WITH_DEFAULT",
>         |          "type": {"type": "array", "items": "string"},
>         |          "default": null
>         |        }
>         |      ]
>         |    }')
>         |""".stripMargin)
>     spark.sql("select * from t1").show
>   }
> }
> {code}
> {noformat}
> org.apache.avro.AvroTypeException: Invalid default for field
> ARRAY_WITH_DEFAULT: null not a {"type":"array","items":"string"}
>         at org.apache.avro.Schema.validateDefault(Schema.java:1571)
>         at org.apache.avro.Schema.access$500(Schema.java:87)
>         at org.apache.avro.Schema$Field.<init>(Schema.java:544)
>         at org.apache.avro.Schema.parse(Schema.java:1678)
>         at org.apache.avro.Schema$Parser.parse(Schema.java:1425)
>         at org.apache.avro.Schema$Parser.parse(Schema.java:1413)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFor(AvroSerdeUtils.java:268)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:111)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:187)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:107)
>         at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83)
>         at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)
>         at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450)
>         at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437)
>         at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)
>         at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263)
>         at org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:641)
>         at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:624)
>         at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:831)
>         at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:867)
>         at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
>         at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
>         at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>         at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>         at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>         at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:820)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
>         at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:800)
>         at
[jira] [Commented] (SPARK-35703) Remove HashClusteredDistribution
    [ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361371#comment-17361371 ]

Apache Spark commented on SPARK-35703:
--------------------------------------

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32875

> Remove HashClusteredDistribution
> --------------------------------
>
>                 Key: SPARK-35703
>                 URL: https://issues.apache.org/jira/browse/SPARK-35703
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and
> {{ClusteredDistribution}}. The only difference between the two is that the
> former is stricter when deciding whether a bucket join is allowed to avoid a
> shuffle: compared to the latter, it requires an *exact* match between the
> clustering keys from the output partitioning (i.e., {{HashPartitioning}})
> and the join keys. However, this is unnecessary, as we should be able to
> avoid the shuffle when the set of clustering keys is a subset of the join
> keys, just like {{ClusteredDistribution}}.
[jira] [Commented] (SPARK-35623) Volcano resource manager for Spark on Kubernetes
    [ https://issues.apache.org/jira/browse/SPARK-35623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361368#comment-17361368 ]

Kevin Su commented on SPARK-35623:
----------------------------------

[~dipanjanK] Thanks for proposing this feature. I'm interested in contributing to it.

> Volcano resource manager for Spark on Kubernetes
> ------------------------------------------------
>
>                 Key: SPARK-35623
>                 URL: https://issues.apache.org/jira/browse/SPARK-35623
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: Kubernetes
>    Affects Versions: 3.1.1, 3.1.2
>            Reporter: Dipanjan Kailthya
>            Priority: Minor
>              Labels: kubernetes, resourcemanager
>
> Dear Spark Developers,
>
> Hello from the Netherlands! Posting this here as I still haven't been
> accepted to post on the Spark dev mailing list.
>
> My team is planning to use Spark with Kubernetes support on our shared
> (multi-tenant) on-premise Kubernetes cluster. However, we would like certain
> scheduling features, such as fair share and preemption, which as we
> understand are not built into the current spark-kubernetes resource manager
> yet. We have been working on, and are close to, a first successful prototype
> integration with Volcano ([https://volcano.sh/en/docs/]). Briefly, this
> means a new resource manager component with much in common with the existing
> spark-kubernetes resource manager, but instead of pods it launches Volcano
> jobs, which delegate driver and executor pod creation and lifecycle
> management to Volcano. We are interested in contributing this to open
> source, either directly in Spark or as a separate project.
>
> So, two questions:
>
> 1. Do the Spark maintainers see this as a valuable contribution to the
> mainline Spark codebase? If so, can we have some guidance on how to publish
> the changes?
>
> 2. Are any other developers / organizations interested in contributing to
> this effort? If so, please get in touch.
>
> Best,
> Dipanjan
[jira] [Assigned] (SPARK-35699) Improve error message when creating k8s pod failed.
    [ https://issues.apache.org/jira/browse/SPARK-35699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35699:
------------------------------------

    Assignee:     (was: Apache Spark)

> Improve error message when creating k8s pod failed.
> ---------------------------------------------------
>
>                 Key: SPARK-35699
>                 URL: https://issues.apache.org/jira/browse/SPARK-35699
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.1.2
>            Reporter: Kevin Su
>            Priority: Minor
>
> If I use the wrong k8s master URL, I get the error message below. I think we
> could improve the error message, so end-users can better understand what
> went wrong.
> {code:java}
> (base) ➜  spark git:(master) ./bin/spark-submit \
>  --master k8s://https://192.168.49.3:8443 \
>  --name spark-pi \
>  --class org.apache.spark.examples.SparkPi \
>  --conf spark.executor.instances=3 \
>  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>  --conf spark.kubernetes.container.image=pingsutw/spark:testing \
>  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar
> 21/06/09 20:50:37 WARN Utils: Your hostname, kobe-pc resolves to a loopback address: 127.0.1.1; using 192.168.103.20 instead (on interface ens160)
> 21/06/09 20:50:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
> 21/06/09 20:50:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 21/06/09 20:50:38 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
> 21/06/09 20:50:39 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
> Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [default] failed.
>         at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>         at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
>         at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
>         at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
> {code}
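One common shape for the kind of improvement requested here is to catch the low-level client exception and rethrow it with actionable context. This is a hypothetical sketch only, not the code from the linked PR; the helper name and message text are invented:

```java
public class PodCreationError {
    // Wrap a low-level failure with the configuration the user can actually
    // check, instead of surfacing only
    // "Operation: [create] for kind: [Pod] with name: [null] ... failed."
    static RuntimeException withContext(Throwable cause, String masterUrl, String namespace) {
        return new RuntimeException(
            "Failed to create the driver pod in namespace '" + namespace + "'. "
            + "Check that the Kubernetes master URL '" + masterUrl
            + "' is correct and reachable, and that the configured service "
            + "account has permission to create pods.", cause);
    }

    public static void main(String[] args) {
        RuntimeException wrapped = withContext(
            new IllegalStateException("Operation: [create] for kind: [Pod] failed"),
            "k8s://https://192.168.49.3:8443", "default");
        System.out.println(wrapped.getMessage());
    }
}
```

Keeping the original exception as the cause preserves the full stack trace for debugging while the top-level message points the user at the likely misconfiguration.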
[jira] [Assigned] (SPARK-35699) Improve error message when creating k8s pod failed.
    [ https://issues.apache.org/jira/browse/SPARK-35699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35699:
------------------------------------

    Assignee: Apache Spark

> Improve error message when creating k8s pod failed.
> ---------------------------------------------------
>
>                 Key: SPARK-35699
>                 URL: https://issues.apache.org/jira/browse/SPARK-35699
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.1.2
>            Reporter: Kevin Su
>            Assignee: Apache Spark
>            Priority: Minor
>
> If I use the wrong k8s master URL, I get the error message below. I think we
> could improve the error message, so end-users can better understand what
> went wrong.
> {code:java}
> (base) ➜  spark git:(master) ./bin/spark-submit \
>  --master k8s://https://192.168.49.3:8443 \
>  --name spark-pi \
>  --class org.apache.spark.examples.SparkPi \
>  --conf spark.executor.instances=3 \
>  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>  --conf spark.kubernetes.container.image=pingsutw/spark:testing \
>  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar
> 21/06/09 20:50:37 WARN Utils: Your hostname, kobe-pc resolves to a loopback address: 127.0.1.1; using 192.168.103.20 instead (on interface ens160)
> 21/06/09 20:50:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
> 21/06/09 20:50:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 21/06/09 20:50:38 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
> 21/06/09 20:50:39 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
> Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [default] failed.
>         at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>         at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
>         at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
>         at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
> {code}
[jira] [Commented] (SPARK-35699) Improve error message when creating k8s pod failed.
[ https://issues.apache.org/jira/browse/SPARK-35699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361366#comment-17361366 ] Apache Spark commented on SPARK-35699: -- User 'pingsutw' has created a pull request for this issue: https://github.com/apache/spark/pull/32874 > Improve error message when creating k8s pod failed. > --- > > Key: SPARK-35699 > URL: https://issues.apache.org/jira/browse/SPARK-35699 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.2 >Reporter: Kevin Su >Priority: Minor > > > If I use wrong k8s master URL, I will get the below error message. > I think we could improve error message, so end-users can get a better > understanding of the message. > {code:java} > (base) ➜ spark git:(master) ./bin/spark-submit \ > --master k8s://https://192.168.49.3:8443 \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=3 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.container.image=pingsutw/spark:testing \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar > 21/06/09 20:50:37 WARN Utils: Your hostname, kobe-pc resolves to a loopback > address: 127.0.1.1; using 192.168.103.20 instead (on interface ens160) > 21/06/09 20:50:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 21/06/09 20:50:38 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 21/06/09 20:50:38 INFO SparkKubernetesClientFactory: Auto-configuring K8S > client using current context from users K8S config file > 21/06/09 20:50:39 INFO KerberosConfDriverFeatureStep: You have not specified > a krb5.conf file locally or via a ConfigMap. Make sure that you have the > krb5.conf locally on the driver image. 
> Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] > for kind: [Pod] with name: [null] in namespace: [default] failed. > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35699) Improve error message when creating k8s pod failed.
[ https://issues.apache.org/jira/browse/SPARK-35699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361363#comment-17361363 ] Apache Spark commented on SPARK-35699: -- User 'pingsutw' has created a pull request for this issue: https://github.com/apache/spark/pull/32874 > Improve error message when creating k8s pod failed. > --- > > Key: SPARK-35699 > URL: https://issues.apache.org/jira/browse/SPARK-35699 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.2 >Reporter: Kevin Su >Priority: Minor > > > If I use wrong k8s master URL, I will get the below error message. > I think we could improve error message, so end-users can get a better > understanding of the message. > {code:java} > (base) ➜ spark git:(master) ./bin/spark-submit \ > --master k8s://https://192.168.49.3:8443 \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=3 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.container.image=pingsutw/spark:testing \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar > 21/06/09 20:50:37 WARN Utils: Your hostname, kobe-pc resolves to a loopback > address: 127.0.1.1; using 192.168.103.20 instead (on interface ens160) > 21/06/09 20:50:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 21/06/09 20:50:38 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 21/06/09 20:50:38 INFO SparkKubernetesClientFactory: Auto-configuring K8S > client using current context from users K8S config file > 21/06/09 20:50:39 INFO KerberosConfDriverFeatureStep: You have not specified > a krb5.conf file locally or via a ConfigMap. Make sure that you have the > krb5.conf locally on the driver image. 
> Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] > for kind: [Pod] with name: [null] in namespace: [default] failed. > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
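The improvement requested in SPARK-35699 — wrapping an opaque low-level client failure with an actionable hint while preserving the original cause — can be illustrated with a small, Spark-independent Python sketch (all class and function names here are hypothetical, not the fabric8 or Spark API):

```python
class PodCreationError(Exception):
    """Hypothetical wrapper that adds context to a raw client failure."""

def create_pod(client, spec):
    # Delegate to the underlying client, but translate an opaque
    # low-level error into one that tells the user what to check.
    try:
        return client.create(spec)
    except Exception as cause:
        raise PodCreationError(
            "Failed to create driver pod. Check that the k8s master URL "
            "passed via --master is correct and reachable."
        ) from cause

class BadClient:
    """Stand-in for a client pointed at a wrong master URL."""
    def create(self, spec):
        raise RuntimeError("Operation: [create] for kind: [Pod] failed.")

try:
    create_pod(BadClient(), {"name": "spark-pi"})
except PodCreationError as e:
    print(type(e.__cause__).__name__)  # the original error is preserved
```

The `raise ... from cause` chaining keeps the full original stack trace available for debugging while the top-level message points the end-user at the likely misconfiguration.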
[jira] [Commented] (SPARK-35056) Group exception messages in execution/streaming
[ https://issues.apache.org/jira/browse/SPARK-35056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361362#comment-17361362 ] jiaan.geng commented on SPARK-35056: I'm working on. > Group exception messages in execution/streaming > --- > > Key: SPARK-35056 > URL: https://issues.apache.org/jira/browse/SPARK-35056 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > 'sql/core/src/main/scala/org/apache/spark/sql/execution/streaming' > || Filename || Count || > | CheckpointFileManager.scala | 3 | > | CommitLog.scala | 1 | > | CompactibleFileStreamLog.scala | 2 | > | FileStreamOptions.scala | 1 | > | FileStreamSink.scala | 1 | > | GroupStateImpl.scala | 5 | > | HDFSMetadataLog.scala| 4 | > | ManifestFileCommitProtocol.scala | 1 | > | MicroBatchExecution.scala| 5 | > | OffsetSeqLog.scala | 1 | > | Sink.scala | 3 | > | Source.scala | 3 | > | StreamingQueryWrapper.scala | 1 | > | StreamingRelation.scala | 1 | > | StreamingSymmetricHashJoinExec.scala | 2 | > | Triggers.scala | 1 | > | WatermarkTracker.scala | 1 | > | memory.scala | 4 | > | statefulOperators.scala | 2 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
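The "group exception messages" refactor consolidates ad-hoc `throw new ...Exception(msg)` call sites into one central errors object so messages stay consistent and auditable. A minimal Python analogue of the pattern (names hypothetical, not the Spark codebase):

```python
# Hypothetical central "errors" module: every user-facing exception is
# built by a named factory here, instead of being formatted inline at
# each call site scattered across the streaming sources and sinks.
class StreamingErrors:
    @staticmethod
    def batch_metadata_not_found(batch_id):
        return FileNotFoundError(f"Could not find metadata for batch {batch_id}.")

    @staticmethod
    def unsupported_trigger(trigger):
        return ValueError(f"Unsupported trigger: {trigger!r}.")

# Call sites raise through the factory instead of formatting inline:
def load_batch(batch_id, log):
    if batch_id not in log:
        raise StreamingErrors.batch_metadata_not_found(batch_id)
    return log[batch_id]

try:
    load_batch(7, {})
except FileNotFoundError as e:
    print(e)
```

Grouping the factories makes it trivial to review wording, attach error classes, or translate messages in one place rather than at the dozens of sites counted in the table above.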
[jira] [Assigned] (SPARK-35718) Support casting of Date to timestamp without time zone type
[ https://issues.apache.org/jira/browse/SPARK-35718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35718: Assignee: Gengliang Wang (was: Apache Spark) > Support casting of Date to timestamp without time zone type > --- > > Key: SPARK-35718 > URL: https://issues.apache.org/jira/browse/SPARK-35718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support DateType in casting to > TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35718) Support casting of Date to timestamp without time zone type
[ https://issues.apache.org/jira/browse/SPARK-35718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35718: Assignee: Apache Spark (was: Gengliang Wang) > Support casting of Date to timestamp without time zone type > --- > > Key: SPARK-35718 > URL: https://issues.apache.org/jira/browse/SPARK-35718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Extend the Cast expression and support DateType in casting to > TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35718) Support casting of Date to timestamp without time zone type
[ https://issues.apache.org/jira/browse/SPARK-35718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361352#comment-17361352 ] Apache Spark commented on SPARK-35718: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32873 > Support casting of Date to timestamp without time zone type > --- > > Key: SPARK-35718 > URL: https://issues.apache.org/jira/browse/SPARK-35718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support DateType in casting to > TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35718) Support casting of Date to timestamp without time zone type
[ https://issues.apache.org/jira/browse/SPARK-35718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361353#comment-17361353 ] Apache Spark commented on SPARK-35718: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32873 > Support casting of Date to timestamp without time zone type > --- > > Key: SPARK-35718 > URL: https://issues.apache.org/jira/browse/SPARK-35718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support DateType in casting to > TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35639) Add metrics about coalesced partitions to CustomShuffleReader in AQE
[ https://issues.apache.org/jira/browse/SPARK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361351#comment-17361351 ] Apache Spark commented on SPARK-35639: -- User 'ekoifman' has created a pull request for this issue: https://github.com/apache/spark/pull/32872 > Add metrics about coalesced partitions to CustomShuffleReader in AQE > > > Key: SPARK-35639 > URL: https://issues.apache.org/jira/browse/SPARK-35639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Eugene Koifman >Priority: Major > > {{CustomShuffleReaderExec}} reports "number of skewed partitions" and "number > of skewed partition splits". > It would be useful to also report "number of partitions to coalesce" and > "number of coalesced partitions" and include this in string rendering of the > SparkPlan node so that it looks like this > {code:java} > (12) CustomShuffleReader > Input [2]: [a#23, b#24] > Arguments: coalesced 3 partitions into 1 and split 2 skewed partitions into 4 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
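The string rendering proposed in the description can be sketched as a small formatting function; the metric names below are hypothetical stand-ins, not the actual `CustomShuffleReaderExec` metric keys:

```python
def render_reader_args(metrics):
    # Hypothetical rendering of the extra AQE reader metrics proposed in
    # this ticket, matching the example output in the description.
    parts = []
    if metrics.get("numPartitionsToCoalesce"):
        parts.append(f"coalesced {metrics['numPartitionsToCoalesce']} "
                     f"partitions into {metrics['numCoalescedPartitions']}")
    if metrics.get("numSkewedPartitions"):
        parts.append(f"split {metrics['numSkewedPartitions']} skewed "
                     f"partitions into {metrics['numSkewedSplits']}")
    return " and ".join(parts)

print(render_reader_args({
    "numPartitionsToCoalesce": 3, "numCoalescedPartitions": 1,
    "numSkewedPartitions": 2, "numSkewedSplits": 4,
}))  # coalesced 3 partitions into 1 and split 2 skewed partitions into 4
```

Only the metrics that are actually non-zero appear, so a plan node that merely coalesced (or merely split skew) renders a single clause.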
[jira] [Created] (SPARK-35720) Support casting of String to timestamp without time zone type
Gengliang Wang created SPARK-35720: -- Summary: Support casting of String to timestamp without time zone type Key: SPARK-35720 URL: https://issues.apache.org/jira/browse/SPARK-35720 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Extend the Cast expression and support StringType in casting to TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35719) Support casting of timestamp to timestamp without time zone type
Gengliang Wang created SPARK-35719: -- Summary: Support casting of timestamp to timestamp without time zone type Key: SPARK-35719 URL: https://issues.apache.org/jira/browse/SPARK-35719 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Extend the Cast expression and support TimestampType in casting to TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35718) Support casting of Date to timestamp without time zone type
Gengliang Wang created SPARK-35718: -- Summary: Support casting of Date to timestamp without time zone type Key: SPARK-35718 URL: https://issues.apache.org/jira/browse/SPARK-35718 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Extend the Cast expression and support DateType in casting to TimestampWithoutTZType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
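Semantically, casting a date to a timestamp-without-time-zone type yields midnight of that date with no zone attached. The intended behavior of the cast in these three sub-tasks can be sketched with the stdlib `datetime` module (this illustrates the semantics only, not Spark's Cast implementation):

```python
from datetime import date, datetime, time

def date_to_timestamp_ntz(d: date) -> datetime:
    # A timestamp without time zone is a naive datetime: the given
    # date at midnight, with tzinfo deliberately left unset, so the
    # value is not reinterpreted under a different session time zone.
    return datetime.combine(d, time.min)

ts = date_to_timestamp_ntz(date(2021, 6, 10))
print(ts.isoformat(), ts.tzinfo)  # 2021-06-10T00:00:00 None
```

The key property is that no time-zone conversion happens at all; the same date always produces the same wall-clock timestamp regardless of where it is evaluated.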
[jira] [Commented] (SPARK-35475) Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace.
[ https://issues.apache.org/jira/browse/SPARK-35475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361311#comment-17361311 ] Apache Spark commented on SPARK-35475: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/32871 > Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace. > - > > Key: SPARK-35475 > URL: https://issues.apache.org/jira/browse/SPARK-35475 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361310#comment-17361310 ] Jungtaek Lim commented on SPARK-24295: -- [~parshabani] Please follow up SPARK-27188. But there're lots of other problems in file stream source/sink, and they require major efforts to deal with. 3rd party data lake solutions (e.g. Apache Hudi, Apache Iceberg, Delta Lake sorted by alphabetially) are trying to deal with these problems (and they already resolved known issues) so you may want to look into these solutions. > Purge Structured streaming FileStreamSinkLog metadata compact file data. > > > Key: SPARK-24295 > URL: https://issues.apache.org/jira/browse/SPARK-24295 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Iqbal Singh >Priority: Major > Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz > > > FileStreamSinkLog metadata logs are concatenated to a single compact file > after defined compact interval. > For long running jobs, compact file size can grow up to 10's of GB's, Causing > slowness while reading the data from FileStreamSinkLog dir as spark is > defaulting to the "__spark__metadata" dir for the read. > We need a functionality to purge the compact file size. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35475) Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace.
[ https://issues.apache.org/jira/browse/SPARK-35475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361309#comment-17361309 ] Apache Spark commented on SPARK-35475: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/32871 > Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace. > - > > Key: SPARK-35475 > URL: https://issues.apache.org/jira/browse/SPARK-35475 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35475) Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace.
[ https://issues.apache.org/jira/browse/SPARK-35475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35475: Assignee: Apache Spark > Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace. > - > > Key: SPARK-35475 > URL: https://issues.apache.org/jira/browse/SPARK-35475 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35475) Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace.
[ https://issues.apache.org/jira/browse/SPARK-35475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35475: Assignee: (was: Apache Spark) > Enable disallow_untyped_defs mypy check for pyspark.pandas.namespace. > - > > Key: SPARK-35475 > URL: https://issues.apache.org/jira/browse/SPARK-35475 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35709) Remove links to Nomad integration in the Documentation
[ https://issues.apache.org/jira/browse/SPARK-35709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-35709. -- Fix Version/s: 3.2.0 Assignee: Paweł Ptaszyński Resolution: Fixed Resolved by https://github.com/apache/spark/pull/32860 > Remove links to Nomad integration in the Documentation > -- > > Key: SPARK-35709 > URL: https://issues.apache.org/jira/browse/SPARK-35709 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2 >Reporter: Paweł Ptaszyński >Assignee: Paweł Ptaszyński >Priority: Trivial > Fix For: 3.2.0 > > > In Project documentation `docs/cluster-overview.md` there is still active > reference to the [Nomad integration project by > Hashicorp|https://github.com/hashicorp/nomad-spark]. > As per the repository project is no longer active. The task is to remove the > reference from the Documentation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361298#comment-17361298 ] Apache Spark commented on SPARK-35439: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/32870 > Children subexpr should come first than parent subexpr in subexpression > elimination > --- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > EquivalentExpressions maintains a map of equivalent expressions. It is > HashMap now so the insertion order is not guaranteed to be preserved later. > Subexpression elimination relies on retrieving subexpressions from the map. > If there is child-parent relationships among the subexpressions, we want the > child expressions come first than parent expressions, so we can replace child > expressions in parent expressions with subexpression evaluation. > For example, we have two different expressions Add(Literal(1), Literal(2)) > and Add(Literal(3), add). > Case 1: child subexpr comes first. Replacing HashMap with LinkedHashMap can > deal with it. > addExprTree(add) > addExprTree(Add(Literal(3), add)) > addExprTree(Add(Literal(3), add)) > Case 2: parent subexpr comes first. For this case, we need to sort equivalent > expressions. > addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the > map first, then add `add` into the map > addExprTree(add) > addExprTree(Add(Literal(3), add)) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35717) pandas_udf crashes in conjunction with .filter()
F. H. created SPARK-35717: - Summary: pandas_udf crashes in conjunction with .filter() Key: SPARK-35717 URL: https://issues.apache.org/jira/browse/SPARK-35717 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.2, 3.1.1, 3.0.0 Environment: Centos 8 with PySpark from conda Reporter: F. H. I wrote the following UDF that always returns some "byte"-type array: {code:python} from typing import Iterator @f.pandas_udf(returnType=t.ByteType()) def spark_gt_mapping_fn(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]: mapping = dict() mapping[(-1, -1)] = -1 mapping[(0, 0)] = 0 mapping[(0, 1)] = 1 mapping[(1, 0)] = 1 mapping[(1, 1)] = 2 def gt_mapping_fn(v): if len(v) != 2: return -3 else: a, b = v return mapping.get((a, b), -2) for x in batch_iter: yield x.apply(gt_mapping_fn).astype("int8") {code} However, every time I'd like to filter on the resulting column, I get the following error: {code:python} # works: ( df .select(spark_gt_mapping_fn(f.col("genotype.calls")).alias("GT")) .limit(10).toPandas() ) # fails: ( df .select(spark_gt_mapping_fn(f.col("genotype.calls")).alias("GT")) .filter("GT > 0") .limit(10).toPandas() ) {code} {code:java} Py4JJavaError: An error occurred while calling o672.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 125) (ouga05.cmm.in.tum.de executor driver): org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by query. 
Memory leaked: (16384) Allocator(stdin reader for python3) 0/16384/34816/9223372036854775807 (res/actual/peak/limit) at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:145) at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124) at org.apache.spark.scheduler.Task.run(Task.scala:147) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376) at 
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:472) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:425) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at
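The mapping logic at the core of the failing UDF is independent of Spark and Arrow, so it can be checked in isolation while the memory-leak interaction with `.filter()` is investigated. A plain-Python version of the reporter's function:

```python
def gt_mapping_fn(v):
    # Map a pair of allele calls to a genotype code; sentinel values
    # flag malformed input (-3) and unknown pairs (-2), as in the
    # reporter's UDF.
    mapping = {(-1, -1): -1, (0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 2}
    if len(v) != 2:
        return -3
    a, b = v
    return mapping.get((a, b), -2)

calls = [[0, 0], [0, 1], [1, 1], [2, 2], [0, 1, 1]]
print([gt_mapping_fn(v) for v in calls])  # [0, 1, 2, -2, -3]
```

Since this pure function behaves correctly on its own, the reported failure points at the iterator-style `pandas_udf` execution path under a filter, not at the mapping itself.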
[jira] [Commented] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361297#comment-17361297 ] Apache Spark commented on SPARK-35439: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/32870 > Children subexpr should come first than parent subexpr in subexpression > elimination > --- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > EquivalentExpressions maintains a map of equivalent expressions. It is > HashMap now so the insertion order is not guaranteed to be preserved later. > Subexpression elimination relies on retrieving subexpressions from the map. > If there is child-parent relationships among the subexpressions, we want the > child expressions come first than parent expressions, so we can replace child > expressions in parent expressions with subexpression evaluation. > For example, we have two different expressions Add(Literal(1), Literal(2)) > and Add(Literal(3), add). > Case 1: child subexpr comes first. Replacing HashMap with LinkedHashMap can > deal with it. > addExprTree(add) > addExprTree(Add(Literal(3), add)) > addExprTree(Add(Literal(3), add)) > Case 2: parent subexpr comes first. For this case, we need to sort equivalent > expressions. > addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the > map first, then add `add` into the map > addExprTree(add) > addExprTree(Add(Literal(3), add)) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
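The ordering problem in the description can be demonstrated without Spark: keep equivalent expressions in an insertion-ordered map, and when the parent happened to be added first ("Case 2"), sort by expression height so children are always emitted before the parents that contain them. This is a sketch of the idea, not Spark's `EquivalentExpressions`:

```python
# Minimal expression tree: a node knows its children, and a child's
# height is strictly smaller than that of any expression containing it.
class Expr:
    def __init__(self, name, *children):
        self.name, self.children = name, children
    def height(self):
        return 1 + max((c.height() for c in self.children), default=0)

add = Expr("Add(1,2)")
parent = Expr("Add(3,add)", add)

# Parent-first insertion order, as in "Case 2" of the description:
seen = {}          # a Python dict preserves insertion order (3.7+)
for e in (parent, add):
    seen.setdefault(id(e), e)

# Sorting by height restores child-before-parent emission order, so
# the child's result can be substituted inside the parent's codegen.
ordered = sorted(seen.values(), key=Expr.height)
print([e.name for e in ordered])  # ['Add(1,2)', 'Add(3,add)']
```

For "Case 1" (child added first) the insertion-ordered map alone already suffices; the height sort is what makes the result independent of insertion order.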
[jira] [Updated] (SPARK-35616) Make astype data-type-based
[ https://issues.apache.org/jira/browse/SPARK-35616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35616: - Description: There are many type checks in `astype` methods. Since `DataTypeOps` is introduced, we should refactor `astype` to make it data-type-based. > Make astype data-type-based > --- > > Key: SPARK-35616 > URL: https://issues.apache.org/jira/browse/SPARK-35616 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > There are many type checks in `astype` methods. Since `DataTypeOps` is > introduced, we should refactor `astype` to make it data-type-based. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
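The refactor described here replaces chains of type checks inside `astype` with per-data-type ops objects. The dispatch pattern can be sketched in plain Python (class names are hypothetical, not the `pyspark.pandas` `DataTypeOps` API):

```python
class DataTypeOps:
    """Hypothetical base: one ops object per source data type."""
    def astype(self, values, target):
        raise NotImplementedError

class NumericOps(DataTypeOps):
    def astype(self, values, target):
        return [target(v) for v in values]

class BooleanOps(DataTypeOps):
    def astype(self, values, target):
        # Booleans cast through int so string round-trips like
        # float('True') never happen.
        return [target(int(v)) for v in values]

# One registry lookup replaces a long if/elif chain of type checks.
OPS = {int: NumericOps(), float: NumericOps(), bool: BooleanOps()}

def astype(values, source, target):
    return OPS[source].astype(values, target)

print(astype([True, False], bool, float))  # [1.0, 0.0]
```

Adding support for a new data type then means registering one new ops class rather than threading another branch through every `astype` method.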
[jira] [Resolved] (SPARK-35593) Support shuffle data recovery on the reused PVCs
[ https://issues.apache.org/jira/browse/SPARK-35593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35593. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32730 [https://github.com/apache/spark/pull/32730] > Support shuffle data recovery on the reused PVCs > > > Key: SPARK-35593 > URL: https://issues.apache.org/jira/browse/SPARK-35593 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35593) Support shuffle data recovery on the reused PVCs
[ https://issues.apache.org/jira/browse/SPARK-35593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-35593: - Assignee: Dongjoon Hyun > Support shuffle data recovery on the reused PVCs > > > Key: SPARK-35593 > URL: https://issues.apache.org/jira/browse/SPARK-35593 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361245#comment-17361245 ] Parsa Shabani edited comment on SPARK-24295 at 6/10/21, 10:43 PM: -- Hello all, any updates on this? [~kabhwan] are you still planning to merge your PR? We are having severe issues with the growing size of the .compact files. The ability to purge or truncate records for older syncs is badly needed, in my opinion. was (Author: parshabani): Hello all, Any updates on this? [~kabhwan] are you still planning to merge your PR? We are having severe issues with the growing size of the .compact files. The functionality for the purging and truncating records for older syncs is badly needed in my opinion. > Purge Structured streaming FileStreamSinkLog metadata compact file data. > > > Key: SPARK-24295 > URL: https://issues.apache.org/jira/browse/SPARK-24295 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Iqbal Singh >Priority: Major > Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz > > > FileStreamSinkLog metadata logs are concatenated to a single compact file > after a defined compact interval. > For long-running jobs, the compact file can grow to tens of GBs, causing > slowness while reading the data from the FileStreamSinkLog dir, as Spark defaults > to the "_spark_metadata" dir for the read. > We need functionality to purge old data from the compact file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361245#comment-17361245 ] Parsa Shabani commented on SPARK-24295: --- Hello all, any updates on this? [~kabhwan] are you still planning to merge your PR? We are having severe issues with the growing size of the .compact files. The ability to purge and truncate records for older syncs is badly needed, in my opinion. > Purge Structured streaming FileStreamSinkLog metadata compact file data. > > > Key: SPARK-24295 > URL: https://issues.apache.org/jira/browse/SPARK-24295 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Iqbal Singh >Priority: Major > Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz > > > FileStreamSinkLog metadata logs are concatenated to a single compact file > after a defined compact interval. > For long-running jobs, the compact file can grow to tens of GBs, causing > slowness while reading the data from the FileStreamSinkLog dir, as Spark defaults > to the "_spark_metadata" dir for the read. > We need functionality to purge old data from the compact file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
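As a rough illustration of the requested behavior, a purge could drop compact-log entries older than a retention cutoff. Everything below ({{SinkLogEntry}}, {{purge}}) is a hypothetical sketch, not Spark's FileStreamSinkLog API:

```scala
// Hypothetical entry type standing in for a sink-log record.
case class SinkLogEntry(path: String, modTimeMs: Long)

// Keep only entries newer than the retention window; older ones are
// dropped when the compact file is rewritten. Illustrative only.
def purge(entries: Seq[SinkLogEntry], nowMs: Long, retentionMs: Long): Seq[SinkLogEntry] =
  entries.filter(e => nowMs - e.modTimeMs <= retentionMs)

val es = Seq(SinkLogEntry("a", 0L), SinkLogEntry("b", 9000L))
// With now = 10s and a 2s retention, only "b" survives.
assert(purge(es, 10000L, 2000L).map(_.path) == Seq("b"))
```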
[jira] [Assigned] (SPARK-33350) Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data
[ https://issues.apache.org/jira/browse/SPARK-33350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-33350: --- Assignee: Ye Zhou > Add support to DiskBlockManager to create merge directory and to get the > local shuffle merged data > -- > > Key: SPARK-33350 > URL: https://issues.apache.org/jira/browse/SPARK-33350 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Assignee: Ye Zhou >Priority: Major > > DiskBlockManager should be able to create the {{merge_manager}} directory, > where the push-based merged shuffle files are written and also create > sub-dirs under it. > It should also be able to serve the local merged shuffle data/index/meta > files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33350) Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data
[ https://issues.apache.org/jira/browse/SPARK-33350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-33350. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32007 [https://github.com/apache/spark/pull/32007] > Add support to DiskBlockManager to create merge directory and to get the > local shuffle merged data > -- > > Key: SPARK-33350 > URL: https://issues.apache.org/jira/browse/SPARK-33350 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Assignee: Ye Zhou >Priority: Major > Fix For: 3.2.0 > > > DiskBlockManager should be able to create the {{merge_manager}} directory, > where the push-based merged shuffle files are written and also create > sub-dirs under it. > It should also be able to serve the local merged shuffle data/index/meta > files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35692) Use int to replace long for EXECUTOR_ID_COUNTER in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35692. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32837 [https://github.com/apache/spark/pull/32837] > Use int to replace long for EXECUTOR_ID_COUNTER in Kubernetes > - > > Key: SPARK-35692 > URL: https://issues.apache.org/jira/browse/SPARK-35692 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.2.0 > > > In Kubernetes deployments, EXECUTOR_ID_COUNTER can be an int, as in other cluster > managers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
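Conceptually, the change amounts to backing the counter with an Int rather than a Long; the sketch below is illustrative only and is not the actual Spark field:

```scala
import java.util.concurrent.atomic.AtomicInteger

// An Int-backed counter comfortably covers realistic executor counts,
// matching what other cluster managers use. Hypothetical sketch only.
val executorIdCounter = new AtomicInteger(0)

def nextExecutorId(): Int = executorIdCounter.incrementAndGet()

assert(nextExecutorId() == 1)
assert(nextExecutorId() == 2)
```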
[jira] [Assigned] (SPARK-35692) Use int to replace long for EXECUTOR_ID_COUNTER in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-35692: - Assignee: Kent Yao > Use int to replace long for EXECUTOR_ID_COUNTER in Kubernetes > - > > Key: SPARK-35692 > URL: https://issues.apache.org/jira/browse/SPARK-35692 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > > In Kubernetes deployments, EXECUTOR_ID_COUNTER can be an int, as in other cluster > managers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35716) Support casting of timestamp without time zone to date type
[ https://issues.apache.org/jira/browse/SPARK-35716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-35716. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32869 [https://github.com/apache/spark/pull/32869] > Support casting of timestamp without time zone to date type > --- > > Key: SPARK-35716 > URL: https://issues.apache.org/jira/browse/SPARK-35716 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0 > > > Extend the Cast expression and support TimestampWithoutTZType in casting to > DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
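Semantically, casting a timestamp without time zone to a date just discards the time-of-day fields, with no time-zone conversion involved; java.time expresses this directly. The sketch below illustrates the semantics only and is not Spark's Cast implementation:

```scala
import java.time.{LocalDate, LocalDateTime}

// A timestamp without time zone casts to a date by dropping the
// time-of-day part; no zone offset is applied.
def castToDate(ts: LocalDateTime): LocalDate = ts.toLocalDate

assert(castToDate(LocalDateTime.of(2021, 6, 10, 23, 59, 59)) == LocalDate.of(2021, 6, 10))
```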
[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361175#comment-17361175 ] binwei yang commented on SPARK-24579: - We implemented a native Apache Arrow data source and native Arrow-to-DMatrix conversion. The solution can decrease the data preparation time to a negligible level. This is actually more important for inference than for training. The link has the details. [https://medium.com/intel-analytics-software/optimizing-the-end-to-end-training-pipeline-on-apache-spark-clusters-80261d6a7b8c] During the implementation, we found three points that could be improved on the Spark side: * There are different ColumnarBatch definitions in Spark. Arrow is now mature and its data format is stable. Why don't we use Arrow's ColumnarBatch directly, e.g. for the RDD cache? * We still don't have an easy way to get RDD[ColumnarBatch]. Here is what we use:
{code:java}
val qe = new QueryExecution(df.sparkSession, df.queryExecution.logical) {
  override protected def preparations: Seq[Rule[SparkPlan]] = {
    Seq(
      PlanSubqueries(sparkSession),
      EnsureRequirements(sparkSession.sessionState.conf),
      CollapseCodegenStages(sparkSession.sessionState.conf))
  }
}
val plan = qe.executedPlan
val rdd: RDD[ColumnarBatch] = plan.executeColumnar()
rdd.map { ... }
{code}
Something like this would be much easier to use:
{code:java}
val columnarDf: RDD[ColumnarBatch] = df.getColumnarDF() // or df.mapColumnarBatch()
{code}
* (Not related to this JIRA.) XGBoost uses TBB mode, so we need to change task.cpus via the stage-level resource management APIs, which is good. But we still need an easy way to collect data from each task thread in the same executor. Ideally we should avoid memory copies during the process. What we do now is like below: we created a customized CoalescePartitioner which combines the DMatrix pointers in the same executor, then concatenates them. I'm thinking about a more general way to do this. 
{code:java}
dmatrixpointerrdd.cache()
dmatrixpointerrdd.foreachPartition(_ => ()) // force materialization of the cache
val coalescedrdd = dmatrixpointerrdd.coalesce(1,
  partitionCoalescer = Some(new ExecutorInProcessCoalescePartitioner()))
coalescedrdd.mapPartitions(...)
{code}
> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks > > > Key: SPARK-24579 > URL: https://issues.apache.org/jira/browse/SPARK-24579 > Project: Spark > Issue Type: Epic > Components: ML, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen > Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange > between Apache Spark and DL%2FAI Frameworks .pdf > > > (see attached SPIP pdf for more details) > At the crossroads of big data and AI, we see both the success of Apache Spark > as a unified > analytics engine and the rise of AI frameworks like TensorFlow and Apache > MXNet (incubating). > Both big data and AI are indispensable components to drive business > innovation and there have > been multiple attempts from both communities to bring them together. > We saw efforts from the AI community to implement data solutions for AI > frameworks like tf.data and tf.Transform. However, with 50+ data sources and > built-in SQL, DataFrames, and Streaming features, Spark remains the community > choice for big data. This is why we saw many efforts to integrate DL/AI > frameworks with Spark to leverage its power, for example, TFRecords data > source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project > Hydrogen, this SPIP takes a different angle at Spark + AI unification. > None of the integrations are possible without exchanging data between Spark > and external DL/AI frameworks. And the performance matters. However, there > doesn’t exist a standard way to exchange data and hence implementation and > performance optimization fall into pieces. 
For example, TensorFlowOnSpark > uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and > save data and pass the RDD records to TensorFlow in Python. And TensorFrames > converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s > Java API. How can we reduce the complexity? > The proposal here is to standardize the data exchange interface (or format) > between Spark and DL/AI frameworks and optimize data conversion from/to this > interface. So DL/AI frameworks can leverage Spark to load data virtually > from anywhere without spending extra effort building complex data solutions, > like reading features from a production data warehouse or streaming model > inference. Spark users can
[jira] [Resolved] (SPARK-32920) Add support in Spark driver to coordinate the finalization of the push/merge phase in push-based shuffle for a given shuffle and the initiation of the reduce stage
[ https://issues.apache.org/jira/browse/SPARK-32920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-32920. - Resolution: Fixed > Add support in Spark driver to coordinate the finalization of the push/merge > phase in push-based shuffle for a given shuffle and the initiation of the > reduce stage > --- > > Key: SPARK-32920 > URL: https://issues.apache.org/jira/browse/SPARK-32920 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Venkata krishnan Sowrirajan >Priority: Major > Fix For: 3.2.0 > > > With push-based shuffle, we are currently decoupling map task executions from > the shuffle block push process. Thus, when all map tasks finish, we might > want to wait for some small extra time to allow more shuffle blocks to get > pushed and merged. This requires some extra coordination in the Spark driver > when it transitions from a shuffle map stage to the corresponding reduce > stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
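The coordination described above boils down to a bounded wait between the map stage finishing and the reduce stage starting. The sketch below illustrates that idea with a plain latch; it is hypothetical and is not Spark's DAGScheduler code:

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Hypothetical: the driver waits a small bounded interval for shuffle
// block pushes to finish merging before scheduling the reduce stage.
val pushesFinalized = new CountDownLatch(1)

// In the real flow, merge finalization would call countDown(); here
// nothing does, so the wait times out and scheduling proceeds anyway.
val finishedInTime = pushesFinalized.await(10, TimeUnit.MILLISECONDS)

assert(!finishedInTime) // timed out; the reduce stage starts regardless
```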
[jira] [Updated] (SPARK-32920) Add support in Spark driver to coordinate the finalization of the push/merge phase in push-based shuffle for a given shuffle and the initiation of the reduce stage
[ https://issues.apache.org/jira/browse/SPARK-32920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-32920: Fix Version/s: 3.2.0 > Add support in Spark driver to coordinate the finalization of the push/merge > phase in push-based shuffle for a given shuffle and the initiation of the > reduce stage > --- > > Key: SPARK-32920 > URL: https://issues.apache.org/jira/browse/SPARK-32920 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Venkata krishnan Sowrirajan >Priority: Major > Fix For: 3.2.0 > > > With push-based shuffle, we are currently decoupling map task executions from > the shuffle block push process. Thus, when all map tasks finish, we might > want to wait for some small extra time to allow more shuffle blocks to get > pushed and merged. This requires some extra coordination in the Spark driver > when it transitions from a shuffle map stage to the corresponding reduce > stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32920) Add support in Spark driver to coordinate the finalization of the push/merge phase in push-based shuffle for a given shuffle and the initiation of the reduce stage
[ https://issues.apache.org/jira/browse/SPARK-32920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-32920: --- Assignee: Venkata krishnan Sowrirajan > Add support in Spark driver to coordinate the finalization of the push/merge > phase in push-based shuffle for a given shuffle and the initiation of the > reduce stage > --- > > Key: SPARK-32920 > URL: https://issues.apache.org/jira/browse/SPARK-32920 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Venkata krishnan Sowrirajan >Priority: Major > > With push-based shuffle, we are currently decoupling map task executions from > the shuffle block push process. Thus, when all map tasks finish, we might > want to wait for some small extra time to allow more shuffle blocks to get > pushed and merged. This requires some extra coordination in the Spark driver > when it transitions from a shuffle map stage to the corresponding reduce > stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32709) Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)
[ https://issues.apache.org/jira/browse/SPARK-32709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361119#comment-17361119 ] Shashank Pedamallu commented on SPARK-32709: [~chengsu], thank you so much for the response. From my understanding, the only pending PR to get writing to Hive bucketed tables working is [https://github.com/apache/spark/pull/30003] However, it does not seem to lift the guard check applied here: [https://github.com/apache/spark/blob/44b695fbb06b0d89783b4838941c68543c5a5c8b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L182-L184] What else is pending? > Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2) > -- > > Key: SPARK-32709 > URL: https://issues.apache.org/jira/browse/SPARK-32709 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > > The Hive ORC/Parquet write code path is the same as the data source v1 code path > (FileFormatWriter). This JIRA is to add support for writing Hive ORC/Parquet > bucketed tables with hivehash. The change is to customize `bucketIdExpression` to > use hivehash when the table is a Hive bucketed table and the Hive version is > 1.x.y or 2.x.y. > > This will allow us to write Hive/Presto-compatible bucketed tables for Hive 1 and > 2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
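Once a hivehash value is computed for a row's bucket columns, the bucket id boils down to a non-negative modulus over the bucket count. The helper below is a hypothetical illustration of that last step only, not Spark's `bucketIdExpression` or the hivehash function itself:

```scala
// Hypothetical illustration of deriving a bucket id from a hash value.
// Hive expects a non-negative bucket id even when the hash is negative,
// so the Scala/Java remainder is shifted into [0, numBuckets).
def bucketId(hash: Int, numBuckets: Int): Int = {
  val r = hash % numBuckets
  if (r < 0) r + numBuckets else r
}

assert(bucketId(10, 4) == 2)
assert(bucketId(-7, 4) == 1) // -7 % 4 == -3 in Scala; shifted into range
```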
[jira] [Commented] (SPARK-35716) Support casting of timestamp without time zone to date type
[ https://issues.apache.org/jira/browse/SPARK-35716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361117#comment-17361117 ] Apache Spark commented on SPARK-35716: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32869 > Support casting of timestamp without time zone to date type > --- > > Key: SPARK-35716 > URL: https://issues.apache.org/jira/browse/SPARK-35716 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35716) Support casting of timestamp without time zone to date type
[ https://issues.apache.org/jira/browse/SPARK-35716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35716: Assignee: Gengliang Wang (was: Apache Spark) > Support casting of timestamp without time zone to date type > --- > > Key: SPARK-35716 > URL: https://issues.apache.org/jira/browse/SPARK-35716 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35716) Support casting of timestamp without time zone to date type
[ https://issues.apache.org/jira/browse/SPARK-35716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35716: Assignee: Apache Spark (was: Gengliang Wang) > Support casting of timestamp without time zone to date type > --- > > Key: SPARK-35716 > URL: https://issues.apache.org/jira/browse/SPARK-35716 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35716) Support casting of timestamp without time zone to date type
[ https://issues.apache.org/jira/browse/SPARK-35716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361116#comment-17361116 ] Apache Spark commented on SPARK-35716: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32869 > Support casting of timestamp without time zone to date type > --- > > Key: SPARK-35716 > URL: https://issues.apache.org/jira/browse/SPARK-35716 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35716) Support casting of timestamp without time zone to date type
Gengliang Wang created SPARK-35716: -- Summary: Support casting of timestamp without time zone to date type Key: SPARK-35716 URL: https://issues.apache.org/jira/browse/SPARK-35716 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Extend the Cast expression and support TimestampWithoutTZType in casting to DateType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35296) Dataset.observe fails with an assertion
[ https://issues.apache.org/jira/browse/SPARK-35296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35296: --- Assignee: Kousuke Saruta > Dataset.observe fails with an assertion > --- > > Key: SPARK-35296 > URL: https://issues.apache.org/jira/browse/SPARK-35296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Tanel Kiis >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.0.3, 3.2.0, 3.1.3 > > Attachments: 2021-05-03_18-34.png > > > I hit this assertion error when using dataset.observe: > {code} > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:208) ~[scala-library-2.12.10.jar:?] > at > org.apache.spark.sql.execution.AggregatingAccumulator.setState(AggregatingAccumulator.scala:204) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.sql.execution.CollectMetricsExec.$anonfun$doExecute$2(CollectMetricsExec.scala:72) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.sql.execution.CollectMetricsExec.$anonfun$doExecute$2$adapted(CollectMetricsExec.scala:71) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:125) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1$adapted(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1(TaskContextImpl.scala:137) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1$adapted(TaskContextImpl.scala:135) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > ~[scala-library-2.12.10.jar:?] 
> at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > ~[scala-library-2.12.10.jar:?] > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > ~[scala-library-2.12.10.jar:?] > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:135) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at org.apache.spark.scheduler.Task.run(Task.scala:147) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) > [spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > [spark-core_2.12-3.1.1.jar:3.1.1] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [?:1.8.0_282] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [?:1.8.0_282] > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282] > {code} > A workaround that I used was to add .coalesce(1) before calling this method. > It happens in a quite complex query and I have not been able to reproduce > this with a simpler query. > Added a screenshot of the debugger at the moment of the exception: > !2021-05-03_18-34.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35296) Dataset.observe fails with an assertion
[ https://issues.apache.org/jira/browse/SPARK-35296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35296. - Fix Version/s: 3.0.3 3.1.3 3.2.0 Resolution: Fixed Issue resolved by pull request 32786 [https://github.com/apache/spark/pull/32786] > Dataset.observe fails with an assertion > --- > > Key: SPARK-35296 > URL: https://issues.apache.org/jira/browse/SPARK-35296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Tanel Kiis >Priority: Major > Fix For: 3.2.0, 3.1.3, 3.0.3 > > Attachments: 2021-05-03_18-34.png > > > I hit this assertion error when using dataset.observe: > {code} > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:208) ~[scala-library-2.12.10.jar:?] > at > org.apache.spark.sql.execution.AggregatingAccumulator.setState(AggregatingAccumulator.scala:204) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.sql.execution.CollectMetricsExec.$anonfun$doExecute$2(CollectMetricsExec.scala:72) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.sql.execution.CollectMetricsExec.$anonfun$doExecute$2$adapted(CollectMetricsExec.scala:71) > ~[spark-sql_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:125) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1$adapted(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1(TaskContextImpl.scala:137) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1$adapted(TaskContextImpl.scala:135) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > 
~[scala-library-2.12.10.jar:?] > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > ~[scala-library-2.12.10.jar:?] > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > ~[scala-library-2.12.10.jar:?] > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:135) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at org.apache.spark.scheduler.Task.run(Task.scala:147) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > ~[spark-core_2.12-3.1.1.jar:3.1.1] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) > [spark-core_2.12-3.1.1.jar:3.1.1] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > [spark-core_2.12-3.1.1.jar:3.1.1] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [?:1.8.0_282] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [?:1.8.0_282] > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282] > {code} > A workaround that I used was to add .coalesce(1) before calling this method. > It happens in a quite complex query and I have not been able to reproduce > this with a simpler query. > Added a screenshot of the debugger at the moment of the exception: > !2021-05-03_18-34.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34861) Support nested column in Spark vectorized readers
[ https://issues.apache.org/jira/browse/SPARK-34861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361090#comment-17361090 ] Chao Sun commented on SPARK-34861: -- Synced with [~chengsu] offline and I will take over this JIRA. > Support nested column in Spark vectorized readers > - > > Key: SPARK-34861 > URL: https://issues.apache.org/jira/browse/SPARK-34861 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Minor > > This is the umbrella task to track the overall progress. The task is to > support nested column type in Spark vectorized reader, namely Parquet and > ORC. Currently both Parquet and ORC vectorized readers do not support nested > column type (struct, array and map). We implemented nested column vectorized > reader for FB-ORC in our internal fork of Spark. We are seeing performance > improvement compared to non-vectorized reader when reading nested columns. In > addition, this can also help improve the non-nested column performance when > reading non-nested and nested columns together in one query. > > Parquet: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173] > > ORC: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35715) Option "--files" with local:// prefix is not honoured for Spark on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361086#comment-17361086 ] Erik Krogen commented on SPARK-35715: - Not sure about k8s, but at least for YARN this is expected -- the {{local:}} prefix indicates that the file(s) shouldn't be copied because they're already present on the local filesystems. The {{file:}} scheme is supposed to be used to indicate something that's currently on your local FS but should be distributed for you. Actually, from my understanding, it looks like the bug is when you use {{jars}}, which shouldn't be copying anything since you've used the {{local}} scheme .. But again, not sure if k8s is supposed to work differently. Hopefully someone with more experience there can chime in. > Option "--files" with local:// prefix is not honoured for Spark on kubernetes > - > > Key: SPARK-35715 > URL: https://issues.apache.org/jira/browse/SPARK-35715 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.2, 3.1.2 >Reporter: Pardhu Madipalli >Priority: Major > > When we provide a local file as a dependency using "--files" option, the file > is not getting copied to work directories of executors. > h5. Example 1: > > {code:java} > $SPARK_HOME/bin/spark-submit --master k8s://https:// \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ > --files local:///etc/xattr.conf \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 > {code} > > h6. 
Content of Spark Executor work-dir: > > {code:java} > ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls > /opt/spark/work-dir/ > spark-examples_2.12-3.1.2.jar > {code} > > We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to > _/opt/spark/work-dir/ ._ > > > > Instead of using "--files", if we use the "--jars" option the file is getting > copied as expected. > h5. Example 2: > {code:java} > $SPARK_HOME/bin/spark-submit --master k8s://https:// \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ > --jars local:///etc/xattr.conf \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 > {code} > h6. Content of Spark Executor work-dir: > > {code:java} > ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls > /opt/spark/work-dir/ > spark-examples_2.12-3.1.2.jar > xattr.conf > {code} > We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to > _/opt/spark/work-dir/ ._ > > I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way > in both cases.
[jira] [Updated] (SPARK-35715) Option "--files" with local:// prefix is not honoured for Spark on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pardhu Madipalli updated SPARK-35715: - Description: When we provide a local file as a dependency using "--files" option, the file is not getting copied to work directories of executors. h5. Example 1: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --files local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar {code} We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to _/opt/spark/work-dir/ ._ {{Instead of using "–files", if we use "--jars" option the file is getting copied as expected.}} h5. Example 2: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --jars local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar xattr.conf {code} We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to _/opt/spark/work-dir/ ._ I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way in both cases. 
was: When we provide a local file as a dependency using "--files" option, the file is not getting copied to work directories of executors. h5. Example 1: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --files local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar {code} We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to _/opt/spark/work-dir/ ._ Instead of using "--files", if we use "--jars" option the file is getting copied as expected. h5. Example 2: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --jars local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar xattr.conf {code} We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to _/opt/spark/work-dir/ ._ I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way in both cases. 
> Option "--files" with local:// prefix is not honoured for Spark on kubernetes > - > > Key: SPARK-35715 > URL: https://issues.apache.org/jira/browse/SPARK-35715 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.2, 3.1.2 >Reporter: Pardhu Madipalli >Priority: Major > > When we provide a local file as a dependency using "--files" option, the file > is not getting copied to work directories of executors. > h5. Example 1: > > {code:java} > $SPARK_HOME/bin/spark-submit --master k8s://https:// \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ > --files local:///etc/xattr.conf \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 > {code} > > h6. Content of Spark Executor work-dir: > > {code:java} > ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls > /opt/spark/work-dir/ > spark-examples_2.12-3.1.2.jar > {code} > >
[jira] [Updated] (SPARK-35715) Option "--files" with local:// prefix is not honoured for Spark on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pardhu Madipalli updated SPARK-35715: - Description: When we provide a local file as a dependency using "--files" option, the file is not getting copied to work directories of executors. h5. Example 1: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --files local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar {code} We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to _/opt/spark/work-dir/ ._ Instead of using "--files", if we use "--jars" option the file is getting copied as expected. h5. Example 2: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --jars local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar xattr.conf {code} We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to _/opt/spark/work-dir/ ._ I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way in both cases. was: When we provide a local file as a dependency using "--files" option, the file is not getting copied to work directories of executors. 
h5. Example 1: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --files local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar {code} We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to _/opt/spark/work-dir/ ._ Instead of using "--files", if we use "--jars" option the file is getting copied as expected. h5. Example 2: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --jars local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar xattr.conf {code} We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to _/opt/spark/work-dir/ ._ I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way in both cases. > Option "--files" with local:// prefix is not honoured for Spark on kubernetes > - > > Key: SPARK-35715 > URL: https://issues.apache.org/jira/browse/SPARK-35715 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.2, 3.1.2 >Reporter: Pardhu Madipalli >Priority: Major > > When we provide a local file as a dependency using "--files" option, the file > is not getting copied to work directories of executors. > h5. 
Example 1: > > {code:java} > $SPARK_HOME/bin/spark-submit --master k8s://https:// \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ > --files local:///etc/xattr.conf \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 > {code} > > h6. Content of Spark Executor work-dir: > > {code:java} > ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls > /opt/spark/work-dir/ > spark-examples_2.12-3.1.2.jar > {code} > > We
[jira] [Created] (SPARK-35715) Option "--files" with local:// prefix is not honoured for Spark on kubernetes
Pardhu Madipalli created SPARK-35715: Summary: Option "--files" with local:// prefix is not honoured for Spark on kubernetes Key: SPARK-35715 URL: https://issues.apache.org/jira/browse/SPARK-35715 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.1.2, 3.0.2 Reporter: Pardhu Madipalli When we provide a local file as a dependency using "--files" option, the file is not getting copied to work directories of executors. h5. Example 1: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --files local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar {code} We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to _/opt/spark/work-dir/ ._ Instead of using "--files", if we use "--jars" option the file is getting copied as expected. h5. Example 2: {code:java} $SPARK_HOME/bin/spark-submit --master k8s://https:// \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image= \ --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ --jars local:///etc/xattr.conf \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100 {code} h6. Content of Spark Executor work-dir: {code:java} ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls /opt/spark/work-dir/ spark-examples_2.12-3.1.2.jar xattr.conf {code} We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to _/opt/spark/work-dir/ ._ I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way in both cases. 
[jira] [Resolved] (SPARK-35653) [SQL] CatalystToExternalMap interpreted path fails for Map with case classes as keys or values
[ https://issues.apache.org/jira/browse/SPARK-35653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-35653. - Fix Version/s: 3.0.3 3.1.3 3.2.0 Resolution: Fixed Issue resolved by pull request 32783 [https://github.com/apache/spark/pull/32783] > [SQL] CatalystToExternalMap interpreted path fails for Map with case classes > as keys or values > -- > > Key: SPARK-35653 > URL: https://issues.apache.org/jira/browse/SPARK-35653 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.2, 3.2.0 >Reporter: Emil Ejbyfeldt >Assignee: Emil Ejbyfeldt >Priority: Major > Fix For: 3.2.0, 3.1.3, 3.0.3 > > > Interpreted path deserialization fails for Map with case classes as keys or > values while the codegen path works correctly. > To reproduce the issue one can add test cases to the ExpressionEncoderSuite. > For example adding the following > {noformat} > case class IntAndString(i: Int, s: String) > encodeDecodeTest(Map(1 -> IntAndString(1, "a")), "map with case class as > value") > {noformat} > It will succeed for the code gen path while the interpreted path will fail > with > {noformat} > [info] - encode/decode for map with case class as value: Map(1 -> > IntAndString(1,a)) (interpreted path) *** FAILED *** (64 milliseconds) > [info] Encoded/Decoded data does not match input data > [info] > [info] in: Map(1 -> IntAndString(1,a)) > [info] out: Map(1 -> [1,a]) > [info] types: scala.collection.immutable.Map$Map1 [info] > [info] Encoded Data: > [org.apache.spark.sql.catalyst.expressions.UnsafeMapData@5ecf5d9e] > [info] Schema: value#823 > [info] root > [info] -- value: map (nullable = true) > [info] |-- key: integer > [info] |-- value: struct (valueContainsNull = true) > [info] | |-- i: integer (nullable = false) > [info] | |-- s: string (nullable = true) > [info] > [info] > [info] fromRow Expressions: > [info] catalysttoexternalmap(lambdavariable(CatalystToExternalMap_key, > IntegerType, false, 178), 
lambdavariable(CatalystToExternalMap_key, > IntegerType, false, 178), lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179), > if (isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179))) null else newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString), input[0, > map>, true], interface > scala.collection.immutable.Map > [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) > [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) > [info] :- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] :- if (isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179))) null else newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString) > [info] : :- isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)) > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] : :- null > [info] : +- newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString) > [info] : :- assertnotnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179).i) > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179).s.toString > [info] : 
+- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s > [info] : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] +- input[0, map>, true] > (ExpressionEncoderSuite.scala:627) > {noformat} > So the value was not correctly deserialized in the interpreted path. > I have prepared a PR that I will submit for fixing this issue.
[jira] [Assigned] (SPARK-35653) [SQL] CatalystToExternalMap interpreted path fails for Map with case classes as keys or values
[ https://issues.apache.org/jira/browse/SPARK-35653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-35653: --- Assignee: Emil Ejbyfeldt > [SQL] CatalystToExternalMap interpreted path fails for Map with case classes > as keys or values > -- > > Key: SPARK-35653 > URL: https://issues.apache.org/jira/browse/SPARK-35653 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.2, 3.2.0 >Reporter: Emil Ejbyfeldt >Assignee: Emil Ejbyfeldt >Priority: Major > > Interpreted path deserialization fails for Map with case classes as keys or > values while the codegen path works correctly. > To reproduce the issue one can add test cases to the ExpressionEncoderSuite. > For example adding the following > {noformat} > case class IntAndString(i: Int, s: String) > encodeDecodeTest(Map(1 -> IntAndString(1, "a")), "map with case class as > value") > {noformat} > It will succeed for the code gen path while the interpreted path will fail > with > {noformat} > [info] - encode/decode for map with case class as value: Map(1 -> > IntAndString(1,a)) (interpreted path) *** FAILED *** (64 milliseconds) > [info] Encoded/Decoded data does not match input data > [info] > [info] in: Map(1 -> IntAndString(1,a)) > [info] out: Map(1 -> [1,a]) > [info] types: scala.collection.immutable.Map$Map1 [info] > [info] Encoded Data: > [org.apache.spark.sql.catalyst.expressions.UnsafeMapData@5ecf5d9e] > [info] Schema: value#823 > [info] root > [info] -- value: map (nullable = true) > [info] |-- key: integer > [info] |-- value: struct (valueContainsNull = true) > [info] | |-- i: integer (nullable = false) > [info] | |-- s: string (nullable = true) > [info] > [info] > [info] fromRow Expressions: > [info] catalysttoexternalmap(lambdavariable(CatalystToExternalMap_key, > IntegerType, false, 178), lambdavariable(CatalystToExternalMap_key, > IntegerType, false, 178), lambdavariable(CatalystToExternalMap_value, > 
StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179), > if (isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179))) null else newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString), input[0, > map>, true], interface > scala.collection.immutable.Map > [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) > [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) > [info] :- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] :- if (isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179))) null else newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString) > [info] : :- isnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)) > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] : :- null > [info] : +- newInstance(class > org.apache.spark.sql.catalyst.encoders.IntAndString) > [info] : :- assertnotnull(lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179).i) > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i > [info] : : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, > 179).s.toString > [info] : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), 
true, 179).s > [info] : +- lambdavariable(CatalystToExternalMap_value, > StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) > [info] +- input[0, map>, true] > (ExpressionEncoderSuite.scala:627) > {noformat} > So the value was not correctly deserialized in the interpreted path. > I have prepared a PR that I will submit for fixing this issue.
[jira] [Resolved] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
[ https://issues.apache.org/jira/browse/SPARK-35711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-35711. Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32864 [https://github.com/apache/spark/pull/32864] > Support casting of timestamp without time zone to timestamp type > > > Key: SPARK-35711 > URL: https://issues.apache.org/jira/browse/SPARK-35711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0 > > > Extend the Cast expression and support TimestampWithoutTZType in casting to > TimestampType.
[jira] [Commented] (SPARK-35714) Bug fix for deadlock during the executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-35714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360981#comment-17360981 ] Apache Spark commented on SPARK-35714: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/32868 > Bug fix for deadlock during the executor shutdown > - > > Key: SPARK-35714 > URL: https://issues.apache.org/jira/browse/SPARK-35714 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When an executor receives a TERM signal, the (second) TERM signal will lock > the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. > Shutdown will call SparkShutdownHook to shut down the executor. > During the executor shutdown phase, a RemoteProcessDisconnected event will be > sent to the RPC inbox, and then WorkerWatcher will try to call > System.exit(-1) again. > Because java.lang.Shutdown is already locked, a deadlock occurs.
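The deadlock described above is a general JVM shutdown pattern, not something Spark-specific: the JVM holds the java.lang.Shutdown class monitor while shutdown hooks run, so any second call to System.exit() blocks forever on that same monitor. The following sketch stands in an explicit `ReentrantLock` for the Shutdown class lock so the shape of the problem can be observed without actually hanging; the class and method names are illustrative, not taken from the Spark code base.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of the SPARK-35714 deadlock shape. The "shutdown" lock
// stands in for the java.lang.Shutdown class monitor, which the JVM holds
// while shutdown hooks run. A second exit attempt from another thread (here,
// a stand-in for WorkerWatcher calling System.exit(-1)) then blocks on the
// same lock. tryLock() with a timeout lets the demo observe the contention
// instead of deadlocking for real.
class ShutdownDeadlockDemo {
    private static final ReentrantLock SHUTDOWN_LOCK = new ReentrantLock();

    /** Returns true if a second exit attempt could not take the shutdown lock. */
    static boolean secondExitBlocks() {
        SHUTDOWN_LOCK.lock();  // first TERM signal: shutdown hooks are running
        AtomicBoolean blocked = new AtomicBoolean(false);
        Thread watcher = new Thread(() -> {
            // Stand-in for WorkerWatcher reacting to RemoteProcessDisconnected
            // and calling System.exit(-1): it needs the same shutdown lock.
            try {
                if (!SHUTDOWN_LOCK.tryLock(100, TimeUnit.MILLISECONDS)) {
                    blocked.set(true);  // the real code would wait here forever
                } else {
                    SHUTDOWN_LOCK.unlock();
                }
            } catch (InterruptedException ignored) { }
        });
        watcher.start();
        try {
            watcher.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        SHUTDOWN_LOCK.unlock();
        return blocked.get();
    }
}
```

Because the main thread holds the lock for the watcher's entire lifetime, `secondExitBlocks()` deterministically returns true, mirroring why the executor's second exit attempt never completes.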
[jira] [Assigned] (SPARK-35713) Bug fix for thread leak in JobCancellationSuite
[ https://issues.apache.org/jira/browse/SPARK-35713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35713: Assignee: (was: Apache Spark) > Bug fix for thread leak in JobCancellationSuite > --- > > Key: SPARK-35713 > URL: https://issues.apache.org/jira/browse/SPARK-35713 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When we call the Thread.interrupt() method, that thread's interrupt status will > be set, but the thread may not actually be interrupted. > So when a Spark task runs in an infinite loop, the Spark context may fail to > interrupt the task thread, resulting in a thread leak. >
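The interrupt behaviour described above is plain JDK semantics: Thread.interrupt() only sets a status flag, and a thread that never checks the flag and never calls a blocking (interruptible) API keeps running. A minimal standalone sketch (the class and method names are illustrative, not from Spark):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch: Thread.interrupt() only sets the thread's interrupt
// status. A busy loop that never checks Thread.interrupted() and never calls
// a blocking API -- like a task stuck in an infinite loop in user code -- is
// not stopped by it, which is how the thread leaks.
class InterruptDemo {
    /** Returns true if the spinning thread is still alive after interrupt(). */
    static boolean survivesInterrupt() {
        AtomicBoolean stop = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            while (!stop.get()) { /* spin, ignoring the interrupt status */ }
        });
        worker.start();
        worker.interrupt();          // only sets the thread's interrupt status
        try {
            worker.join(200);        // give the thread a chance to (not) exit
            boolean alive = worker.isAlive();
            stop.set(true);          // a cooperative flag is what really stops it
            worker.join();
            return alive;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The worker survives the interrupt and only terminates once the cooperative `stop` flag is set, which is why a cancellation path that relies solely on interrupt() can leave test threads behind.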
[jira] [Assigned] (SPARK-35714) Bug fix for deadlock during the executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-35714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35714: Assignee: Apache Spark > Bug fix for deadlock during the executor shutdown > - > > Key: SPARK-35714 > URL: https://issues.apache.org/jira/browse/SPARK-35714 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Minor > > When an executor receives a TERM signal, the (second) TERM signal will lock > the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. > Shutdown will call SparkShutdownHook to shut down the executor. > During the executor shutdown phase, a RemoteProcessDisconnected event will be > sent to the RPC inbox, and then WorkerWatcher will try to call > System.exit(-1) again. > Because java.lang.Shutdown is already locked, a deadlock occurs.
[jira] [Commented] (SPARK-35714) Bug fix for deadlock during the executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-35714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360977#comment-17360977 ] Apache Spark commented on SPARK-35714: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/32868 > Bug fix for deadlock during the executor shutdown > - > > Key: SPARK-35714 > URL: https://issues.apache.org/jira/browse/SPARK-35714 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When an executor receives a TERM signal, the (second) TERM signal will lock > the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. > Shutdown will call SparkShutdownHook to shut down the executor. > During the executor shutdown phase, a RemoteProcessDisconnected event will be > sent to the RPC inbox, and then WorkerWatcher will try to call > System.exit(-1) again. > Because java.lang.Shutdown is already locked, a deadlock occurs.
[jira] [Assigned] (SPARK-35713) Bug fix for thread leak in JobCancellationSuite
[ https://issues.apache.org/jira/browse/SPARK-35713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35713: Assignee: Apache Spark > Bug fix for thread leak in JobCancellationSuite > --- > > Key: SPARK-35713 > URL: https://issues.apache.org/jira/browse/SPARK-35713 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Minor > > When we call the Thread.interrupt() method, that thread's interrupt status will > be set, but the thread may not actually be interrupted. > So when a Spark task runs in an infinite loop, the Spark context may fail to > interrupt the task thread, resulting in a thread leak. >
[jira] [Assigned] (SPARK-35714) Bug fix for deadlock during the executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-35714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35714: Assignee: (was: Apache Spark) > Bug fix for deadlock during the executor shutdown > - > > Key: SPARK-35714 > URL: https://issues.apache.org/jira/browse/SPARK-35714 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When an executor receives a TERM signal, the (second) TERM signal will lock > the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. > Shutdown will call SparkShutdownHook to shut down the executor. > During the executor shutdown phase, a RemoteProcessDisconnected event will be > sent to the RPC inbox, and then WorkerWatcher will try to call > System.exit(-1) again. > Because java.lang.Shutdown is already locked, a deadlock occurs.
[jira] [Commented] (SPARK-35713) Bug fix for thread leak in JobCancellationSuite
[ https://issues.apache.org/jira/browse/SPARK-35713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360976#comment-17360976 ] Apache Spark commented on SPARK-35713: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/32866 > Bug fix for thread leak in JobCancellationSuite > --- > > Key: SPARK-35713 > URL: https://issues.apache.org/jira/browse/SPARK-35713 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When we call the Thread.interrupt() method, that thread's interrupt status will > be set, but the thread may not actually be interrupted. > So when a Spark task runs in an infinite loop, the Spark context may fail to > interrupt the task thread, resulting in a thread leak. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35714) Bug fix for deadlock during the executor shutdown
[ https://issues.apache.org/jira/browse/SPARK-35714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-35714: Description: When an executor receives a TERM signal, it (the second TERM signal) will lock the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. Shutdown will call SparkShutdownHook to shut down the executor. During the executor shutdown phase, a RemoteProcessDisconnected event will be sent to the RPC inbox, and then WorkerWatcher will try to call System.exit(-1) again. Because java.lang.Shutdown is already locked, a deadlock occurs. was: WHen a executor recived a TERM signal, it (the second TERM signal) will lock java.lang.Shutdown class and then call Shutdown.exit() method to exit the JVM. Shutdown will call SparkShutdownHook to shutdown the executor. During the executor shutdown phase, RemoteProcessDisconnected event will be send to the RPC inbox, and then WorkerWatcher will try to call System.exit(-1) again. Because java.lang.Shutdown has already locked, a deadlock has occurred. > Bug fix for deadlock during the executor shutdown > - > > Key: SPARK-35714 > URL: https://issues.apache.org/jira/browse/SPARK-35714 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When an executor receives a TERM signal, it (the second TERM signal) will lock > the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. > Shutdown will call SparkShutdownHook to shut down the executor. > During the executor shutdown phase, a RemoteProcessDisconnected event will be > sent to the RPC inbox, and then WorkerWatcher will try to call > System.exit(-1) again. > Because java.lang.Shutdown is already locked, a deadlock occurs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35714) Bug fix for deadlock during the executor shutdown
Wan Kun created SPARK-35714: --- Summary: Bug fix for deadlock during the executor shutdown Key: SPARK-35714 URL: https://issues.apache.org/jira/browse/SPARK-35714 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.2 Reporter: Wan Kun When an executor receives a TERM signal, it (the second TERM signal) will lock the java.lang.Shutdown class and then call the Shutdown.exit() method to exit the JVM. Shutdown will call SparkShutdownHook to shut down the executor. During the executor shutdown phase, a RemoteProcessDisconnected event will be sent to the RPC inbox, and then WorkerWatcher will try to call System.exit(-1) again. Because java.lang.Shutdown is already locked, a deadlock occurs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
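The deadlock described in SPARK-35714 can be illustrated with a minimal standalone sketch. Note this is a hypothetical demo class, not the Spark/WorkerWatcher code: System.exit() synchronizes on the internal java.lang.Shutdown class while running shutdown hooks, so a second System.exit() issued from inside a hook blocks forever on the lock the in-progress exit already holds. Runtime.halt() terminates the JVM without running hooks or taking that lock, which is one way such code can avoid the hang.

```java
// Hypothetical demo (not Spark code): why calling System.exit() from a
// shutdown hook deadlocks, and how Runtime.halt() sidesteps it.
public class ShutdownDeadlockSketch {
    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.out.println("hook running");
            // System.exit(-1) here would never return: Shutdown.exit() is
            // synchronized on the Shutdown class, and the exit in progress
            // already holds that lock -> deadlock.
            Runtime.getRuntime().halt(0); // terminates immediately, no lock taken
        }));
        System.out.println("first exit");
        System.exit(0); // acquires the Shutdown lock and runs the hooks
    }
}
```

Running the sketch prints "first exit" then "hook running" and terminates; replacing halt(0) with System.exit(-1) makes the process hang, which matches the reported behavior.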
[jira] [Updated] (SPARK-35713) Bug fix for thread leak in JobCancellationSuite
[ https://issues.apache.org/jira/browse/SPARK-35713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-35713: Summary: Bug fix for thread leak in JobCancellationSuite (was: Bugfix for thread leak in JobCancellationSuite) > Bug fix for thread leak in JobCancellationSuite > --- > > Key: SPARK-35713 > URL: https://issues.apache.org/jira/browse/SPARK-35713 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When we call the Thread.interrupt() method, that thread's interrupt status will > be set, but the thread may not actually be interrupted. > So when a Spark task runs in an infinite loop, the Spark context may fail to > interrupt the task thread, resulting in a thread leak. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35713) Bugfix for thread leak in JobCancellationSuite
[ https://issues.apache.org/jira/browse/SPARK-35713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-35713: Description: When we call the Thread.interrupt() method, that thread's interrupt status will be set, but the thread may not actually be interrupted. So when a Spark task runs in an infinite loop, the Spark context may fail to interrupt the task thread, resulting in a thread leak. was: When the task runs in an infinite loop, spark context may fail to interrupt the task thread, resulting in thread leak. > Bugfix for thread leak in JobCancellationSuite > -- > > Key: SPARK-35713 > URL: https://issues.apache.org/jira/browse/SPARK-35713 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Wan Kun >Priority: Minor > > When we call the Thread.interrupt() method, that thread's interrupt status will > be set, but the thread may not actually be interrupted. > So when a Spark task runs in an infinite loop, the Spark context may fail to > interrupt the task thread, resulting in a thread leak. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35713) Bugfix for thread leak in JobCancellationSuite
Wan Kun created SPARK-35713: --- Summary: Bugfix for thread leak in JobCancellationSuite Key: SPARK-35713 URL: https://issues.apache.org/jira/browse/SPARK-35713 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.2 Reporter: Wan Kun When the task runs in an infinite loop, the Spark context may fail to interrupt the task thread, resulting in a thread leak. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
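The interrupt behavior behind SPARK-35713 can be reproduced with a minimal standalone JVM sketch (a hypothetical demo, not Spark or JobCancellationSuite code): Thread.interrupt() only sets the target thread's interrupt status flag, so a busy loop that never checks that flag keeps running after interrupt() and the thread leaks.

```java
// Hypothetical demo (not Spark code): interrupt() sets a flag but does not
// stop a busy loop that never checks it.
public class InterruptLeakDemo {
    static volatile boolean stop = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            long n = 0;
            while (!stop) {      // never consults Thread.interrupted()
                n++;
            }
        });
        worker.start();
        worker.interrupt();      // sets the interrupt status, nothing more
        worker.join(200);        // times out: the loop ignores the flag
        System.out.println(worker.isInterrupted()); // true: flag is set
        System.out.println(worker.isAlive());       // true: still looping -> leak
        stop = true;             // cooperative shutdown actually stops it
        worker.join();
    }
}
```

Interruption only works for code that cooperates, either by blocking on interruptible calls or by polling the flag, which is why a task stuck in a tight loop can survive cancellation.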
[jira] [Assigned] (SPARK-35712) Simplify ResolveAggregateFunctions
[ https://issues.apache.org/jira/browse/SPARK-35712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35712: Assignee: Apache Spark > Simplify ResolveAggregateFunctions > -- > > Key: SPARK-35712 > URL: https://issues.apache.org/jira/browse/SPARK-35712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35712) Simplify ResolveAggregateFunctions
[ https://issues.apache.org/jira/browse/SPARK-35712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35712: Assignee: (was: Apache Spark) > Simplify ResolveAggregateFunctions > -- > > Key: SPARK-35712 > URL: https://issues.apache.org/jira/browse/SPARK-35712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35712) Simplify ResolveAggregateFunctions
[ https://issues.apache.org/jira/browse/SPARK-35712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360892#comment-17360892 ] Apache Spark commented on SPARK-35712: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/32470 > Simplify ResolveAggregateFunctions > -- > > Key: SPARK-35712 > URL: https://issues.apache.org/jira/browse/SPARK-35712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35712) Simplify ResolveAggregateFunctions
Wenchen Fan created SPARK-35712: --- Summary: Simplify ResolveAggregateFunctions Key: SPARK-35712 URL: https://issues.apache.org/jira/browse/SPARK-35712 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35701) Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys
[ https://issues.apache.org/jira/browse/SPARK-35701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35701: Assignee: Haiyang Sun (was: Apache Spark) > Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys > --- > > Key: SPARK-35701 > URL: https://issues.apache.org/jira/browse/SPARK-35701 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Haiyang Sun >Assignee: Haiyang Sun >Priority: Major > > The global locks used to protect {{SQLConf.sqlConfEntries}} map and the > {{SQLConf.staticConfKeys}} set can cause significant contention on the > SQLConf instance in a concurrent setting. > Using copy-on-write versions should reduce the contention given that > modifications to the configs are relatively rare. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35701) Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys
[ https://issues.apache.org/jira/browse/SPARK-35701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35701: Assignee: Apache Spark (was: Haiyang Sun) > Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys > --- > > Key: SPARK-35701 > URL: https://issues.apache.org/jira/browse/SPARK-35701 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Haiyang Sun >Assignee: Apache Spark >Priority: Major > > The global locks used to protect {{SQLConf.sqlConfEntries}} map and the > {{SQLConf.staticConfKeys}} set can cause significant contention on the > SQLConf instance in a concurrent setting. > Using copy-on-write versions should reduce the contention given that > modifications to the configs are relatively rare. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35701) Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys
[ https://issues.apache.org/jira/browse/SPARK-35701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360881#comment-17360881 ] Apache Spark commented on SPARK-35701: -- User 'haiyangsun-db' has created a pull request for this issue: https://github.com/apache/spark/pull/32865 > Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys > --- > > Key: SPARK-35701 > URL: https://issues.apache.org/jira/browse/SPARK-35701 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Haiyang Sun >Assignee: Haiyang Sun >Priority: Major > > The global locks used to protect {{SQLConf.sqlConfEntries}} map and the > {{SQLConf.staticConfKeys}} set can cause significant contention on the > SQLConf instance in a concurrent setting. > Using copy-on-write versions should reduce the contention given that > modifications to the configs are relatively rare. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35701) Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys
[ https://issues.apache.org/jira/browse/SPARK-35701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35701: Assignee: Apache Spark (was: Haiyang Sun) > Contention on SQLConf.sqlConfEntries and SQLConf.staticConfKeys > --- > > Key: SPARK-35701 > URL: https://issues.apache.org/jira/browse/SPARK-35701 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Haiyang Sun >Assignee: Apache Spark >Priority: Major > > The global locks used to protect {{SQLConf.sqlConfEntries}} map and the > {{SQLConf.staticConfKeys}} set can cause significant contention on the > SQLConf instance in a concurrent setting. > Using copy-on-write versions should reduce the contention given that > modifications to the configs are relatively rare. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
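The copy-on-write approach proposed for SPARK-35701 can be sketched with a small hypothetical registry (not Spark's actual SQLConf code; the class and key names are illustrative): reads go through a volatile reference to a snapshot map and take no lock, while rare writes copy the map under a lock and publish the new snapshot.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Spark's SQLConf) of the copy-on-write registry
// pattern: lock-free reads of an effectively-immutable snapshot, with
// rare writes that copy, mutate, and republish under a lock.
public class CowRegistryDemo {
    private volatile Map<String, String> entries = new HashMap<>();

    String get(String key) {
        return entries.get(key);             // lock-free read path
    }

    synchronized void put(String key, String value) {
        Map<String, String> copy = new HashMap<>(entries); // copy on write
        copy.put(key, value);
        entries = copy;                      // publish the new snapshot
    }

    public static void main(String[] args) {
        CowRegistryDemo registry = new CowRegistryDemo();
        registry.put("example.conf.key", "42"); // illustrative key, not a real config
        System.out.println(registry.get("example.conf.key"));
    }
}
```

The trade-off is that each write copies the whole map, which is acceptable exactly when, as the issue notes, modifications are rare relative to reads.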
[jira] [Commented] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
[ https://issues.apache.org/jira/browse/SPARK-35711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360759#comment-17360759 ] Apache Spark commented on SPARK-35711: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32864 > Support casting of timestamp without time zone to timestamp type > > > Key: SPARK-35711 > URL: https://issues.apache.org/jira/browse/SPARK-35711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > TimestampType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
[ https://issues.apache.org/jira/browse/SPARK-35711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360758#comment-17360758 ] Apache Spark commented on SPARK-35711: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32864 > Support casting of timestamp without time zone to timestamp type > > > Key: SPARK-35711 > URL: https://issues.apache.org/jira/browse/SPARK-35711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > TimestampType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
[ https://issues.apache.org/jira/browse/SPARK-35711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35711: Assignee: Apache Spark (was: Gengliang Wang) > Support casting of timestamp without time zone to timestamp type > > > Key: SPARK-35711 > URL: https://issues.apache.org/jira/browse/SPARK-35711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > TimestampType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
[ https://issues.apache.org/jira/browse/SPARK-35711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35711: Assignee: Gengliang Wang (was: Apache Spark) > Support casting of timestamp without time zone to timestamp type > > > Key: SPARK-35711 > URL: https://issues.apache.org/jira/browse/SPARK-35711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Extend the Cast expression and support TimestampWithoutTZType in casting to > TimestampType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35711) Support casting of timestamp without time zone to timestamp type
Gengliang Wang created SPARK-35711: -- Summary: Support casting of timestamp without time zone to timestamp type Key: SPARK-35711 URL: https://issues.apache.org/jira/browse/SPARK-35711 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Extend the Cast expression and support TimestampWithoutTZType in casting to TimestampType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35652) Different Behaviour join vs joinWith in self joining
[ https://issues.apache.org/jira/browse/SPARK-35652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35652: Assignee: (was: Apache Spark) > Different Behaviour join vs joinWith in self joining > > > Key: SPARK-35652 > URL: https://issues.apache.org/jira/browse/SPARK-35652 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: {color:#172b4d}Spark 3.1.2{color} > Scala 2.12 >Reporter: Wassim Almaaoui >Priority: Critical > > It seems that a Spark inner self join performs a cartesian join when using > `joinWith`, but an inner join when using `join`. > Snippet: > > {code:java} > scala> val df = spark.range(0,5) > df: org.apache.spark.sql.Dataset[Long] = [id: bigint] > scala> df.show > +---+ > | id| > +---+ > | 0| > | 1| > | 2| > | 3| > | 4| > +---+ > scala> df.join(df, df("id") === df("id")).count > 21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res21: Long = 5 > scala> df.joinWith(df, df("id") === df("id")).count > 21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res22: Long = 25 > {code} > According to the comment in the source code, joinWith is expected to handle this > case, right? > {code:java} > def joinWith[U](other: Dataset[U], condition: Column, joinType: String): > Dataset[(T, U)] = { > // Creates a Join node and resolve it first, to get join condition > resolved, self-join resolved, > // etc. > {code} > I find it odd that join and joinWith don't have the same behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35652) Different Behaviour join vs joinWith in self joining
[ https://issues.apache.org/jira/browse/SPARK-35652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360747#comment-17360747 ] Apache Spark commented on SPARK-35652: -- User 'dgd-contributor' has created a pull request for this issue: https://github.com/apache/spark/pull/32863 > Different Behaviour join vs joinWith in self joining > > > Key: SPARK-35652 > URL: https://issues.apache.org/jira/browse/SPARK-35652 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: {color:#172b4d}Spark 3.1.2{color} > Scala 2.12 >Reporter: Wassim Almaaoui >Priority: Critical > > It seems that a Spark inner self join performs a cartesian join when using > `joinWith`, but an inner join when using `join`. > Snippet: > > {code:java} > scala> val df = spark.range(0,5) > df: org.apache.spark.sql.Dataset[Long] = [id: bigint] > scala> df.show > +---+ > | id| > +---+ > | 0| > | 1| > | 2| > | 3| > | 4| > +---+ > scala> df.join(df, df("id") === df("id")).count > 21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res21: Long = 5 > scala> df.joinWith(df, df("id") === df("id")).count > 21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res22: Long = 25 > {code} > According to the comment in the source code, joinWith is expected to handle this > case, right? > {code:java} > def joinWith[U](other: Dataset[U], condition: Column, joinType: String): > Dataset[(T, U)] = { > // Creates a Join node and resolve it first, to get join condition > resolved, self-join resolved, > // etc. > {code} > I find it odd that join and joinWith don't have the same behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35652) Different Behaviour join vs joinWith in self joining
[ https://issues.apache.org/jira/browse/SPARK-35652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35652: Assignee: Apache Spark > Different Behaviour join vs joinWith in self joining > > > Key: SPARK-35652 > URL: https://issues.apache.org/jira/browse/SPARK-35652 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: {color:#172b4d}Spark 3.1.2{color} > Scala 2.12 >Reporter: Wassim Almaaoui >Assignee: Apache Spark >Priority: Critical > > It seems that a Spark inner self join performs a cartesian join when using > `joinWith`, but an inner join when using `join`. > Snippet: > > {code:java} > scala> val df = spark.range(0,5) > df: org.apache.spark.sql.Dataset[Long] = [id: bigint] > scala> df.show > +---+ > | id| > +---+ > | 0| > | 1| > | 2| > | 3| > | 4| > +---+ > scala> df.join(df, df("id") === df("id")).count > 21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res21: Long = 5 > scala> df.joinWith(df, df("id") === df("id")).count > 21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, > 'id#1649L = id#1649L'. Perhaps you need to use aliases. > res22: Long = 25 > {code} > According to the comment in the source code, joinWith is expected to handle this > case, right? > {code:java} > def joinWith[U](other: Dataset[U], condition: Column, joinType: String): > Dataset[(T, U)] = { > // Creates a Join node and resolve it first, to get join condition > resolved, self-join resolved, > // etc. > {code} > I find it odd that join and joinWith don't have the same behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35695) QueryExecutionListener does not see any observed metrics fired before persist/cache
[ https://issues.apache.org/jira/browse/SPARK-35695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35695: Assignee: (was: Apache Spark) > QueryExecutionListener does not see any observed metrics fired before > persist/cache > --- > > Key: SPARK-35695 > URL: https://issues.apache.org/jira/browse/SPARK-35695 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Major > > This example properly fires the event > {code} > spark.range(100) > .observe( > name = "other_event", > avg($"id").cast("int").as("avg_val")) > .collect() > {code} > But when I add persist, then no event is fired or seen (not sure which): > {code} > spark.range(100) > .observe( > name = "my_event", > avg($"id").cast("int").as("avg_val")) > .persist() > .collect() > {code} > The listener: > {code} > val metricMaps = ArrayBuffer.empty[Map[String, Row]] > val listener = new QueryExecutionListener { > override def onSuccess(funcName: String, qe: QueryExecution, duration: > Long): Unit = { > metricMaps += qe.observedMetrics > } > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = { > // No-op > } > } > spark.listenerManager.register(listener) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35695) QueryExecutionListener does not see any observed metrics fired before persist/cache
[ https://issues.apache.org/jira/browse/SPARK-35695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35695: Assignee: Apache Spark > QueryExecutionListener does not see any observed metrics fired before > persist/cache > --- > > Key: SPARK-35695 > URL: https://issues.apache.org/jira/browse/SPARK-35695 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Major > > This example properly fires the event > {code} > spark.range(100) > .observe( > name = "other_event", > avg($"id").cast("int").as("avg_val")) > .collect() > {code} > But when I add persist, then no event is fired or seen (not sure which): > {code} > spark.range(100) > .observe( > name = "my_event", > avg($"id").cast("int").as("avg_val")) > .persist() > .collect() > {code} > The listener: > {code} > val metricMaps = ArrayBuffer.empty[Map[String, Row]] > val listener = new QueryExecutionListener { > override def onSuccess(funcName: String, qe: QueryExecution, duration: > Long): Unit = { > metricMaps += qe.observedMetrics > } > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = { > // No-op > } > } > spark.listenerManager.register(listener) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org