[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492420#comment-17492420
 ] 

Apache Spark commented on SPARK-36808:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/35525

> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a number of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html






[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492417#comment-17492417
 ] 

Apache Spark commented on SPARK-36808:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/35523

> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a number of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html






[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492413#comment-17492413
 ] 

Apache Spark commented on SPARK-36808:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/35523

> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a number of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html






[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492414#comment-17492414
 ] 

Apache Spark commented on SPARK-36808:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/35524

> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a number of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html






[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-14 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492408#comment-17492408
 ] 

Kousuke Saruta commented on SPARK-36808:


[~dongjoon] Sure, I'll do it.

> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a number of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html






[jira] [Commented] (SPARK-37707) Allow store assignment between TimestampNTZ and Date/Timestamp

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492405#comment-17492405
 ] 

Apache Spark commented on SPARK-37707:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35522

> Allow store assignment between TimestampNTZ  and Date/Timestamp
> ---
>
> Key: SPARK-37707
> URL: https://issues.apache.org/jira/browse/SPARK-37707
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Allow store assignment between:
>  * TimestampNTZ <=> Date
>  * TimestampNTZ <=> Timestamp
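
For illustration, a minimal PySpark sketch (an assumption, not taken from this
ticket; the table name and the TIMESTAMP_NTZ literal syntax are assumed to be
available in the targeted Spark 3.3.0 build) of the store-assignment cases listed
above:

{code:python}
# Hypothetical sketch: write TIMESTAMP_NTZ values into DATE and TIMESTAMP columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ntz-store-assignment-sketch").getOrCreate()

spark.sql("CREATE TABLE t (d DATE, ts TIMESTAMP) USING parquet")
# With store assignment allowed, the TIMESTAMP_NTZ values below are converted to
# the column types on write instead of being rejected by the analyzer.
spark.sql(
    "INSERT INTO t VALUES "
    "(TIMESTAMP_NTZ'2022-02-14 00:00:00', TIMESTAMP_NTZ'2022-02-14 12:00:00')"
)
{code}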






[jira] [Commented] (SPARK-38213) support Metrics information report to kafkaSink.

2022-02-14 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492391#comment-17492391
 ] 

Senthil Kumar commented on SPARK-38213:
---

Working on this

> support Metrics information report to kafkaSink.
> 
>
> Key: SPARK-38213
> URL: https://issues.apache.org/jira/browse/SPARK-38213
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: YuanGuanhu
>Priority: Major
>
> Spark currently supports ConsoleSink/CsvSink/GraphiteSink/JmxSink, etc. Now we 
> want to report metrics information to Kafka; we can work to support this.






[jira] [Created] (SPARK-38213) support Metrics information report to kafkaSink.

2022-02-14 Thread YuanGuanhu (Jira)
YuanGuanhu created SPARK-38213:
--

 Summary: support Metrics information report to kafkaSink.
 Key: SPARK-38213
 URL: https://issues.apache.org/jira/browse/SPARK-38213
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.2.1
Reporter: YuanGuanhu


Spark currently supports ConsoleSink/CsvSink/GraphiteSink/JmxSink, etc. Now we want 
to report metrics information to Kafka; we can work to support this.
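
As context, a minimal sketch (an assumption, not taken from this ticket) of how the
existing built-in sinks are enabled through spark.metrics.conf.* properties; a
KafkaSink would presumably be wired up through the same mechanism with a sink class
plus broker/topic parameters.

{code:python}
# Hypothetical sketch: enable the built-in ConsoleSink via metrics properties.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("metrics-sink-sketch")
    .config("spark.metrics.conf.*.sink.console.class",
            "org.apache.spark.metrics.sink.ConsoleSink")
    .config("spark.metrics.conf.*.sink.console.period", "10")
    .config("spark.metrics.conf.*.sink.console.unit", "seconds")
    .getOrCreate()
)
{code}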






[jira] [Assigned] (SPARK-38212) Streaming dropDuplicates should remove out-of-watermark keys

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38212:


Assignee: (was: Apache Spark)

> Streaming dropDuplicates should remove out-of-watermark keys
> 
>
> Key: SPARK-38212
> URL: https://issues.apache.org/jira/browse/SPARK-38212
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently, streaming dropDuplicates simply stores all state forever if no 
> watermark-attached key exists.
> For a streaming stateful operator, this is counterintuitive behavior when we 
> consider other stateful operations such as streaming joins.
> It also means the amount of state grows without bound, as we observe in real 
> applications. But it doesn't make sense to add a time column to the dedup 
> keys, because we don't really deduplicate rows by [key, time] but only by 
> [key].
> A more reasonable streaming dedup would remove out-of-watermark state if 
> the input data (not the key) has a watermark. This would be a behavior change 
> for streaming queries with a dedup operation.






[jira] [Commented] (SPARK-38212) Streaming dropDuplicates should remove out-of-watermark keys

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492386#comment-17492386
 ] 

Apache Spark commented on SPARK-38212:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/35521

> Streaming dropDuplicates should remove out-of-watermark keys
> 
>
> Key: SPARK-38212
> URL: https://issues.apache.org/jira/browse/SPARK-38212
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently, streaming dropDuplicates simply stores all state forever if no 
> watermark-attached key exists.
> For a streaming stateful operator, this is counterintuitive behavior when we 
> consider other stateful operations such as streaming joins.
> It also means the amount of state grows without bound, as we observe in real 
> applications. But it doesn't make sense to add a time column to the dedup 
> keys, because we don't really deduplicate rows by [key, time] but only by 
> [key].
> A more reasonable streaming dedup would remove out-of-watermark state if 
> the input data (not the key) has a watermark. This would be a behavior change 
> for streaming queries with a dedup operation.






[jira] [Assigned] (SPARK-38212) Streaming dropDuplicates should remove out-of-watermark keys

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38212:


Assignee: Apache Spark

> Streaming dropDuplicates should remove out-of-watermark keys
> 
>
> Key: SPARK-38212
> URL: https://issues.apache.org/jira/browse/SPARK-38212
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Currently, streaming dropDuplicates simply stores all state forever if no 
> watermark-attached key exists.
> For a streaming stateful operator, this is counterintuitive behavior when we 
> consider other stateful operations such as streaming joins.
> It also means the amount of state grows without bound, as we observe in real 
> applications. But it doesn't make sense to add a time column to the dedup 
> keys, because we don't really deduplicate rows by [key, time] but only by 
> [key].
> A more reasonable streaming dedup would remove out-of-watermark state if 
> the input data (not the key) has a watermark. This would be a behavior change 
> for streaming queries with a dedup operation.






[jira] [Created] (SPARK-38212) Streaming dropDuplicates should remove out-of-watermark keys

2022-02-14 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-38212:
---

 Summary: Streaming dropDuplicates should remove out-of-watermark 
keys
 Key: SPARK-38212
 URL: https://issues.apache.org/jira/browse/SPARK-38212
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: L. C. Hsieh


Currently, streaming dropDuplicates simply stores all state forever if no 
watermark-attached key exists.

For a streaming stateful operator, this is counterintuitive behavior when we 
consider other stateful operations such as streaming joins.

It also means the amount of state grows without bound, as we observe in real 
applications. But it doesn't make sense to add a time column to the dedup keys, 
because we don't really deduplicate rows by [key, time] but only by [key].

A more reasonable streaming dedup would remove out-of-watermark state if the 
input data (not the key) has a watermark. This would be a behavior change for 
streaming queries with a dedup operation.
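
For illustration, a minimal PySpark sketch (an assumption, not taken from this
ticket) of the pattern being discussed: the watermark is defined on the input data
while the dedup keys do not include the event-time column, so today the per-key
state is kept forever.

{code:python}
# Hypothetical sketch of watermarked input deduplicated by [key] only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-watermark-sketch").getOrCreate()

events = (
    spark.readStream.format("rate").load()        # columns: timestamp, value
    .withColumnRenamed("value", "key")
    .withWatermark("timestamp", "10 minutes")     # watermark on the input data
)

# Deduplicate by [key] only; the proposal is to drop state for keys that have
# fallen entirely behind the watermark instead of keeping it forever.
deduped = events.dropDuplicates(["key"])

query = deduped.writeStream.format("console").outputMode("append").start()
{code}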






[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492372#comment-17492372
 ] 

Dongjoon Hyun commented on SPARK-36808:
---

+1 for backporting. Could you make a PR, please, [~sarutak]?

> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a number of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html






[jira] [Resolved] (SPARK-38203) Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode

2022-02-14 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-38203.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35511
[https://github.com/apache/spark/pull/35511]

> Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode
> -
>
> Key: SPARK-38203
> URL: https://issues.apache.org/jira/browse/SPARK-38203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Resolved] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True

2022-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38145.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35454
[https://github.com/apache/spark/pull/35454]

> Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
> 
>
> Key: SPARK-38145
> URL: https://issues.apache.org/jira/browse/SPARK-38145
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.3.0
>
>
> The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but 
> several tests fail when `spark.sql.ansi.enabled` is set to True.
> For the stability of the project, it's always good to have the tests pass 
> regardless of the option value.
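
For illustration, a minimal sketch (an assumption, not taken from this ticket) of
why a test can pass or fail depending on the flag: the same query returns NULL with
ANSI mode off but raises an error with it on.

{code:python}
# Hypothetical sketch: the ANSI flag changes the behavior of the same query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ansi-flag-sketch").getOrCreate()

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT 1 / 0").show()   # prints NULL

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT 1 / 0").show()   # raises a divide-by-zero error
{code}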






[jira] [Assigned] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True

2022-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38145:


Assignee: Haejoon Lee

> Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
> 
>
> Key: SPARK-38145
> URL: https://issues.apache.org/jira/browse/SPARK-38145
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but 
> several tests fail when `spark.sql.ansi.enabled` is set to True.
> For the stability of the project, it's always good to have the tests pass 
> regardless of the option value.






[jira] [Commented] (SPARK-37867) Compile aggregate functions of build-in JDBC dialect

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492351#comment-17492351
 ] 

Apache Spark commented on SPARK-37867:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35520

> Compile aggregate functions of build-in JDBC dialect
> 
>
> Key: SPARK-37867
> URL: https://issues.apache.org/jira/browse/SPARK-37867
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Commented] (SPARK-37867) Compile aggregate functions of build-in JDBC dialect

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492350#comment-17492350
 ] 

Apache Spark commented on SPARK-37867:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35520

> Compile aggregate functions of build-in JDBC dialect
> 
>
> Key: SPARK-37867
> URL: https://issues.apache.org/jira/browse/SPARK-37867
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Assigned] (SPARK-38207) Add bucketed scan behavior change to migration guide

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38207:


Assignee: Apache Spark

> Add bucketed scan behavior change to migration guide
> 
>
> Key: SPARK-38207
> URL: https://issues.apache.org/jira/browse/SPARK-38207
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.2
>Reporter: Manu Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> The default behavior of bucketed scans was changed in 
> https://issues.apache.org/jira/browse/SPARK-32859 but it's not mentioned in 
> the SQL migration guide.
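
For illustration, a minimal sketch (an assumption; the config name comes from the
SPARK-32859 work linked above, not from this ticket) of how a user could restore
the always-bucketed-scan behavior that the migration-guide entry would describe:

{code:python}
# Hypothetical sketch: keep bucketed scans always enabled instead of letting the
# planner disable them automatically (the default behavior since SPARK-32859).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketed-scan-sketch").getOrCreate()
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
{code}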






[jira] [Commented] (SPARK-38207) Add bucketed scan behavior change to migration guide

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492346#comment-17492346
 ] 

Apache Spark commented on SPARK-38207:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35514

> Add bucketed scan behavior change to migration guide
> 
>
> Key: SPARK-38207
> URL: https://issues.apache.org/jira/browse/SPARK-38207
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.2
>Reporter: Manu Zhang
>Priority: Minor
>
> The default behavior of bucketed scans was changed in 
> https://issues.apache.org/jira/browse/SPARK-32859 but it's not mentioned in 
> the SQL migration guide.






[jira] [Resolved] (SPARK-38211) Add SQL migration guide on restoring loose upcast from string

2022-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38211.
--
Fix Version/s: 3.3.0
   3.0.4
   3.2.2
   3.1.3
   Resolution: Fixed

Issue resolved by pull request 35519
[https://github.com/apache/spark/pull/35519]

> Add SQL migration guide on restoring loose upcast from string
> -
>
> Key: SPARK-38211
> URL: https://issues.apache.org/jira/browse/SPARK-38211
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.3.0, 3.0.4, 3.2.2, 3.1.3
>
>
> After SPARK-24586, loose upcasting from string to other types is not allowed 
> by default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to 
> preserve the old behavior, but this is not documented in the SQL migration guide.
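
For illustration, a minimal sketch (an assumption, not taken from this ticket) of
the legacy flag mentioned above, which the migration-guide entry would document:

{code:python}
# Hypothetical sketch: restore the pre-SPARK-24586 loose upcast behavior.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loose-upcast-sketch").getOrCreate()
spark.conf.set("spark.sql.legacy.looseUpcast", "true")
{code}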






[jira] [Assigned] (SPARK-38207) Add bucketed scan behavior change to migration guide

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38207:


Assignee: (was: Apache Spark)

> Add bucketed scan behavior change to migration guide
> 
>
> Key: SPARK-38207
> URL: https://issues.apache.org/jira/browse/SPARK-38207
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.2
>Reporter: Manu Zhang
>Priority: Minor
>
> The default behavior of bucketed scans was changed in 
> https://issues.apache.org/jira/browse/SPARK-32859 but it's not mentioned in 
> the SQL migration guide.






[jira] [Assigned] (SPARK-38211) Add SQL migration guide on restoring loose upcast from string

2022-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38211:


Assignee: Manu Zhang

> Add SQL migration guide on restoring loose upcast from string
> -
>
> Key: SPARK-38211
> URL: https://issues.apache.org/jira/browse/SPARK-38211
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
>
> After SPARK-24586, loose upcasting from string to other types is not allowed 
> by default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to 
> preserve the old behavior, but this is not documented in the SQL migration guide.






[jira] [Resolved] (SPARK-38183) Show warning when creating pandas-on-Spark session under ANSI mode.

2022-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38183.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35488
[https://github.com/apache/spark/pull/35488]

> Show warning when creating pandas-on-Spark session under ANSI mode.
> ---
>
> Key: SPARK-38183
> URL: https://issues.apache.org/jira/browse/SPARK-38183
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.3.0
>
>
> Since pandas API on Spark follows the behavior of pandas, not SQL, some 
> unexpected behavior can occur when "spark.sql.ansi.enabled" is True.
> For example,
>  * It raises an exception when {{div}}- and {{mod}}-related methods return null 
> (e.g. {{{}DataFrame.rmod{}}})
> {code:java}
> >>> df
>angels  degress
> 0   0  360
> 1   3  180
> 2   4  360
> >>> df.rmod(2)
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 32.0 (TID 165) (172.30.1.44 executor driver): 
> org.apache.spark.SparkArithmeticException: divide by zero. To return NULL 
> instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false 
> (except for ANSI interval type) to bypass this error.{code}
>  * It raises an exception when the DataFrame for {{ps.melt}} does not have the 
> same column types.
>  
> {code:java}
> >>> df
>    A  B  C
> 0  a  1  2
> 1  b  3  4
> 2  c  5  6
> >>> ps.melt(df)
> Traceback (most recent call last):
> ...
> pyspark.sql.utils.AnalysisException: cannot resolve 'array(struct('A', A), 
> struct('B', B), struct('C', C))' due to data type mismatch: input to function 
> array should all be the same type, but it's 
> [struct, struct, 
> struct]
> To fix the error, you might need to add explicit type casts. If necessary set 
> spark.sql.ansi.enabled to false to bypass this error.;
> 'Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
> __natural_order__#231L, explode(array(struct(variable, A, value, A#224), 
> struct(variable, B, value, B#225L), struct(variable, C, value, C#226L))) AS 
> pairs#269]
> +- Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
> monotonically_increasing_id() AS __natural_order__#231L]
>    +- LogicalRDD [__index_level_0__#223L, A#224, B#225L, C#226L], false{code}
>  * It raises an exception when {{CategoricalIndex.remove_categories}} doesn't 
> remove the entire index
> {code:java}
> >>> idx
> CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c'], 
> ordered=False, dtype='category')
> >>> idx.remove_categories('b')
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 
> 215)
> org.apache.spark.SparkNoSuchElementException: Key b does not exist. If 
> necessary set spark.sql.ansi.strictIndexOperator to false to bypass this 
> error.
> ...
> ...{code}
>  * It raises an exception when {{CategoricalIndex.set_categories}} doesn't set 
> the entire index
> {code:java}
> >>> idx.set_categories(['b', 'c'])
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 
> 215)
> org.apache.spark.SparkNoSuchElementException: Key a does not exist. If 
> necessary set spark.sql.ansi.strictIndexOperator to false to bypass this 
> error.
> ...
> ...{code}
>  * It raises an exception when {{ps.to_numeric}} gets a non-numeric type
> {code:java}
> >>> psser
> 0    apple
> 1      1.0
> 2        2
> 3       -3
> dtype: object
> >>> ps.to_numeric(psser)
> 22/02/14 09:22:36 ERROR Executor: Exception in task 2.0 in stage 63.0 (TID 
> 328)
> org.apache.spark.SparkNumberFormatException: invalid input syntax for type 
> numeric: apple. To return NULL instead, use 'try_cast'. If necessary set 
> spark.sql.ansi.enabled to false to bypass this error.
> ...{code}
>  * It raises an exception when {{strings.StringMethods.rsplit}} (and also 
> {{strings.StringMethods.split}}) with {{expand=True}} returns null columns
> {code:java}
> >>> s
> 0                       this is a regular sentence
> 1    https://docs.python.org/3/tutorial/index.html
> 2                                             None
> dtype: object
> >>> s.str.split(n=4, expand=True)
> 22/02/14 09:26:23 ERROR Executor: Exception in task 5.0 in stage 69.0 (TID 
> 356)
> org.apache.spark.SparkArrayIndexOutOfBoundsException: Invalid index: 1, 
> numElements: 1. If necessary set spark.sql.ansi.strictIndexOperator to false 
> to bypass this error.{code}
>  * It raises an exception when {{astype}} is used with a {{CategoricalDtype}} 
> whose categories do not match the data.
> {code:java}
> 

[jira] [Assigned] (SPARK-38183) Show warning when creating pandas-on-Spark session under ANSI mode.

2022-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38183:


Assignee: Haejoon Lee

> Show warning when creating pandas-on-Spark session under ANSI mode.
> ---
>
> Key: SPARK-38183
> URL: https://issues.apache.org/jira/browse/SPARK-38183
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Since pandas API on Spark follows the behavior of pandas, not SQL, some 
> unexpected behavior can occur when "spark.sql.ansi.enabled" is True.
> For example,
>  * It raises an exception when {{div}}- and {{mod}}-related methods return null 
> (e.g. {{{}DataFrame.rmod{}}})
> {code:java}
> >>> df
>angels  degress
> 0   0  360
> 1   3  180
> 2   4  360
> >>> df.rmod(2)
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 32.0 (TID 165) (172.30.1.44 executor driver): 
> org.apache.spark.SparkArithmeticException: divide by zero. To return NULL 
> instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false 
> (except for ANSI interval type) to bypass this error.{code}
>  * It raises an exception when the DataFrame for {{ps.melt}} does not have the 
> same column types.
>  
> {code:java}
> >>> df
>    A  B  C
> 0  a  1  2
> 1  b  3  4
> 2  c  5  6
> >>> ps.melt(df)
> Traceback (most recent call last):
> ...
> pyspark.sql.utils.AnalysisException: cannot resolve 'array(struct('A', A), 
> struct('B', B), struct('C', C))' due to data type mismatch: input to function 
> array should all be the same type, but it's 
> [struct, struct, 
> struct]
> To fix the error, you might need to add explicit type casts. If necessary set 
> spark.sql.ansi.enabled to false to bypass this error.;
> 'Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
> __natural_order__#231L, explode(array(struct(variable, A, value, A#224), 
> struct(variable, B, value, B#225L), struct(variable, C, value, C#226L))) AS 
> pairs#269]
> +- Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
> monotonically_increasing_id() AS __natural_order__#231L]
>    +- LogicalRDD [__index_level_0__#223L, A#224, B#225L, C#226L], false{code}
>  * It raises an exception when {{CategoricalIndex.remove_categories}} doesn't 
> remove the entire index
> {code:java}
> >>> idx
> CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c'], 
> ordered=False, dtype='category')
> >>> idx.remove_categories('b')
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 
> 215)
> org.apache.spark.SparkNoSuchElementException: Key b does not exist. If 
> necessary set spark.sql.ansi.strictIndexOperator to false to bypass this 
> error.
> ...
> ...{code}
>  * It raises an exception when {{CategoricalIndex.set_categories}} doesn't set 
> the entire index
> {code:java}
> >>> idx.set_categories(['b', 'c'])
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 
> 215)
> org.apache.spark.SparkNoSuchElementException: Key a does not exist. If 
> necessary set spark.sql.ansi.strictIndexOperator to false to bypass this 
> error.
> ...
> ...{code}
>  * It raises an exception when {{ps.to_numeric}} gets a non-numeric type
> {code:java}
> >>> psser
> 0    apple
> 1      1.0
> 2        2
> 3       -3
> dtype: object
> >>> ps.to_numeric(psser)
> 22/02/14 09:22:36 ERROR Executor: Exception in task 2.0 in stage 63.0 (TID 
> 328)
> org.apache.spark.SparkNumberFormatException: invalid input syntax for type 
> numeric: apple. To return NULL instead, use 'try_cast'. If necessary set 
> spark.sql.ansi.enabled to false to bypass this error.
> ...{code}
>  * It raises an exception when {{strings.StringMethods.rsplit}} (and also 
> {{strings.StringMethods.split}}) with {{expand=True}} returns null columns
> {code:java}
> >>> s
> 0                       this is a regular sentence
> 1    https://docs.python.org/3/tutorial/index.html
> 2                                             None
> dtype: object
> >>> s.str.split(n=4, expand=True)
> 22/02/14 09:26:23 ERROR Executor: Exception in task 5.0 in stage 69.0 (TID 
> 356)
> org.apache.spark.SparkArrayIndexOutOfBoundsException: Invalid index: 1, 
> numElements: 1. If necessary set spark.sql.ansi.strictIndexOperator to false 
> to bypass this error.{code}
>  * It raises an exception when {{astype}} is used with a {{CategoricalDtype}} 
> whose categories do not match the data.
> {code:java}
> >>> psser
> 0    1994-01-31
> 1    1994-02-01
> 2    1994-02-02
> dtype: object
> >>> cat_type
> CategoricalDtype(categories=['a', 

[jira] [Commented] (SPARK-38211) Add SQL migration guide on restoring loose upcast from string

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492325#comment-17492325
 ] 

Apache Spark commented on SPARK-38211:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35519

> Add SQL migration guide on restoring loose upcast from string
> -
>
> Key: SPARK-38211
> URL: https://issues.apache.org/jira/browse/SPARK-38211
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Manu Zhang
>Priority: Minor
>
> After SPARK-24586, loose upcasting from string to other types is not allowed 
> by default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to 
> preserve the old behavior, but this is not documented in the SQL migration guide.






[jira] [Assigned] (SPARK-38211) Add SQL migration guide on restoring loose upcast from string

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38211:


Assignee: (was: Apache Spark)

> Add SQL migration guide on restoring loose upcast from string
> -
>
> Key: SPARK-38211
> URL: https://issues.apache.org/jira/browse/SPARK-38211
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Manu Zhang
>Priority: Minor
>
> After SPARK-24586, loose upcasting from string to other types is not allowed 
> by default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to 
> preserve the old behavior, but this is not documented in the SQL migration guide.






[jira] [Assigned] (SPARK-38211) Add SQL migration guide on restoring loose upcast from string

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38211:


Assignee: Apache Spark

> Add SQL migration guide on restoring loose upcast from string
> -
>
> Key: SPARK-38211
> URL: https://issues.apache.org/jira/browse/SPARK-38211
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Manu Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> After SPARK-24586, loose upcasting from string to other types is not allowed 
> by default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to 
> preserve the old behavior, but this is not documented in the SQL migration guide.






[jira] [Commented] (SPARK-38211) Add SQL migration guide on restoring loose upcast from string

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492324#comment-17492324
 ] 

Apache Spark commented on SPARK-38211:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35519

> Add SQL migration guide on restoring loose upcast from string
> -
>
> Key: SPARK-38211
> URL: https://issues.apache.org/jira/browse/SPARK-38211
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Manu Zhang
>Priority: Minor
>
> After SPARK-24586, loose upcasting from string to other types is not allowed 
> by default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to 
> preserve the old behavior, but this is not documented in the SQL migration guide.






[jira] [Updated] (SPARK-38211) Add SQL migration guide on restoring loose upcast from string

2022-02-14 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-38211:
---
Summary: Add SQL migration guide on restoring loose upcast from string  
(was: Add SQL migration guide on preserving loose upcast )

> Add SQL migration guide on restoring loose upcast from string
> -
>
> Key: SPARK-38211
> URL: https://issues.apache.org/jira/browse/SPARK-38211
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Manu Zhang
>Priority: Minor
>
> After SPARK-24586, loose upcasting from string to other types is not allowed 
> by default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to 
> preserve the old behavior, but this is not documented in the SQL migration guide.






[jira] [Created] (SPARK-38211) Add SQL migration guide on preserving loose upcast

2022-02-14 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-38211:
--

 Summary: Add SQL migration guide on preserving loose upcast 
 Key: SPARK-38211
 URL: https://issues.apache.org/jira/browse/SPARK-38211
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.2.1
Reporter: Manu Zhang


After SPARK-24586, loose upcasting from string to other types is not allowed by 
default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to preserve the 
old behavior, but this is not documented in the SQL migration guide.






[jira] [Commented] (SPARK-35173) Support columns batch adding in PySpark.dataframe

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492320#comment-17492320
 ] 

Apache Spark commented on SPARK-35173:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35518

> Support columns batch adding in PySpark.dataframe
> -
>
> Key: SPARK-35173
> URL: https://issues.apache.org/jira/browse/SPARK-35173
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, PySpark can only use withColumn to add a single column or replace 
> an existing column that has the same name, while the Scala withColumn can add 
> multiple columns in one pass. [1]
>  
> Before this is added, the user can only call withColumn again and again, like:
>  
> {code:java}
> self.df.withColumn("key1", col("key1")).withColumn("key2", 
> col("key2")).withColumn("key3", col("key3")){code}
>  
> After this is supported, the user can use with_columns to complete the batch 
> operation:
>  
> {code:java}
> self.df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), 
> col("key3")]){code}
>  
> [1] 
> [https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402]






[jira] [Commented] (SPARK-38121) Use SparkSession instead of SQLContext inside PySpark

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492310#comment-17492310
 ] 

Apache Spark commented on SPARK-38121:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35517

> Use SparkSession instead of SQLContext inside PySpark
> -
>
> Key: SPARK-38121
> URL: https://issues.apache.org/jira/browse/SPARK-38121
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> We use SQLContext within PySpark. We should remove such usage so that 
> SparkSession and the runtime configuration, etc., are properly respected.
> We should also expose {{df.sparkSession}}.
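
For illustration, a minimal sketch (an assumption, not taken from this ticket) of
the accessor mentioned above: getting the owning session back from a DataFrame
rather than going through the legacy SQLContext.

{code:python}
# Hypothetical sketch: prefer DataFrame.sparkSession over the legacy SQLContext.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("session-accessor-sketch").getOrCreate()

df = spark.range(3)
session = df.sparkSession                  # the session that created this DataFrame
print(session.conf.get("spark.app.name"))  # runtime configuration comes from it too
{code}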






[jira] [Commented] (SPARK-38121) Use SparkSession instead of SQLContext inside PySpark

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492309#comment-17492309
 ] 

Apache Spark commented on SPARK-38121:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35517

> Use SparkSession instead of SQLContext inside PySpark
> -
>
> Key: SPARK-38121
> URL: https://issues.apache.org/jira/browse/SPARK-38121
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> We use SQLContext within PySpark. We should remove such usage so that 
> SparkSession and the runtime configuration, etc., are properly respected.
> We should also expose {{df.sparkSession}}.






[jira] [Resolved] (SPARK-38121) Use SparkSession instead of SQLContext inside PySpark

2022-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38121.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35410
[https://github.com/apache/spark/pull/35410]

> Use SparkSession instead of SQLContext inside PySpark
> -
>
> Key: SPARK-38121
> URL: https://issues.apache.org/jira/browse/SPARK-38121
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> We use SQLContext within PySpark. We should remove such usage so that 
> SparkSession and the runtime configuration, etc., are properly respected.
> We should also expose {{df.sparkSession}}.






[jira] [Assigned] (SPARK-38121) Use SparkSession instead of SQLContext inside PySpark

2022-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38121:


Assignee: Hyukjin Kwon

> Use SparkSession instead of SQLContext inside PySpark
> -
>
> Key: SPARK-38121
> URL: https://issues.apache.org/jira/browse/SPARK-38121
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We use SQLContext within PySpark. We should remove such usage so that 
> SparkSession and the runtime configuration, etc., are properly respected.
> We should also expose {{df.sparkSession}}.






[jira] [Resolved] (SPARK-35173) Support columns batch adding in PySpark.dataframe

2022-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35173.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 32431
[https://github.com/apache/spark/pull/32431]

> Support columns batch adding in PySpark.dataframe
> -
>
> Key: SPARK-35173
> URL: https://issues.apache.org/jira/browse/SPARK-35173
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, PySpark can only use withColumn to add a single column or replace 
> an existing column that has the same name, while the Scala withColumn can add 
> multiple columns in one pass. [1]
>  
> Before this is added, the user can only call withColumn again and again, like:
>  
> {code:java}
> self.df.withColumn("key1", col("key1")).withColumn("key2", 
> col("key2")).withColumn("key3", col("key3")){code}
>  
> After this is supported, the user can use with_columns to complete the batch 
> operation:
>  
> {code:java}
> self.df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), 
> col("key3")]){code}
>  
> [1] 
> [https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402]






[jira] [Assigned] (SPARK-35173) Support columns batch adding in PySpark.dataframe

2022-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35173:


Assignee: Yikun Jiang

> Support columns batch adding in PySpark.dataframe
> -
>
> Key: SPARK-35173
> URL: https://issues.apache.org/jira/browse/SPARK-35173
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> Currently, PySpark can only use withColumn to add a single column or replace 
> an existing column that has the same name, while the Scala withColumn can add 
> multiple columns in one pass. [1]
>  
> Before this is added, the user can only call withColumn again and again, like:
>  
> {code:java}
> self.df.withColumn("key1", col("key1")).withColumn("key2", 
> col("key2")).withColumn("key3", col("key3")){code}
>  
> After this is supported, the user can use with_columns to complete the batch 
> operation:
>  
> {code:java}
> self.df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), 
> col("key3")]){code}
>  
> [1] 
> [https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402]






[jira] [Commented] (SPARK-38210) Spark documentation build README is stale

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492254#comment-17492254
 ] 

Apache Spark commented on SPARK-38210:
--

User 'khalidmammadov' has created a pull request for this issue:
https://github.com/apache/spark/pull/35516

> Spark documentation build README is stale
> -
>
> Key: SPARK-38210
> URL: https://issues.apache.org/jira/browse/SPARK-38210
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Khalid Mammadov
>Priority: Minor
>
> I was following docs/README.md to build the documentation and found out that it's 
> not complete. I had to install additional packages that are not documented but 
> are available in the [CI/CD phase 
> |https://github.com/apache/spark/blob/c8b34ab7340265f1f2bec2afa694c10f174b222c/.github/workflows/build_and_test.yml#L526] and 
> a few more to finish the build process.
> I will file a PR to change README.md to include these packages and improve 
> the guide.






[jira] [Assigned] (SPARK-38210) Spark documentation build README is stale

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38210:


Assignee: (was: Apache Spark)

> Spark documentation build README is stale
> -
>
> Key: SPARK-38210
> URL: https://issues.apache.org/jira/browse/SPARK-38210
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Khalid Mammadov
>Priority: Minor
>
> I was following docs/README.md to build the documentation and found out that it's 
> not complete. I had to install additional packages that are not documented but 
> are available in the [CI/CD phase 
> |https://github.com/apache/spark/blob/c8b34ab7340265f1f2bec2afa694c10f174b222c/.github/workflows/build_and_test.yml#L526] and 
> a few more to finish the build process.
> I will file a PR to change README.md to include these packages and improve 
> the guide.






[jira] [Commented] (SPARK-38210) Spark documentation build README is stale

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492253#comment-17492253
 ] 

Apache Spark commented on SPARK-38210:
--

User 'khalidmammadov' has created a pull request for this issue:
https://github.com/apache/spark/pull/35516

> Spark documentation build README is stale
> -
>
> Key: SPARK-38210
> URL: https://issues.apache.org/jira/browse/SPARK-38210
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Khalid Mammadov
>Priority: Minor
>
> I was following docs/README.md to build the documentation and found out that it's 
> not complete. I had to install additional packages that are not documented but 
> are available in the [CI/CD phase 
> |https://github.com/apache/spark/blob/c8b34ab7340265f1f2bec2afa694c10f174b222c/.github/workflows/build_and_test.yml#L526] and 
> a few more to finish the build process.
> I will file a PR to change README.md to include these packages and improve 
> the guide.






[jira] [Assigned] (SPARK-38210) Spark documentation build README is stale

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38210:


Assignee: Apache Spark

> Spark documentation build README is stale
> -
>
> Key: SPARK-38210
> URL: https://issues.apache.org/jira/browse/SPARK-38210
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Khalid Mammadov
>Assignee: Apache Spark
>Priority: Minor
>
> I was following docs/README.md to build the documentation and found out that it's 
> not complete. I had to install additional packages that are not documented but 
> are available in the [CI/CD phase 
> |https://github.com/apache/spark/blob/c8b34ab7340265f1f2bec2afa694c10f174b222c/.github/workflows/build_and_test.yml#L526] and 
> a few more to finish the build process.
> I will file a PR to change README.md to include these packages and improve 
> the guide.






[jira] [Updated] (SPARK-38183) Show warning when creating pandas-on-Spark session under ANSI mode.

2022-02-14 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-38183:

Description: 
Since pandas API on Spark follows the behavior of pandas, not SQL, some 
unexpected behavior can occur when "spark.sql.ansi.enabled" is True.

For example,
 * It raises an exception when {{div}}- and {{mod}}-related methods return null 
(e.g. {{{}DataFrame.rmod{}}})

{code:java}
>>> df
   angels  degress
0   0  360
1   3  180
2   4  360
>>> df.rmod(2)
Traceback (most recent call last):
...
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 32.0 
(TID 165) (172.30.1.44 executor driver): 
org.apache.spark.SparkArithmeticException: divide by zero. To return NULL 
instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false 
(except for ANSI interval type) to bypass this error.{code}
 * It raises an exception when the DataFrame for {{ps.melt}} does not have the 
same column types.

 
{code:java}
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
>>> ps.melt(df)
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve 'array(struct('A', A), 
struct('B', B), struct('C', C))' due to data type mismatch: input to function 
array should all be the same type, but it's 
[struct, struct, 
struct]
To fix the error, you might need to add explicit type casts. If necessary set 
spark.sql.ansi.enabled to false to bypass this error.;
'Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
__natural_order__#231L, explode(array(struct(variable, A, value, A#224), 
struct(variable, B, value, B#225L), struct(variable, C, value, C#226L))) AS 
pairs#269]
+- Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
monotonically_increasing_id() AS __natural_order__#231L]
   +- LogicalRDD [__index_level_0__#223L, A#224, B#225L, C#226L], false{code}
 * It raises an exception when {{CategoricalIndex.remove_categories}} doesn't 
remove the entire index

{code:java}
>>> idx
CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c'], 
ordered=False, dtype='category')
>>> idx.remove_categories('b')
22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 215)
org.apache.spark.SparkNoSuchElementException: Key b does not exist. If 
necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.
...
...{code}
 * It raises an exception when {{CategoricalIndex.set_categories}} doesn't set the 
entire index

{code:java}
>>> idx.set_categories(['b', 'c'])
22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 215)
org.apache.spark.SparkNoSuchElementException: Key a does not exist. If 
necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.
...
...{code}
 * It raises an exception when {{ps.to_numeric}} gets non-numeric input

{code:java}
>>> psser
0    apple
1      1.0
2        2
3       -3
dtype: object
>>> ps.to_numeric(psser)
22/02/14 09:22:36 ERROR Executor: Exception in task 2.0 in stage 63.0 (TID 328)
org.apache.spark.SparkNumberFormatException: invalid input syntax for type 
numeric: apple. To return NULL instead, use 'try_cast'. If necessary set 
spark.sql.ansi.enabled to false to bypass this error.
...{code}
 * It raises an exception when {{strings.StringMethods.rsplit}} - and also 
{{strings.StringMethods.split}} - with {{expand=True}} returns null columns

{code:java}
>>> s
0                       this is a regular sentence
1    https://docs.python.org/3/tutorial/index.html
2                                             None
dtype: object
>>> s.str.split(n=4, expand=True)
22/02/14 09:26:23 ERROR Executor: Exception in task 5.0 in stage 69.0 (TID 356)
org.apache.spark.SparkArrayIndexOutOfBoundsException: Invalid index: 1, 
numElements: 1. If necessary set spark.sql.ansi.strictIndexOperator to false to 
bypass this error.{code}
 * It raises an exception when calling {{astype}} with a {{CategoricalDtype}} 
whose categories do not match the data.

{code:java}
>>> psser
0    1994-01-31
1    1994-02-01
2    1994-02-02
dtype: object
>>> cat_type
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
>>> psser.astype(cat_type)
22/02/14 09:34:56 ERROR Executor: Exception in task 5.0 in stage 90.0 (TID 468)
org.apache.spark.SparkNoSuchElementException: Key 1994-02-01 does not exist. If 
necessary set spark.sql.ansi.strictIndexOperator to false to bypass this 
error.{code}
Beyond these example cases, an unexpected error may occur whenever the internal 
SQL function used to implement a pandas-on-Spark API behaves differently under 
the ANSI options.

So we might need to show a proper warning message when creating a pandas-on-Spark 
session, as sketched below.
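
For illustration, a minimal sketch of the kind of check such a warning could rely 
on (the helper name is hypothetical, not the actual implementation):

{code:python}
import warnings

from pyspark.sql import SparkSession


def warn_if_ansi_enabled(spark: SparkSession) -> None:
    # Hypothetical helper: warn when pandas-on-Spark is used while ANSI mode is on.
    if spark.conf.get("spark.sql.ansi.enabled", "false").lower() == "true":
        warnings.warn(
            "pandas API on Spark follows pandas semantics, not ANSI SQL; "
            "some operations may fail unexpectedly while "
            "spark.sql.ansi.enabled is true.",
            UserWarning,
        )
{code}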

  was:
Since pandas API on Spark follows the behavior of pandas, not SQL, some 
unexpected behavior can be occurred when "spark.sql.ansi.enable" 

[jira] [Created] (SPARK-38210) Spark documentation build README is stale

2022-02-14 Thread Khalid Mammadov (Jira)
Khalid Mammadov created SPARK-38210:
---

 Summary: Spark documentation build README is stale
 Key: SPARK-38210
 URL: https://issues.apache.org/jira/browse/SPARK-38210
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.2.1
Reporter: Khalid Mammadov


I was following docs/README.md to build the documentation and found out that it's
not complete. I had to install additional packages that are not documented but are
available in the [CI/CD 
phase|https://github.com/apache/spark/blob/c8b34ab7340265f1f2bec2afa694c10f174b222c/.github/workflows/build_and_test.yml#L526],
and a few more, to finish the build process.

I will file a PR to change README.md to include these packages and improve the
guide.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-14 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492248#comment-17492248
 ] 

Ismael Juma commented on SPARK-36808:
-

I think it would be good to backport this to the stable version of Spark as it 
includes a number of important fixes:

[https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html]

> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a bunch of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33913) Upgrade Kafka to 2.8.0

2022-02-14 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492247#comment-17492247
 ] 

Ismael Juma commented on SPARK-33913:
-

Sounds good, I had not seen the other ticket.

> Upgrade Kafka to 2.8.0
> --
>
> Key: SPARK-33913
> URL: https://issues.apache.org/jira/browse/SPARK-33913
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, DStreams
>Affects Versions: 3.2.0
>Reporter: dengziming
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>
> This issue aims to upgrade Kafka client to 2.8.0.
> Note that Kafka 2.8.0 uses ZSTD JNI 1.4.9-1 like Apache Spark 3.2.0.
> *RELEASE NOTE*
> - https://downloads.apache.org/kafka/2.8.0/RELEASE_NOTES.html
> - https://downloads.apache.org/kafka/2.7.0/RELEASE_NOTES.html
> This will bring the latest client-side improvement and bug fixes like the 
> following examples.
> - KAFKA-10631 ProducerFencedException is not Handled on Offest Commit
> - KAFKA-10134 High CPU issue during rebalance in Kafka consumer after 
> upgrading to 2.5
> - KAFKA-12193 Re-resolve IPs when a client is disconnected
> - KAFKA-10090 Misleading warnings: The configuration was supplied but isn't a 
> known config
> - KAFKA-9263 The new hw is added to incorrect log when  
> ReplicaAlterLogDirsThread is replacing log 
> - KAFKA-10607 Ensure the error counts contains the NONE
> - KAFKA-10458 Need a way to update quota for TokenBucket registered with 
> Sensor
> - KAFKA-10503 MockProducer doesn't throw ClassCastException when no partition 
> for topic



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33913) Upgrade Kafka to 2.8.0

2022-02-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492242#comment-17492242
 ] 

Dongjoon Hyun commented on SPARK-33913:
---

Actually, you had better ping on SPARK-36808 (Upgrade Kafka to 2.8.1) instead 
of this 2.8.0 JIRA, [~ijuma].
BTW, +1 for the suggestion.

> Upgrade Kafka to 2.8.0
> --
>
> Key: SPARK-33913
> URL: https://issues.apache.org/jira/browse/SPARK-33913
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, DStreams
>Affects Versions: 3.2.0
>Reporter: dengziming
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>
> This issue aims to upgrade Kafka client to 2.8.0.
> Note that Kafka 2.8.0 uses ZSTD JNI 1.4.9-1 like Apache Spark 3.2.0.
> *RELEASE NOTE*
> - https://downloads.apache.org/kafka/2.8.0/RELEASE_NOTES.html
> - https://downloads.apache.org/kafka/2.7.0/RELEASE_NOTES.html
> This will bring the latest client-side improvement and bug fixes like the 
> following examples.
> - KAFKA-10631 ProducerFencedException is not Handled on Offest Commit
> - KAFKA-10134 High CPU issue during rebalance in Kafka consumer after 
> upgrading to 2.5
> - KAFKA-12193 Re-resolve IPs when a client is disconnected
> - KAFKA-10090 Misleading warnings: The configuration was supplied but isn't a 
> known config
> - KAFKA-9263 The new hw is added to incorrect log when  
> ReplicaAlterLogDirsThread is replacing log 
> - KAFKA-10607 Ensure the error counts contains the NONE
> - KAFKA-10458 Need a way to update quota for TokenBucket registered with 
> Sensor
> - KAFKA-10503 MockProducer doesn't throw ClassCastException when no partition 
> for topic



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38144) Remove unused `spark.storage.safetyFraction` config

2022-02-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38144:
--
Fix Version/s: 3.1.4
   (was: 3.1.3)

> Remove unused `spark.storage.safetyFraction` config
> ---
>
> Key: SPARK-38144
> URL: https://issues.apache.org/jira/browse/SPARK-38144
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1

2022-02-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38030:
--
Fix Version/s: 3.1.4
   (was: 3.1.3)

> Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
> -
>
> Key: SPARK-38030
> URL: https://issues.apache.org/jira/browse/SPARK-38030
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Shardul Mahadik
>Assignee: Shardul Mahadik
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
>
> One of our user queries failed in Spark 3.1.1 when using AQE with the 
> stacktrace mentioned below (some parts of the plan have been 
> redacted, but the structure is preserved).
> Debugging this issue, we found that the failure was within AQE calling 
> [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402].
> The query contains a cast over a column with non-nullable struct fields. 
> Canonicalization [removes nullability 
> information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45]
>  from the child {{AttributeReference}} of the Cast, however it does not 
> remove nullability information from the Cast's target dataType. This causes 
> the 
> [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290]
>  to return false because the child is now nullable and cast target data type 
> is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}.
> {code:java}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232]
> +- Union
>:- Project [cast(columnA#30) as struct<...>]
>:  +- BatchScan[columnA#30] hive.tbl 
>+- Project [cast(columnA#35) as struct<...>]
>   +- BatchScan[columnA#35] hive.tbl
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279)
>   at 

[jira] [Updated] (SPARK-38120) HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value

2022-02-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38120:
--
Fix Version/s: 3.1.4
   (was: 3.1.3)

> HiveExternalCatalog.listPartitions is failing when partition column name is 
> upper case and dot in partition value
> -
>
> Key: SPARK-38120
> URL: https://issues.apache.org/jira/browse/SPARK-38120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.1
>Reporter: Khalid Mammadov
>Assignee: Khalid Mammadov
>Priority: Minor
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
>
> The HiveExternalCatalog.listPartitions method call fails when a partition 
> column name is upper case and the partition value contains a dot. It's related to 
> this change: 
> [https://github.com/apache/spark/commit/f18b905f6cace7686ef169fda7de474079d0af23]
> The test case in that PR does not reproduce the issue because the partition column name 
> is lower case.
>  
> Below is how to reproduce the issue:
> scala> import org.apache.spark.sql.catalyst.TableIdentifier
> import org.apache.spark.sql.catalyst.TableIdentifier
> scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY 
> (partCol1 STRING, partCol2 STRING)")
> scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = 
> 'i.j') VALUES (100, 'John')")                               
> scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), 
> Some(Map("partCol2" -> "i.j"))).foreach(println)
> java.util.NoSuchElementException: key not found: partcol2
>   at scala.collection.immutable.Map$Map2.apply(Map.scala:227)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1(ExternalCatalogUtils.scala:205)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1$adapted(ExternalCatalogUtils.scala:202)
>   at scala.collection.immutable.Map$Map1.forall(Map.scala:196)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.isPartialPartitionSpec(ExternalCatalogUtils.scala:202)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6(HiveExternalCatalog.scala:1312)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6$adapted(HiveExternalCatalog.scala:1312)
>   at 
> scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$1(HiveExternalCatalog.scala:1312)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClientWrappingException(HiveExternalCatalog.scala:114)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1296)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:254)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:1251)
>   ... 47 elided



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38151) Handle `Pacific/Kanton` in DateTimeUtilsSuite

2022-02-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38151:
--
Fix Version/s: 3.1.4
   (was: 3.1.3)

> Handle `Pacific/Kanton` in DateTimeUtilsSuite
> -
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
>
> This issue aims to fix the flaky UT failures due to 
> https://bugs.openjdk.java.net/browse/JDK-8274407 (Update Timezone Data to 
> 2021c) and its backport commits that renamed 'Pacific/Enderbury' to 
> 'Pacific/Kanton' in the latest Java 17.0.2, 11.0.14, and 8u311.
> Rename Pacific/Enderbury to Pacific/Kanton.
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:771)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38056) Structured streaming not working in history server when using LevelDB

2022-02-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38056:
--
Fix Version/s: 3.1.4
   (was: 3.1.3)

> Structured streaming not working in history server when using LevelDB
> -
>
> Key: SPARK-38056
> URL: https://issues.apache.org/jira/browse/SPARK-38056
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wy
>Assignee: wy
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: local-1643373518829
>
>
> In 
> [SPARK-31953|https://github.com/apache/spark/commit/4f9667035886a67e6c9a4e8fad2efa390e87ca68],
>  structured streaming support was added to the history server. However, this does 
> not work when spark.history.store.path is set to save app info using LevelDB.
> This is because one of the keys of StreamingQueryData, runId, is of UUID type, 
> which is not supported by LevelDB. When replaying an event log file in the history 
> server, StreamingQueryStatusListener throws an exception when writing 
> info to the store, saying "java.lang.IllegalArgumentException: Type 
> java.util.UUID not allowed as key.".
> An example event log is provided in the attachments. When opening it in the history 
> server with spark.history.store.path set, no structured 
> streaming info is available.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38208) 'Column' object is not callable

2022-02-14 Thread Joyce Arruda Recacho (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joyce Arruda Recacho resolved SPARK-38208.
--
Fix Version/s: 3.1.2
   Resolution: Fixed

> 'Column' object is not callable
> ---
>
> Key: SPARK-38208
> URL: https://issues.apache.org/jira/browse/SPARK-38208
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Joyce Arruda Recacho
>Priority: Major
> Fix For: 3.1.2
>
>
> Hi guys, I have a simple dataframe and am trying to create one new column.
> This is its schema:
>  
>  df_operation_event_sellers.schema 
> Out[69]: 
> StructType(List(StructField(id,StringType,true),StructField(account_id,StringType,true),StructField(p_tenant_id,StringType,true),StructField(vendor_id,StringType,true),StructField(amount,DecimalType(38,18),true),StructField(operation_type,StringType,true),StructField(reference_id,StringType,true),StructField(date,TimestampType,true),StructField(carrier_id,StringType,true),StructField(account_number,StringType,true),StructField(data_source,StringType,true),StructField(entity,StringType,true),StructField(ingestion_date,DateType,true),StructField(event_type,StringType,false),StructField(amount_new,DecimalType(38,18),true),StructField(date_new,IntegerType,true),StructField(row_num,IntegerType,true)))
>  
> >>> command to create the new column
> df_operation_event_sellers= 
> df_operation_event_sellers.withColumn('flag_first_selling',when(col('row_num')
>  == 1,'YES').instead('NO'))
> ISSUE  TypeError: 'Column' object is not callable
>  
> What is happening?
> ps. I created other columns the same way successfully.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38208) 'Column' object is not callable

2022-02-14 Thread Joyce Arruda Recacho (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492114#comment-17492114
 ] 

Joyce Arruda Recacho edited comment on SPARK-38208 at 2/14/22, 5:20 PM:


The correct method is 'otherwise', not 'instead', for the 'else' condition.


was (Author: JIRAUSER285189):
the correct is 'otherwise' and not 'instead'.

> 'Column' object is not callable
> ---
>
> Key: SPARK-38208
> URL: https://issues.apache.org/jira/browse/SPARK-38208
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Joyce Arruda Recacho
>Priority: Major
>
> Hi guys, I have a simple dataframe and am trying to create one new column.
> This is its schema:
>  
>  df_operation_event_sellers.schema 
> Out[69]: 
> StructType(List(StructField(id,StringType,true),StructField(account_id,StringType,true),StructField(p_tenant_id,StringType,true),StructField(vendor_id,StringType,true),StructField(amount,DecimalType(38,18),true),StructField(operation_type,StringType,true),StructField(reference_id,StringType,true),StructField(date,TimestampType,true),StructField(carrier_id,StringType,true),StructField(account_number,StringType,true),StructField(data_source,StringType,true),StructField(entity,StringType,true),StructField(ingestion_date,DateType,true),StructField(event_type,StringType,false),StructField(amount_new,DecimalType(38,18),true),StructField(date_new,IntegerType,true),StructField(row_num,IntegerType,true)))
>  
> >>> command to create the new column
> df_operation_event_sellers= 
> df_operation_event_sellers.withColumn('flag_first_selling',when(col('row_num')
>  == 1,'YES').instead('NO'))
> ISSUE  TypeError: 'Column' object is not callable
>  
> What is happening?
> ps. I created other columns the same way successfully.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38208) 'Column' object is not callable

2022-02-14 Thread Joyce Arruda Recacho (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492114#comment-17492114
 ] 

Joyce Arruda Recacho commented on SPARK-38208:
--

The correct method is 'otherwise', not 'instead'.
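
A minimal corrected version of the expression from the issue, for reference 
(column and flag names reused from the report; assumes an active SparkSession):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["row_num"])

# when(...).otherwise(...) is the supported pattern; Column has no 'instead' method.
df = df.withColumn(
    "flag_first_selling",
    when(col("row_num") == 1, "YES").otherwise("NO"),
)
df.show()
{code}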

> 'Column' object is not callable
> ---
>
> Key: SPARK-38208
> URL: https://issues.apache.org/jira/browse/SPARK-38208
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Joyce Arruda Recacho
>Priority: Major
>
> Hi guys, I have a simple dataframe and am trying to create one new column.
> This is its schema:
>  
>  df_operation_event_sellers.schema 
> Out[69]: 
> StructType(List(StructField(id,StringType,true),StructField(account_id,StringType,true),StructField(p_tenant_id,StringType,true),StructField(vendor_id,StringType,true),StructField(amount,DecimalType(38,18),true),StructField(operation_type,StringType,true),StructField(reference_id,StringType,true),StructField(date,TimestampType,true),StructField(carrier_id,StringType,true),StructField(account_number,StringType,true),StructField(data_source,StringType,true),StructField(entity,StringType,true),StructField(ingestion_date,DateType,true),StructField(event_type,StringType,false),StructField(amount_new,DecimalType(38,18),true),StructField(date_new,IntegerType,true),StructField(row_num,IntegerType,true)))
>  
> >>> command to create the new column
> df_operation_event_sellers= 
> df_operation_event_sellers.withColumn('flag_first_selling',when(col('row_num')
>  == 1,'YES').instead('NO'))
> ISSUE  TypeError: 'Column' object is not callable
>  
> What is happening?
> ps. I created other columns the same way successfully.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38208) 'Column' object is not callable

2022-02-14 Thread Joyce Arruda Recacho (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joyce Arruda Recacho updated SPARK-38208:
-
Affects Version/s: 3.1.2

> 'Column' object is not callable
> ---
>
> Key: SPARK-38208
> URL: https://issues.apache.org/jira/browse/SPARK-38208
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Joyce Arruda Recacho
>Priority: Major
>
> Hi guys, I have a simple dataframe and am trying to create one new column.
> This is its schema:
>  
>  df_operation_event_sellers.schema 
> Out[69]: 
> StructType(List(StructField(id,StringType,true),StructField(account_id,StringType,true),StructField(p_tenant_id,StringType,true),StructField(vendor_id,StringType,true),StructField(amount,DecimalType(38,18),true),StructField(operation_type,StringType,true),StructField(reference_id,StringType,true),StructField(date,TimestampType,true),StructField(carrier_id,StringType,true),StructField(account_number,StringType,true),StructField(data_source,StringType,true),StructField(entity,StringType,true),StructField(ingestion_date,DateType,true),StructField(event_type,StringType,false),StructField(amount_new,DecimalType(38,18),true),StructField(date_new,IntegerType,true),StructField(row_num,IntegerType,true)))
>  
> >>> command to create the new column
> df_operation_event_sellers= 
> df_operation_event_sellers.withColumn('flag_first_selling',when(col('row_num')
>  == 1,'YES').instead('NO'))
> ISSUE  TypeError: 'Column' object is not callable
>  
> What is happening?
> ps. I created other columns the same way successfully.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38198) Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when `explainMode` is `CodegenMode`

2022-02-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-38198:
-
Fix Version/s: 3.2.2

> Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when 
> `explainMode` is `CodegenMode`
> ---
>
> Key: SPARK-38198
> URL: https://issues.apache.org/jira/browse/SPARK-38198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> The `QueryExecution.debug#toFile` method supports passing in `maxFields`, and this 
> parameter is passed down when `explainMode` is SimpleMode, ExtendedMode, 
> or CostMode, but the passed-down `maxFields` is currently ignored because 
> `QueryExecution#stringWithStats` overrides it with 
> `SQLConf.get.maxToStringFields`
>  
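
For context, a hedged PySpark-side illustration of the related user-facing knob; 
this is the SQL conf that the fallback reads, not the `toFile` parameter the fix 
addresses (assumes an existing SparkSession `spark` and DataFrame `df`):

{code:python}
# Global truncation limit for plan strings, read when no explicit maxFields is supplied.
spark.conf.set("spark.sql.maxToStringFields", "100")
# Codegen-mode explain output, the mode the fix targets on the Scala side.
df.explain(mode="codegen")
{code}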



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38209) Selectively include EXTERNAL TABLE source files via REGEX

2022-02-14 Thread melin (Jira)
melin created SPARK-38209:
-

 Summary: Selectively include EXTERNAL TABLE source files via REGEX
 Key: SPARK-38209
 URL: https://issues.apache.org/jira/browse/SPARK-38209
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: melin


https://issues.apache.org/jira/browse/HIVE-951

CREATE EXTERNAL TABLE should allow users to cherry-pick files via regular 
expression.
CREATE EXTERNAL TABLE was designed to allow users to access data that exists 
outside of Hive, and
currently makes the assumption that all of the files located under the supplied 
path should be included
in the new table. Users frequently encounter directories containing multiple
datasets, or directories that contain data in heterogeneous schemas, and it's 
often
impractical or impossible to adjust the layout of the directory to meet the 
requirements of
CREATE EXTERNAL TABLE. A good example of this problem is creating an external 
table based
on the contents of an S3 bucket.

One way to solve this problem is to extend the syntax of CREATE EXTERNAL TABLE
as follows:

CREATE EXTERNAL TABLE
...
LOCATION path [file_regex]
...
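
Until something like this exists, a hedged PySpark workaround is to pre-filter 
files with a glob pattern at read time and expose the result as a view (the 
bucket and path names below are made up for illustration):

{code:python}
# Assumes an existing SparkSession `spark`; path and pattern are illustrative only.
df = spark.read.parquet("s3a://my-bucket/exports/sales_*.parquet")
df.createOrReplaceTempView("sales_external")
{code}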



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38208) 'Column' object is not callable

2022-02-14 Thread Joyce Arruda Recacho (Jira)
Joyce Arruda Recacho created SPARK-38208:


 Summary: 'Column' object is not callable
 Key: SPARK-38208
 URL: https://issues.apache.org/jira/browse/SPARK-38208
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 3.2.1
Reporter: Joyce Arruda Recacho


Hi guys, I have a simple dataframe and am trying to create one new column.

This is its schema:

 

 df_operation_event_sellers.schema 

Out[69]: 
StructType(List(StructField(id,StringType,true),StructField(account_id,StringType,true),StructField(p_tenant_id,StringType,true),StructField(vendor_id,StringType,true),StructField(amount,DecimalType(38,18),true),StructField(operation_type,StringType,true),StructField(reference_id,StringType,true),StructField(date,TimestampType,true),StructField(carrier_id,StringType,true),StructField(account_number,StringType,true),StructField(data_source,StringType,true),StructField(entity,StringType,true),StructField(ingestion_date,DateType,true),StructField(event_type,StringType,false),StructField(amount_new,DecimalType(38,18),true),StructField(date_new,IntegerType,true),StructField(row_num,IntegerType,true)))

 

>>> command to create the new column

df_operation_event_sellers= 
df_operation_event_sellers.withColumn('flag_first_selling',when(col('row_num') 
== 1,'YES').instead('NO'))



ISSUE  TypeError: 'Column' object is not callable

 

What is happening?

ps. I created other columns the same way successfully.

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38198) Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when `explainMode` is `CodegenMode`

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491980#comment-17491980
 ] 

Apache Spark commented on SPARK-38198:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35515

> Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when 
> `explainMode` is `CodegenMode`
> ---
>
> Key: SPARK-38198
> URL: https://issues.apache.org/jira/browse/SPARK-38198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> The `QueryExecution.debug#toFile` method supports passing in `maxFields`, and this 
> parameter is passed down when `explainMode` is SimpleMode, ExtendedMode, 
> or CostMode, but the passed-down `maxFields` is currently ignored because 
> `QueryExecution#stringWithStats` overrides it with 
> `SQLConf.get.maxToStringFields`
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38198) Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when `explainMode` is `CodegenMode`

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491979#comment-17491979
 ] 

Apache Spark commented on SPARK-38198:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35515

> Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when 
> `explainMode` is `CodegenMode`
> ---
>
> Key: SPARK-38198
> URL: https://issues.apache.org/jira/browse/SPARK-38198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> The `QueryExecution.debug#toFile` method supports passing in `maxFields`, and this 
> parameter is passed down when `explainMode` is SimpleMode, ExtendedMode, 
> or CostMode, but the passed-down `maxFields` is currently ignored because 
> `QueryExecution#stringWithStats` overrides it with 
> `SQLConf.get.maxToStringFields`
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38097) Improve the error for pivoting of unsupported value types

2022-02-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38097.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35433
[https://github.com/apache/spark/pull/35433]

> Improve the error for pivoting of unsupported value types
> -
>
> Key: SPARK-38097
> URL: https://issues.apache.org/jira/browse/SPARK-38097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Yuto Akutsu
>Priority: Major
> Fix For: 3.3.0
>
>
> The error message from:
> {code:scala}
>   test("Improve the error for pivoting of unsupported value types") {
> trainingSales
>   .groupBy($"sales.year")
>   .pivot(struct(lower($"sales.course"), $"training"))
>   .agg(sum($"sales.earnings"))
>   .show(false)
>   }
> {code}
> can confuse users:
> {code:java}
> The feature is not supported: literal for '[dotnet,Dummies]' of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
> org.apache.spark.SparkRuntimeException: The feature is not supported: literal 
> for '[dotnet,Dummies]' of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:245)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset.$anonfun$pivot$2(RelationalGroupedDataset.scala:455)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> {code}
> Need to improve the error message and make it more precise.
> See https://github.com/apache/spark/pull/35302#discussion_r793629370
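
As a side note, a hedged workaround sketch in PySpark is to pivot on a single 
string column built from the struct fields instead of the struct itself (assumes 
a DataFrame named trainingSales with the schema used in the test above):

{code:python}
from pyspark.sql import functions as F

# Pivot on a concatenated string key rather than a struct, since struct values
# cannot be converted into pivot literals.
pivoted = (
    trainingSales
    .withColumn("course_training",
                F.concat_ws(",", F.lower(F.col("sales.course")), F.col("training")))
    .groupBy("sales.year")
    .pivot("course_training")
    .agg(F.sum("sales.earnings"))
)
pivoted.show(truncate=False)
{code}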



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38097) Improve the error for pivoting of unsupported value types

2022-02-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38097:


Assignee: Yuto Akutsu

> Improve the error for pivoting of unsupported value types
> -
>
> Key: SPARK-38097
> URL: https://issues.apache.org/jira/browse/SPARK-38097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Yuto Akutsu
>Priority: Major
>
> The error message from:
> {code:scala}
>   test("Improve the error for pivoting of unsupported value types") {
> trainingSales
>   .groupBy($"sales.year")
>   .pivot(struct(lower($"sales.course"), $"training"))
>   .agg(sum($"sales.earnings"))
>   .show(false)
>   }
> {code}
> can confuse users:
> {code:java}
> The feature is not supported: literal for '[dotnet,Dummies]' of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
> org.apache.spark.SparkRuntimeException: The feature is not supported: literal 
> for '[dotnet,Dummies]' of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:245)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset.$anonfun$pivot$2(RelationalGroupedDataset.scala:455)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> {code}
> Need to improve the error message and make it more precise.
> See https://github.com/apache/spark/pull/35302#discussion_r793629370



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38027) Undefined link function causing error in GLM that uses Tweedie family

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491956#comment-17491956
 ] 

Apache Spark commented on SPARK-38027:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35514

> Undefined link function causing error in GLM that uses Tweedie family
> -
>
> Key: SPARK-38027
> URL: https://issues.apache.org/jira/browse/SPARK-38027
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.1.2
> Environment: Running on Mac OS X Monterey
>Reporter: Evan Zamir
>Priority: Major
>  Labels: GLM, pyspark
>
> I am trying to use the GLM regression with a Tweedie distribution so I can 
> model insurance use cases. I have set up a very simple example adapted from 
> the docs:
> {code:python}
> def create_fake_losses_data(self):
>     df = self._spark.createDataFrame([
>         ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)),
>         ("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)),
>         ("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)),
>         ("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)),
>     ], ["user", "label", "offset", "weight", "features"])
>     logging.info(df.collect())
>     setattr(self, 'fake_data', df)
>     try:
>         glr = GeneralizedLinearRegression(
>             family="tweedie", variancePower=1.5, linkPower=-1,
>             offsetCol='offset')
>         glr.setRegParam(0.3)
>         model = glr.fit(df)
>         logging.info(model)
>     except Py4JJavaError as e:
>         print(e)
>     return self
> {code}
> This causes the following error:
> *py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString.
> : java.util.NoSuchElementException: Failed to find a default value for link*
> at 
> org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
> at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
> at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
> at org.apache.spark.ml.param.Params.$(params.scala:762)
> at org.apache.spark.ml.param.Params.$$(params.scala:762)
> at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
> at 
> org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> I was under the assumption that the default value for link is None, if not 
> defined otherwise.
>  
>  
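
A hedged workaround sketch: the stack trace above shows the failure surfaces in 
the JVM-side toString (triggered here by logging.info(model)), so logging the 
fitted parameters directly sidesteps that code path without fixing the underlying 
default-link issue:

{code:python}
# Assumes `model` is the fitted GeneralizedLinearRegressionModel from the snippet above.
logging.info("coefficients=%s intercept=%s", model.coefficients, model.intercept)
{code}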



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38027) Undefined link function causing error in GLM that uses Tweedie family

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38027:


Assignee: Apache Spark

> Undefined link function causing error in GLM that uses Tweedie family
> -
>
> Key: SPARK-38027
> URL: https://issues.apache.org/jira/browse/SPARK-38027
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.1.2
> Environment: Running on Mac OS X Monterey
>Reporter: Evan Zamir
>Assignee: Apache Spark
>Priority: Major
>  Labels: GLM, pyspark
>
> I am trying to use the GLM regression with a Tweedie distribution so I can 
> model insurance use cases. I have set up a very simple example adapted from 
> the docs:
> {code:python}
> def create_fake_losses_data(self):
>     df = self._spark.createDataFrame([
>         ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)),
>         ("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)),
>         ("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)),
>         ("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)),
>     ], ["user", "label", "offset", "weight", "features"])
>     logging.info(df.collect())
>     setattr(self, 'fake_data', df)
>     try:
>         glr = GeneralizedLinearRegression(
>             family="tweedie", variancePower=1.5, linkPower=-1,
>             offsetCol='offset')
>         glr.setRegParam(0.3)
>         model = glr.fit(df)
>         logging.info(model)
>     except Py4JJavaError as e:
>         print(e)
>     return self
> {code}
> This causes the following error:
> *py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString.
> : java.util.NoSuchElementException: Failed to find a default value for link*
> at 
> org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
> at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
> at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
> at org.apache.spark.ml.param.Params.$(params.scala:762)
> at org.apache.spark.ml.param.Params.$$(params.scala:762)
> at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
> at 
> org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> I was under the assumption that the default value for link is None, if not 
> defined otherwise.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38027) Undefined link function causing error in GLM that uses Tweedie family

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38027:


Assignee: (was: Apache Spark)

> Undefined link function causing error in GLM that uses Tweedie family
> -
>
> Key: SPARK-38027
> URL: https://issues.apache.org/jira/browse/SPARK-38027
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.1.2
> Environment: Running on Mac OS X Monterey
>Reporter: Evan Zamir
>Priority: Major
>  Labels: GLM, pyspark
>
> I am trying to use the GLM regression with a Tweedie distribution so I can 
> model insurance use cases. I have set up a very simple example adapted from 
> the docs:
> {code:python}
> def create_fake_losses_data(self):
>     df = self._spark.createDataFrame([
>         ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)),
>         ("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)),
>         ("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)),
>         ("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)),
>     ], ["user", "label", "offset", "weight", "features"])
>     logging.info(df.collect())
>     setattr(self, 'fake_data', df)
>     try:
>         glr = GeneralizedLinearRegression(
>             family="tweedie", variancePower=1.5, linkPower=-1,
>             offsetCol='offset')
>         glr.setRegParam(0.3)
>         model = glr.fit(df)
>         logging.info(model)
>     except Py4JJavaError as e:
>         print(e)
>     return self
> {code}
> This causes the following error:
> *py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString.
> : java.util.NoSuchElementException: Failed to find a default value for link*
> at 
> org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
> at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
> at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
> at org.apache.spark.ml.param.Params.$(params.scala:762)
> at org.apache.spark.ml.param.Params.$$(params.scala:762)
> at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
> at 
> org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> I was under the assumption that the default value for link is None, if not 
> defined otherwise.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38207) Add bucketed scan behavior change to migration guide

2022-02-14 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-38207:
--

 Summary: Add bucketed scan behavior change to migration guide
 Key: SPARK-38207
 URL: https://issues.apache.org/jira/browse/SPARK-38207
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.1.2
Reporter: Manu Zhang


The default behavior of bucketed scan was changed in 
https://issues.apache.org/jira/browse/SPARK-32859, but it's not mentioned in the SQL 
migration guide.
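
For the migration guide entry, a hedged sketch of the knob involved; the config 
name below is an assumption based on SPARK-32859 and should be verified against 
the release notes:

{code:python}
# Restore the pre-3.1 behavior of always using bucketed scan (assumed config from SPARK-32859).
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
{code}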



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491946#comment-17491946
 ] 

Apache Spark commented on SPARK-38033:
--

User 'LLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35513

> The structured streaming processing cannot be started because the commitId 
> and offsetId are inconsistent
> 
>
> Key: SPARK-38033
> URL: https://issues.apache.org/jira/browse/SPARK-38033
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.6, 3.0.3
>Reporter: LLiu
>Priority: Major
>
> Streaming Processing could not start due to an unexpected machine shutdown.
> The exception is as follows
>  
> {code:java}
> ERROR 22/01/12 02:48:36 MicroBatchExecution: Query 
> streaming_4a026335eafd4bb498ee51752b49f7fb [id = 
> 647ba9e4-16d2-4972-9824-6f9179588806, runId = 
> 92385d5b-f31f-40d0-9ac7-bb7d9796d774] terminated with error
> java.lang.IllegalStateException: batch 113258 doesn't exist
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>         at scala.Option.getOrElse(Option.scala:121)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:255)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:169)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>         at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>         at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>         at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
>         at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
>         at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
> {code}
> I checked the checkpoint files on HDFS and found that the latest offset is 113259, but 
> the latest commit is 113257, as shown below:
> {code:java}
> commits
> /tmp/streaming_/commits/113253
> /tmp/streaming_/commits/113254
> /tmp/streaming_/commits/113255
> /tmp/streaming_/commits/113256
> /tmp/streaming_/commits/113257
> offset
> /tmp/streaming_/offsets/113253
> /tmp/streaming_/offsets/113254
> /tmp/streaming_/offsets/113255
> /tmp/streaming_/offsets/113256
> /tmp/streaming_/offsets/113257
> /tmp/streaming_/offsets/113259{code}
> Finally, I deleted the offset file “/tmp/streaming_/offsets/113259” and the 
> program started normally. I think there is a problem here, and we should try 
> to handle this exception or at least give some guidance in the log.
>  
>  
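A minimal sketch of the kind of consistency check suggested above, assuming the standard offsets/commits checkpoint layout and a Hadoop-compatible file system (the path and helper name are illustrative, not Spark internals):
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Highest numeric file name in a checkpoint subdirectory such as offsets/ or commits/.
def latestBatchId(fs: FileSystem, dir: Path): Option[Long] = {
  if (!fs.exists(dir)) None
  else {
    val ids = fs.listStatus(dir).map(_.getPath.getName)
      .flatMap(name => scala.util.Try(name.toLong).toOption)
    if (ids.isEmpty) None else Some(ids.max)
  }
}

val checkpoint = new Path("/tmp/streaming_")  // checkpoint root from the report (name truncated there)
val fs = checkpoint.getFileSystem(new Configuration())
val latestOffset = latestBatchId(fs, new Path(checkpoint, "offsets"))
val latestCommit = latestBatchId(fs, new Path(checkpoint, "commits"))

// An offset more than one batch ahead of the latest commit means the previous run died
// after writing the offset but before committing; the reporter worked around it by
// deleting the dangling offset file.
(latestOffset, latestCommit) match {
  case (Some(o), Some(c)) if o > c + 1 =>
    println(s"checkpoint looks inconsistent: latest offset $o vs latest commit $c")
  case _ =>
    println(s"offsets/commits look consistent: $latestOffset / $latestCommit")
}
{code}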



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491944#comment-17491944
 ] 

Apache Spark commented on SPARK-38033:
--

User 'LLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35513

> The structured streaming processing cannot be started because the commitId 
> and offsetId are inconsistent
> 
>
> Key: SPARK-38033
> URL: https://issues.apache.org/jira/browse/SPARK-38033
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.6, 3.0.3
>Reporter: LLiu
>Priority: Major
>
> Streaming processing could not start after an unexpected machine shutdown.
> The exception is as follows:
>  
> {code:java}
> ERROR 22/01/12 02:48:36 MicroBatchExecution: Query 
> streaming_4a026335eafd4bb498ee51752b49f7fb [id = 
> 647ba9e4-16d2-4972-9824-6f9179588806, runId = 
> 92385d5b-f31f-40d0-9ac7-bb7d9796d774] terminated with error
> java.lang.IllegalStateException: batch 113258 doesn't exist
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>         at scala.Option.getOrElse(Option.scala:121)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:255)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:169)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>         at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>         at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>         at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
>         at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
>         at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
> {code}
> I checked the checkpoint files on HDFS and found that the latest offset is 113259,
> but the latest commit is 113257, as shown below:
> {code:java}
> commits
> /tmp/streaming_/commits/113253
> /tmp/streaming_/commits/113254
> /tmp/streaming_/commits/113255
> /tmp/streaming_/commits/113256
> /tmp/streaming_/commits/113257
> offset
> /tmp/streaming_/offsets/113253
> /tmp/streaming_/offsets/113254
> /tmp/streaming_/offsets/113255
> /tmp/streaming_/offsets/113256
> /tmp/streaming_/offsets/113257
> /tmp/streaming_/offsets/113259{code}
> Finally, I deleted the offset file “/tmp/streaming_/offsets/113259” and the 
> program started normally. I think there is a problem here: we should either 
> handle this exception or give some guidance in the log.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38206) Relax the requirement of data type comparison for keys in stream-stream join

2022-02-14 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-38206:


 Summary: Relax the requirement of data type comparison for keys in 
stream-stream join
 Key: SPARK-38206
 URL: https://issues.apache.org/jira/browse/SPARK-38206
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.2.1, 3.1.2, 3.3.0
Reporter: Jungtaek Lim


Currently, stream-stream join checks data type compatibility between the left 
keys and the right keys. It is a "strict" check, requiring the nullability to be 
the same on both sides. This leads to an assertion error if the optimizer turns 
some columns on one side from nullable to non-nullable without touching the 
opposite side.

If it is logically correct to relax the nullability check (while deciding the proper 
type for the output schema), we should do it to avoid any possible issue caused by 
optimization.
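Not part of the ticket itself, just a sketch of what a nullability-insensitive comparison of the join key types could look like, written against the public DataType/StructType API (the actual fix may be implemented differently inside Catalyst):
{code:java}
import org.apache.spark.sql.types._

// Compare two data types while ignoring nullability at every nesting level.
def sameTypeIgnoreNullability(a: DataType, b: DataType): Boolean = (a, b) match {
  case (ArrayType(ae, _), ArrayType(be, _)) =>
    sameTypeIgnoreNullability(ae, be)
  case (MapType(ak, av, _), MapType(bk, bv, _)) =>
    sameTypeIgnoreNullability(ak, bk) && sameTypeIgnoreNullability(av, bv)
  case (StructType(af), StructType(bf)) =>
    af.length == bf.length && af.zip(bf).forall { case (x, y) =>
      x.name == y.name && sameTypeIgnoreNullability(x.dataType, y.dataType)
    }
  case _ => a == b
}

// Key types that differ only in nullability should be accepted by the relaxed check.
val leftKeys  = StructType(Seq(StructField("id", LongType, nullable = true)))
val rightKeys = StructType(Seq(StructField("id", LongType, nullable = false)))
assert(sameTypeIgnoreNullability(leftKeys, rightKeys))
{code}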



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38205) The columns in state schema should be relaxed to be nullable

2022-02-14 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-38205:


 Summary: The columns in state schema should be relaxed to be 
nullable
 Key: SPARK-38205
 URL: https://issues.apache.org/jira/browse/SPARK-38205
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.2.1, 3.1.2, 3.3.0
Reporter: Jungtaek Lim


Starting from SPARK-27237, Spark validates the schema of state across query 
runs to make sure it doesn't run into harder-to-debug issues like a SIGSEGV at 
runtime.

The comparison logic is reasonable in terms of nullability; it follows the 
matrix below:
||existing schema||new schema||allowed||
|nullable|nullable|O|
|nullable|non-nullable|O|
|non-nullable|nullable|X|
|non-nullable|non-nullable|O|

What we miss here is that the nullability of a column can be changed by the 
optimizer (mostly nullable to non-nullable), and this nullability-related 
optimization can be applied differently after even simple changes.

So this scenario is hypothetically possible:

1. At the first run of the query, the optimizer marks some columns as non-nullable 
instead of nullable, and that goes into the schema of the state (the state schema 
now has a non-nullable column).
2. At the second run of the query (possibly after a code modification or a Spark 
version upgrade), the optimizer no longer marks those columns as non-nullable, so 
the state schema comparison (existing vs new) compares non-nullable (existing) vs 
nullable (new), which is NOT allowed.

From the storage point of view of the state store, it is not required to 
distinguish non-nullable columns from nullable ones. Interface-wise, the state 
store has no concept of schema, so it is safe to relax this constraint and let the 
optimizer do whatever it wants without breaking stateful operators.
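A sketch of the relaxation idea (not the actual patch): force every column to nullable on both sides before comparing, so the optimizer's nullability decisions can no longer fail the state schema check.
{code:java}
import org.apache.spark.sql.types._

// Recursively mark every field / array element / map value as nullable.
def relaxNullability(dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = relaxNullability(f.dataType), nullable = true)))
  case ArrayType(elementType, _) =>
    ArrayType(relaxNullability(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(relaxNullability(keyType), relaxNullability(valueType), valueContainsNull = true)
  case other => other
}

// Comparison over relaxed schemas: non-nullable (existing) vs nullable (new) no longer fails.
def stateSchemasCompatible(existing: StructType, incoming: StructType): Boolean =
  relaxNullability(existing) == relaxNullability(incoming)
{code}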



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38124) Revive HashClusteredDistribution and apply to stream-stream join

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491933#comment-17491933
 ] 

Apache Spark commented on SPARK-38124:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35512

> Revive HashClusteredDistribution and apply to stream-stream join
> 
>
> Key: SPARK-38124
> URL: https://issues.apache.org/jira/browse/SPARK-38124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Blocker
> Fix For: 3.3.0
>
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with 
> ClusteredDistribution.
> While this works great for non-stateful operators, we still need a separate 
> distribution requirement for stateful operators, because the requirement of 
> ClusteredDistribution is too relaxed while the requirement on physical 
> partitioning for stateful operators is quite strict.
> In most cases, stateful operators must require child distribution as 
> HashClusteredDistribution, with below major assumptions:
>  # HashClusteredDistribution creates HashPartitioning and we will never ever 
> change it for the future.
>  # We will never ever change the implementation of {{partitionIdExpression}} 
> in HashPartitioning for the future, so that Partitioner will behave 
> consistently across Spark versions.
>  # No partitioning except HashPartitioning can satisfy 
> HashClusteredDistribution.
>  
> We should revive HashClusteredDistribution (probably renamed to something 
> specific to stateful operators) and apply the distribution to all stateful 
> operators.
> SPARK-35703 only touched stream-stream join, which means stream-stream join 
> hasn't been broken in actual releases. Let's aim for a partial revert of 
> SPARK-35703 in this ticket, and have another ticket to deal with the other 
> stateful operators, which have been broken since their introduction (2.2+).
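To illustrate assumption #2 above, here is a stand-in sketch (scala.util.hashing instead of Catalyst's Murmur3Hash expression) of the contract that must stay frozen: the same grouping key must map to the same partition ID in every Spark version, or restored state ends up being read by the wrong task.
{code:java}
import scala.util.hashing.MurmurHash3

// Stand-in for HashPartitioning's partitionIdExpression: hash the keys, take a non-negative mod.
def partitionIdFor(groupingKeys: Seq[Any], numPartitions: Int): Int = {
  val hash = MurmurHash3.seqHash(groupingKeys) // illustrative hash, not Spark's actual one
  val mod = hash % numPartitions
  if (mod < 0) mod + numPartitions else mod
}

// If a future release changed the hash or the mod, keys written by an older release would be
// looked up in a different partition's state store and silently missed.
val pid = partitionIdFor(Seq("groupA", "groupB"), numPartitions = 200)
{code}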



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning

2022-02-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491927#comment-17491927
 ] 

Jungtaek Lim commented on SPARK-38204:
--

Since this is not a regression, we could change the priority to critical if we 
can't make it in the release period of Spark 3.3.

> All state operators are at a risk of inconsistency between state partitioning 
> and operator partitioning
> ---
>
> Key: SPARK-38204
> URL: https://issues.apache.org/jira/browse/SPARK-38204
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness
>
> Except stream-stream join, all stateful operators use ClusteredDistribution 
> as a requirement of child distribution.
> ClusteredDistribution is a very relaxed one - any output partitioning can 
> satisfy the distribution as long as it ensures that all tuples having the 
> same grouping keys are placed in the same partition.
> To illustrate with an example, suppose we do streaming aggregation like the code below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> In the code, streaming aggregation operator will be involved in physical 
> plan, which would have ClusteredDistribution("group1", "group2", "window").
> The problem is, various output partitionings can satisfy this distribution:
>  * RangePartitioning
>  ** This accepts the exact grouping keys or any subset of them, in any order of keys 
> (combination), with any sort order (asc/desc)
>  * HashPartitioning
>  ** This accepts the exact grouping keys or any subset of them, in any order of keys 
> (combination)
>  * (upcoming Spark 3.3.0+) DataSourcePartitioning
>  ** output partitioning provided by a data source will be able to satisfy 
> ClusteredDistribution, which will make things worse (assuming data sources can 
> provide different output partitionings relatively easily)
> e.g. even if we only consider HashPartitioning: HashPartitioning("group1"), 
> HashPartitioning("group2"), HashPartitioning("group1", "group2"), 
> HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", 
> "window"), etc.
> The requirement of state partitioning is much more strict, since we should 
> not change the partitioning once it is partitioned and built. *It should 
> ensure that all tuples having the same grouping keys are placed in the same 
> partition (same partition ID) across the query lifetime.*
> *The mismatch between the distribution requirement of ClusteredDistribution 
> and state partitioning leads to correctness issues silently.*
> For example, let's assume we have a streaming query like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group2")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group2") satisfies ClusteredDistribution("group1", "group2", 
> "window"), so Spark won't introduce additional shuffle there, and state 
> partitioning would be HashPartitioning("group2").
> We run this query for a while, stop the query, and change the manual 
> partitioning like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group1")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group1") also satisfies ClusteredDistribution("group1", 
> "group2", "window"), so Spark won't introduce additional shuffle there. That 
> said, child output partitioning of streaming aggregation operator would be 
> HashPartitioning("group1"), whereas state partitioning is 
> HashPartitioning("group2").
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query]
> In the SS guide doc we enumerate the unsupported modifications of a query 
> during the lifetime of a streaming query, but there is no mention of this.
> Making this worse, Spark doesn't store any information about state partitioning 
> (that is, there is no way to validate it), so *Spark simply allows this change 
> and silently introduces a correctness issue while the streaming query runs as 
> if nothing were wrong.* The only way to detect the problem is from the result 
> of the query.
> We have no idea whether end users already suffer from this in their queries 
> or not. *The only way to check is to list all state rows, apply the hash 
> function with the expected grouping keys, and confirm that every row maps to 
> the exact partition ID it is stored in.* If it turns out to be broken, we will 
> have to have a tool to “re”partition the state correctly, or in the worst case, 
> ask users to throw out the checkpoint and reprocess.

[jira] [Created] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning

2022-02-14 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-38204:


 Summary: All state operators are at a risk of inconsistency 
between state partitioning and operator partitioning
 Key: SPARK-38204
 URL: https://issues.apache.org/jira/browse/SPARK-38204
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.2.1, 3.1.2, 3.0.3, 2.4.8, 2.3.4, 2.2.3, 3.3.0
Reporter: Jungtaek Lim


Except stream-stream join, all stateful operators use ClusteredDistribution as 
a requirement of child distribution.

ClusteredDistribution is a very relaxed one - any output partitioning can satisfy 
the distribution as long as it ensures that all tuples having the same grouping 
keys are placed in the same partition.

To illustrate with an example, suppose we do streaming aggregation like the code below:
{code:java}
df
  .withWatermark("timestamp", "30 minutes")
  .groupBy("group1", "group2", window("timestamp", "10 minutes"))
  .agg(count("*")) {code}
In the code, streaming aggregation operator will be involved in physical plan, 
which would have ClusteredDistribution("group1", "group2", "window").

The problem is, various output partitionings can satisfy this distribution:
 * RangePartitioning

 ** This accepts the exact grouping keys or any subset of them, in any order of keys 
(combination), with any sort order (asc/desc)

 * HashPartitioning

 ** This accepts the exact grouping keys or any subset of them, in any order of keys 
(combination)

 * (upcoming Spark 3.3.0+) DataSourcePartitioning

 ** output partitioning provided by a data source will be able to satisfy 
ClusteredDistribution, which will make things worse (assuming data sources can 
provide different output partitionings relatively easily)

e.g. even if we only consider HashPartitioning: HashPartitioning("group1"), 
HashPartitioning("group2"), HashPartitioning("group1", "group2"), 
HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", 
"window"), etc.

The requirement of state partitioning is much more strict, since we should not 
change the partitioning once it is partitioned and built. *It should ensure 
that all tuples having the same grouping keys are placed in the same partition 
(same partition ID) across the query lifetime.*

*The mismatch between the distribution requirement of ClusteredDistribution and 
state partitioning leads to correctness issues silently.*

For example, let's assume we have a streaming query like below:
{code:java}
df
  .withWatermark("timestamp", "30 minutes")
  .repartition("group2")
  .groupBy("group1", "group2", window("timestamp", "10 minutes"))
  .agg(count("*")) {code}
repartition("group2") satisfies ClusteredDistribution("group1", "group2", 
"window"), so Spark won't introduce additional shuffle there, and state 
partitioning would be HashPartitioning("group2").

We run this query for a while, stop the query, and change the manual 
partitioning like below:
{code:java}
df
  .withWatermark("timestamp", "30 minutes")
  .repartition("group1")
  .groupBy("group1", "group2", window("timestamp", "10 minutes"))
  .agg(count("*")) {code}
repartition("group1") also satisfies ClusteredDistribution("group1", "group2", 
"window"), so Spark won't introduce additional shuffle there. That said, child 
output partitioning of streaming aggregation operator would be 
HashPartitioning("group1"), whereas state partitioning is 
HashPartitioning("group2").

[https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query]

In the SS guide doc we enumerate the unsupported modifications of a query during 
the lifetime of a streaming query, but there is no mention of this.

Making this worse, Spark doesn't store any information about state partitioning 
(that is, there is no way to validate it), so *Spark simply allows this change 
and silently introduces a correctness issue while the streaming query runs as if 
nothing were wrong.* The only way to detect the problem is from the result of the 
query.

We have no idea whether end users already suffer from this in their queries or 
not. *The only way to check is to list all state rows, apply the hash function 
with the expected grouping keys, and confirm that every row maps to the exact 
partition ID it is stored in.* If it turns out to be broken, we will have to 
have a tool to “re”partition the state correctly, or in the worst case, ask 
users to throw out the checkpoint and reprocess.

{*}This issue has existed since the introduction of stateful operators (Spark 
2.2+){*}: HashClusteredDistribution (the strict requirement) was introduced in 
Spark 2.3, but we didn't change the stateful operators to use that distribution. 
Fortunately, stream-stream join has used HashClusteredDistribution since Spark 2.3, 
so it seems to be safe.
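To make the scenario concrete, a hypothetical repro sketch (rate source, console sink, local checkpoint path); the two runs differ only in the repartition column, which is exactly the change that silently breaks state partitioning:
{code:java}
import org.apache.spark.sql.functions._

val checkpoint = "/tmp/ckpt-spark-38204"  // hypothetical checkpoint location

// Run 1: repartition by group2; the state is built with HashPartitioning("group2").
val query = spark.readStream.format("rate").load()
  .withColumn("group1", col("value") % 10)
  .withColumn("group2", col("value") % 7)
  .withWatermark("timestamp", "30 minutes")
  .repartition(col("group2"))
  .groupBy(col("group1"), col("group2"), window(col("timestamp"), "10 minutes"))
  .agg(count("*"))
  .writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", checkpoint)
  .start()

// Run 2 (after stopping the query): change only .repartition(col("group2")) to
// .repartition(col("group1")). Both satisfy ClusteredDistribution("group1", "group2",
// "window"), so no extra shuffle is added, yet the restored state is still partitioned
// by group2 -- the silent mismatch described above.
{code}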



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning

2022-02-14 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-38204:
-
Labels: correctness  (was: )

> All state operators are at a risk of inconsistency between state partitioning 
> and operator partitioning
> ---
>
> Key: SPARK-38204
> URL: https://issues.apache.org/jira/browse/SPARK-38204
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness
>
> Except stream-stream join, all stateful operators use ClusteredDistribution 
> as a requirement of child distribution.
> ClusteredDistribution is a very relaxed one - any output partitioning can 
> satisfy the distribution as long as it ensures that all tuples having the 
> same grouping keys are placed in the same partition.
> To illustrate with an example, suppose we do streaming aggregation like the code below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> In the code, streaming aggregation operator will be involved in physical 
> plan, which would have ClusteredDistribution("group1", "group2", "window").
> The problem is, various output partitionings can satisfy this distribution:
>  * RangePartitioning
>  ** This accepts the exact grouping keys or any subset of them, in any order of keys 
> (combination), with any sort order (asc/desc)
>  * HashPartitioning
>  ** This accepts the exact grouping keys or any subset of them, in any order of keys 
> (combination)
>  * (upcoming Spark 3.3.0+) DataSourcePartitioning
>  ** output partitioning provided by a data source will be able to satisfy 
> ClusteredDistribution, which will make things worse (assuming data sources can 
> provide different output partitionings relatively easily)
> e.g. even if we only consider HashPartitioning: HashPartitioning("group1"), 
> HashPartitioning("group2"), HashPartitioning("group1", "group2"), 
> HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", 
> "window"), etc.
> The requirement of state partitioning is much more strict, since we should 
> not change the partitioning once it is partitioned and built. *It should 
> ensure that all tuples having the same grouping keys are placed in the same 
> partition (same partition ID) across the query lifetime.*
> *The mismatch between the distribution requirement of ClusteredDistribution 
> and state partitioning leads to correctness issues silently.*
> For example, let's assume we have a streaming query like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group2")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group2") satisfies ClusteredDistribution("group1", "group2", 
> "window"), so Spark won't introduce additional shuffle there, and state 
> partitioning would be HashPartitioning("group2").
> We run this query for a while, stop the query, and change the manual 
> partitioning like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group1")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group1") also satisfies ClusteredDistribution("group1", 
> "group2", "window"), so Spark won't introduce additional shuffle there. That 
> said, child output partitioning of streaming aggregation operator would be 
> HashPartitioning("group1"), whereas state partitioning is 
> HashPartitioning("group2").
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query]
> In the SS guide doc we enumerate the unsupported modifications of a query 
> during the lifetime of a streaming query, but there is no mention of this.
> Making this worse, Spark doesn't store any information about state partitioning 
> (that is, there is no way to validate it), so *Spark simply allows this change 
> and silently introduces a correctness issue while the streaming query runs as 
> if nothing were wrong.* The only way to detect the problem is from the result 
> of the query.
> We have no idea whether end users already suffer from this in their queries 
> or not. *The only way to check is to list all state rows, apply the hash 
> function with the expected grouping keys, and confirm that every row maps to 
> the exact partition ID it is stored in.* If it turns out to be broken, we will 
> have to have a tool to “re”partition the state correctly, or in the worst case, 
> ask users to throw out the checkpoint and reprocess.
> {*}This issue has existed since the introduction of stateful operators (Spark 
> 2.2+){*}: HashClusteredDistribution (the strict requirement) was introduced in 
> Spark 2.3, but we didn't change the stateful operators to use that distribution. 
> Fortunately, stream-stream join has used HashClusteredDistribution since Spark 
> 2.3, so it seems to be safe.

[jira] [Assigned] (SPARK-38198) Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when `explainMode` is `CodegenMode`

2022-02-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38198:


Assignee: Yang Jie

> Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when 
> `explainMode` is `CodegenMode`
> ---
>
> Key: SPARK-38198
> URL: https://issues.apache.org/jira/browse/SPARK-38198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> The `QueryExecution.debug#toFile` method supports passing in `maxFields`, and this 
> parameter is passed down when `explainMode` is SimpleMode, ExtendedMode, 
> or CostMode, but the passed-down `maxFields` was ignored because 
> `QueryExecution#stringWithStats` currently overrides it with 
> `SQLConf.get.maxToStringFields`.
>  
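For context, a usage sketch of the affected API; `maxFields` is a real parameter per the description above, but the exact `toFile` signature may differ between Spark versions:
{code:java}
val df = spark.range(0, 1000).selectExpr("id", "id * 2 AS doubled")

// maxFields limits how many fields are rendered per node in the dumped plan; per this
// ticket the passed value was overridden by SQLConf.get.maxToStringFields in some modes.
df.queryExecution.debug.toFile("/tmp/spark-38198-plan.txt", maxFields = 10)
{code}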



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38198) Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when `explainMode` is `CodegenMode`

2022-02-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38198.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35506
[https://github.com/apache/spark/pull/35506]

> Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when 
> `explainMode` is `CodegenMode`
> ---
>
> Key: SPARK-38198
> URL: https://issues.apache.org/jira/browse/SPARK-38198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> The `QueryExecution.debug#toFile` method supports passing in `maxFields`, and this 
> parameter is passed down when `explainMode` is SimpleMode, ExtendedMode, 
> or CostMode, but the passed-down `maxFields` was ignored because 
> `QueryExecution#stringWithStats` currently overrides it with 
> `SQLConf.get.maxToStringFields`.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38203) Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491920#comment-17491920
 ] 

Apache Spark commented on SPARK-38203:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35511

> Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode
> -
>
> Key: SPARK-38203
> URL: https://issues.apache.org/jira/browse/SPARK-38203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38203) Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38203:


Assignee: Gengliang Wang  (was: Apache Spark)

> Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode
> -
>
> Key: SPARK-38203
> URL: https://issues.apache.org/jira/browse/SPARK-38203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38203) Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38203:


Assignee: Apache Spark  (was: Gengliang Wang)

> Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode
> -
>
> Key: SPARK-38203
> URL: https://issues.apache.org/jira/browse/SPARK-38203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38203) Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491918#comment-17491918
 ] 

Apache Spark commented on SPARK-38203:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35511

> Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode
> -
>
> Key: SPARK-38203
> URL: https://issues.apache.org/jira/browse/SPARK-38203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38203) Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode

2022-02-14 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-38203:
--

 Summary: Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI 
mode
 Key: SPARK-38203
 URL: https://issues.apache.org/jira/browse/SPARK-38203
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38154) Set up a new GA job to run tests with ANSI mode

2022-02-14 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-38154:
---
Epic Link: SPARK-35030

> Set up a new GA job to run tests with ANSI mode
> ---
>
> Key: SPARK-38154
> URL: https://issues.apache.org/jira/browse/SPARK-38154
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28329) SELECT INTO syntax

2022-02-14 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491884#comment-17491884
 ] 

gabrywu commented on SPARK-28329:
-

[~smilegator] Is there a plan to support that, i.e. SELECT INTO a scalar variable? I 
think it's useful for optimizing SQL like this:
{code:SQL}
select max(id) into ${max_id} from db.tableA;
select * from db.tableB where id >= ${max_id};
{code}
It's better than the following SQL, because it can push down the filter id >= 
${max_id}:
{code:SQL}
select * from db.tableB where id >= (select max(id) from db.tableA);
{code}
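For comparison, a workaround sketch of what is possible today without SELECT INTO: compute the scalar on the driver and inline it as a literal, which keeps the filter eligible for pushdown (table names as in the comment above, and `id` assumed to be BIGINT):
{code:java}
// Compute the scalar once on the driver (assumes db.tableA is non-empty and id is BIGINT).
val maxId = spark.sql("SELECT max(id) AS max_id FROM db.tableA").head().getLong(0)

// Inline it as a literal so the filter id >= <literal> can be pushed down.
val result = spark.sql(s"SELECT * FROM db.tableB WHERE id >= $maxId")
{code}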

> SELECT INTO syntax
> --
>
> Key: SPARK-28329
> URL: https://issues.apache.org/jira/browse/SPARK-28329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> h2. Synopsis
> {noformat}
> [ WITH [ RECURSIVE ] with_query [, ...] ]
> SELECT [ ALL | DISTINCT [ ON ( expression [, ...] ) ] ]
> * | expression [ [ AS ] output_name ] [, ...]
> INTO [ TEMPORARY | TEMP | UNLOGGED ] [ TABLE ] new_table
> [ FROM from_item [, ...] ]
> [ WHERE condition ]
> [ GROUP BY expression [, ...] ]
> [ HAVING condition [, ...] ]
> [ WINDOW window_name AS ( window_definition ) [, ...] ]
> [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
> [ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | 
> LAST } ] [, ...] ]
> [ LIMIT { count | ALL } ]
> [ OFFSET start [ ROW | ROWS ] ]
> [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
> [ FOR { UPDATE | SHARE } [ OF table_name [, ...] ] [ NOWAIT ] [...] ]
> {noformat}
> h2. Description
> {{SELECT INTO}} creates a new table and fills it with data computed by a 
> query. The data is not returned to the client, as it is with a normal 
> {{SELECT}}. The new table's columns have the names and data types associated 
> with the output columns of the {{SELECT}}.
>  
> {{CREATE TABLE AS}} offers a superset of the functionality offered by 
> {{SELECT INTO}}.
> [https://www.postgresql.org/docs/11/sql-selectinto.html]
>  [https://www.postgresql.org/docs/11/sql-createtableas.html]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38202) Invalid URL in SparkContext.addedJars will constantly fails Executor.run()

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38202:


Assignee: Apache Spark

> Invalid URL in SparkContext.addedJars will constantly fails Executor.run()
> --
>
> Key: SPARK-38202
> URL: https://issues.apache.org/jira/browse/SPARK-38202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Bo Zhang
>Assignee: Apache Spark
>Priority: Major
>
> When an invalid URL is used in SparkContext.addJar(), all subsequent query 
> executions will fail since downloading the jar is in the critical path of 
> Executor.run(), even when the query has nothing to do with the jar. 
> A simple reproduction of the issue:
> {code:java}
> sc.addJar("http://invalid/library.jar")
> (0 to 1).toDF.count
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38202) Invalid URL in SparkContext.addedJars will constantly fails Executor.run()

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38202:


Assignee: (was: Apache Spark)

> Invalid URL in SparkContext.addedJars will constantly fails Executor.run()
> --
>
> Key: SPARK-38202
> URL: https://issues.apache.org/jira/browse/SPARK-38202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Bo Zhang
>Priority: Major
>
> When an invalid URL is used in SparkContext.addJar(), all subsequent query 
> executions will fail since downloading the jar is in the critical path of 
> Executor.run(), even when the query has nothing to do with the jar. 
> A simple reproduction of the issue:
> {code:java}
> sc.addJar("http://invalid/library.jar")
> (0 to 1).toDF.count
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38202) Invalid URL in SparkContext.addedJars will constantly fails Executor.run()

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491873#comment-17491873
 ] 

Apache Spark commented on SPARK-38202:
--

User 'bozhang2820' has created a pull request for this issue:
https://github.com/apache/spark/pull/35510

> Invalid URL in SparkContext.addedJars will constantly fails Executor.run()
> --
>
> Key: SPARK-38202
> URL: https://issues.apache.org/jira/browse/SPARK-38202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Bo Zhang
>Priority: Major
>
> When an invalid URL is used in SparkContext.addJar(), all subsequent query 
> executions will fail since downloading the jar is in the critical path of 
> Executor.run(), even when the query has nothing to do with the jar. 
> A simple reproduction of the issue:
> {code:java}
> sc.addJar("http://invalid/library.jar")
> (0 to 1).toDF.count
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38201) Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and `overwrite`

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491867#comment-17491867
 ] 

Apache Spark commented on SPARK-38201:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35509

> Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and 
> `overwrite`
> -
>
> Key: SPARK-38201
> URL: https://issues.apache.org/jira/browse/SPARK-38201
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Trivial
>
> KubernetesUtils#uploadFileToHadoopCompatibleFS defines the input parameters 
> `delSrc` and `overwrite`, but the constants (false and true) are used when 
> calling the `FileSystem.copyFromLocalFile(boolean delSrc, boolean overwrite, 
> Path src, Path dst)` method.
>  
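A sketch of the fix being described (method shape simplified from the report, not copied from the Spark source):
{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}

def uploadFileToHadoopCompatibleFS(
    src: Path,
    dest: Path,
    fs: FileSystem,
    delSrc: Boolean,
    overwrite: Boolean): Unit = {
  // Behaviour per the report: hard-coded constants, so the parameters are ignored.
  // fs.copyFromLocalFile(false, true, src, dest)

  // Intended behaviour: pass the caller's flags through.
  fs.copyFromLocalFile(delSrc, overwrite, src, dest)
}
{code}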



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22605) OutputMetrics empty for DataFrame writes

2022-02-14 Thread zoli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491869#comment-17491869
 ] 

zoli commented on SPARK-22605:
--

Not solved, still an issue in Spark v3.1.2 for structured streaming.

> OutputMetrics empty for DataFrame writes
> 
>
> Key: SPARK-22605
> URL: https://issues.apache.org/jira/browse/SPARK-22605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jason White
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 2.3.0
>
>
> I am trying to use the SparkListener interface to hook up some custom 
> monitoring for some of our critical jobs. Among the first metrics I would 
> like is an output row count & size metric. I'm using PySpark and the Py4J 
> interface to implement the listener.
> I am able to see the recordsRead and bytesRead metrics via the 
> taskEnd.taskMetrics().inputMetrics().recordsRead() and .bytesRead() methods. 
> taskEnd.taskMetrics().outputMetrics().recordsWritten() and .bytesWritten() 
> are always 0. I see similar output if I use the stageCompleted event instead.
> To trigger execution, I am using df.write.parquet(path). If I use 
> df.rdd.saveAsTextFile(path) instead, the counts and bytes are correct.
> Another clue that this bug is deeper in Spark SQL is that the Spark 
> Application Master doesn't show the Output Size / Records column with 
> df.write.parquet or df.write.text, but does with df.rdd.saveAsTextFile. Since 
> the Spark Application Master also gets its output via the Listener interface, 
> this would seem related.
> There is a related PR: https://issues.apache.org/jira/browse/SPARK-21882, but 
> I believe this to be a distinct issue.
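For reference, a minimal Scala sketch of the listener hookup described above; the accessors are the standard SparkListener API, and whether outputMetrics is actually populated for DataFrame writes is exactly what this ticket is about:
{code:java}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class OutputMetricsListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      // inputMetrics are reported as expected; per this report, outputMetrics stay at 0
      // for df.write.parquet/text, but are correct for rdd.saveAsTextFile.
      println(s"read=${metrics.inputMetrics.recordsRead} " +
        s"written=${metrics.outputMetrics.recordsWritten}")
    }
  }
}

spark.sparkContext.addSparkListener(new OutputMetricsListener)
{code}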



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38202) Invalid URL in SparkContext.addedJars will constantly fails Executor.run()

2022-02-14 Thread Bo Zhang (Jira)
Bo Zhang created SPARK-38202:


 Summary: Invalid URL in SparkContext.addedJars will constantly 
fails Executor.run()
 Key: SPARK-38202
 URL: https://issues.apache.org/jira/browse/SPARK-38202
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.1
Reporter: Bo Zhang


When an invalid URL is used in SparkContext.addJar(), all subsequent query 
executions will fail since downloading the jar is in the critical path of 
Executor.run(), even when the query has nothing to do with the jar. 

A simple reproduction of the issue:

{code:java}
sc.addJar("http://invalid/library.jar")
(0 to 1).toDF.count
{code}




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38201) Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and `overwrite`

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38201:


Assignee: (was: Apache Spark)

> Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and 
> `overwrite`
> -
>
> Key: SPARK-38201
> URL: https://issues.apache.org/jira/browse/SPARK-38201
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Trivial
>
> KubernetesUtils#uploadFileToHadoopCompatibleFS defines the input parameters 
> `delSrc` and `overwrite`, but the constants (false and true) are used when 
> calling the `FileSystem.copyFromLocalFile(boolean delSrc, boolean overwrite, 
> Path src, Path dst)` method.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38201) Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and `overwrite`

2022-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38201:


Assignee: Apache Spark

> Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and 
> `overwrite`
> -
>
> Key: SPARK-38201
> URL: https://issues.apache.org/jira/browse/SPARK-38201
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Trivial
>
> KubernetesUtils#uploadFileToHadoopCompatibleFS defines the input parameters 
> `delSrc` and `overwrite`, but the constants (false and true) are used when 
> calling the `FileSystem.copyFromLocalFile(boolean delSrc, boolean overwrite, 
> Path src, Path dst)` method.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38201) Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and `overwrite`

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491866#comment-17491866
 ] 

Apache Spark commented on SPARK-38201:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35509

> Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and 
> `overwrite`
> -
>
> Key: SPARK-38201
> URL: https://issues.apache.org/jira/browse/SPARK-38201
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Trivial
>
> KubernetesUtils#uploadFileToHadoopCompatibleFS defines the input parameters `
> delSrc` and `overwrite`,  but constants(false and true) are used when call `
> FileSystem.copyFromLocalFile(boolean delSrc, boolean overwrite, Path src, 
> Path dst) ` method.
> `
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38201) Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and `overwrite`

2022-02-14 Thread Yang Jie (Jira)
Yang Jie created SPARK-38201:


 Summary: Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use 
passed in `delSrc` and `overwrite`
 Key: SPARK-38201
 URL: https://issues.apache.org/jira/browse/SPARK-38201
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Yang Jie


KubernetesUtils#uploadFileToHadoopCompatibleFS defines the input parameters 
`delSrc` and `overwrite`, but the constants (false and true) are used when 
calling the `FileSystem.copyFromLocalFile(boolean delSrc, boolean overwrite, 
Path src, Path dst)` method.

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491859#comment-17491859
 ] 

Apache Spark commented on SPARK-36488:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35508

> "Invalid usage of '*' in expression" error due to the feature of 
> 'quotedRegexColumnNames' in some scenarios.
> 
>
> Key: SPARK-36488
> URL: https://issues.apache.org/jira/browse/SPARK-36488
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.8, 3.1.2
>Reporter: merrily01
>Assignee: Pablo Langa Blanco
>Priority: Major
> Fix For: 3.3.0
>
>
>  In some cases, the error happens when the following property is set.
> {code:java}
> spark.sql("set spark.sql.parser.quotedRegexColumnNames=true")
> {code}
> *case 1:* 
> {code:java}
> spark-sql> create table tb_test as select 1 as col_a, 2 as col_b;
> spark-sql> select `tb_test`.`col_a`  from tb_test;
> 1
> spark-sql> set spark.sql.parser.quotedRegexColumnNames=true;
> spark-sql> select `tb_test`.`col_a`  from tb_test;
> Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue'
> {code}
>  
> *case 2:*
> {code:java}
>          > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` ,  
> 3.14 as `col_b`);
> 0.955414
> spark-sql> set spark.sql.parser.quotedRegexColumnNames=true;
> spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` ,  
> 3.14 as `col_b`);
> Error in query: Invalid usage of '*' in expression 'divide'
> {code}
>  
> This problem exists in 3.X, 2.4.X and master versions. 
>  
> Related issue : 
> https://issues.apache.org/jira/browse/SPARK-12139
> (As can be seen in the latest comments, some people have encountered the same 
> problem)
>  
> Similar problems:
> https://issues.apache.org/jira/browse/SPARK-28897
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.

2022-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491858#comment-17491858
 ] 

Apache Spark commented on SPARK-36488:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35508

> "Invalid usage of '*' in expression" error due to the feature of 
> 'quotedRegexColumnNames' in some scenarios.
> 
>
> Key: SPARK-36488
> URL: https://issues.apache.org/jira/browse/SPARK-36488
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.8, 3.1.2
>Reporter: merrily01
>Assignee: Pablo Langa Blanco
>Priority: Major
> Fix For: 3.3.0
>
>
>  In some cases, the error happens when the following property is set.
> {code:java}
> spark.sql("set spark.sql.parser.quotedRegexColumnNames=true")
> {code}
> *case 1:* 
> {code:java}
> spark-sql> create table tb_test as select 1 as col_a, 2 as col_b;
> spark-sql> select `tb_test`.`col_a`  from tb_test;
> 1
> spark-sql> set spark.sql.parser.quotedRegexColumnNames=true;
> spark-sql> select `tb_test`.`col_a`  from tb_test;
> Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue'
> {code}
>  
> *case 2:*
> {code:java}
>          > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` ,  
> 3.14 as `col_b`);
> 0.955414
> spark-sql> set spark.sql.parser.quotedRegexColumnNames=true;
> spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` ,  
> 3.14 as `col_b`);
> Error in query: Invalid usage of '*' in expression 'divide'
> {code}
>  
> This problem exists in 3.X, 2.4.X and master versions. 
>  
> Related issue : 
> https://issues.apache.org/jira/browse/SPARK-12139
> (As can be seen in the latest comments, some people have encountered the same 
> problem)
>  
> Similar problems:
> https://issues.apache.org/jira/browse/SPARK-28897
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38195) Add the TIMESTAMPADD() function

2022-02-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-38195:
-
Summary: Add the TIMESTAMPADD() function  (was: Add the timestampadd() 
function)

> Add the TIMESTAMPADD() function
> ---
>
> Key: SPARK-38195
> URL: https://issues.apache.org/jira/browse/SPARK-38195
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The function TIMESTAMPADD() is part of the ODBC API and is implemented by 
> virtually ALL other databases, but it is missing in Spark SQL.
> The first argument is a unary interval (HOUR, YEAR, DAY, etc.).
> {code:sql}
> TIMESTAMPADD( SECOND, 2 , timestamp '2021-12-12 12:00:00.00') returns 
> 2021-12-12 12:00:02.00
> {code}
> The ODBC syntax {fn } requires the interval value to have a prefix of 
> sql_tsi_ while plain SQL doesn't. See
> * [Time, Date, and Interval 
> Functions|https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/time-date-and-interval-functions?view=sql-server-ver15]
> * [TIMESTAMPADD function (ODBC 
> compatible)|https://docs.faircom.com/doc/sqlref/33476.htm]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38200) [SQL] Spark JDBC Savemode Supports replace

2022-02-14 Thread melin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491853#comment-17491853
 ] 

melin commented on SPARK-38200:
---

[~beliefer] 

> [SQL] Spark JDBC Savemode Supports replace
> --
>
> Key: SPARK-38200
> URL: https://issues.apache.org/jira/browse/SPARK-38200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: melin
>Priority: Major
>
> When writing data into a relational database, data duplication needs to be 
> considered. Both mysql and postgres support upsert syntax.
> {code:java}
> INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) 
> ON CONFLICT (id,name) 
> DO UPDATE SET 
> id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
>  
> replace into t(id, update_time) values(1, now()); {code}
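Until such a save mode exists, a common workaround is a manual upsert in foreachPartition; a sketch using plain JDBC and the PostgreSQL statement from the description (connection details, table and column names are placeholders, and `df` is assumed to have matching columns):
{code:java}
import java.sql.DriverManager
import org.apache.spark.sql.Row

val upsertSql =
  """INSERT INTO target_table (id, name, data_time, remark) VALUES (?, ?, ?, ?)
    |ON CONFLICT (id, name)
    |DO UPDATE SET data_time = excluded.data_time, remark = excluded.remark""".stripMargin

df.foreachPartition { (rows: Iterator[Row]) =>
  // One connection per partition; placeholders for the JDBC URL and credentials.
  val conn = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "pass")
  try {
    val stmt = conn.prepareStatement(upsertSql)
    rows.foreach { r =>
      stmt.setLong(1, r.getAs[Long]("id"))
      stmt.setString(2, r.getAs[String]("name"))
      stmt.setTimestamp(3, r.getAs[java.sql.Timestamp]("data_time"))
      stmt.setString(4, r.getAs[String]("remark"))
      stmt.addBatch()
    }
    stmt.executeBatch()
  } finally {
    conn.close()
  }
}
{code}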



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38200) [SQL] Spark JDBC Savemode Supports replace

2022-02-14 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-38200:
--
Description: 
When writing data into a relational database, data duplication needs to be 
considered. Both mysql and postgres support upsert syntax.
{code:java}
INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) 
ON CONFLICT (id,name) 
DO UPDATE SET 
id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
 
replace into t(id, update_time) values(1, now()); {code}

  was:
When writing data into a relational database, data duplication needs to be 
considered. Both mysql and postgres support upsert syntax.

```sql

INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) 
ON CONFLICT (id,name) 
DO UPDATE SET 
id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark

 

replace into t(id, update_time) values(1, now());

```


> [SQL] Spark JDBC Savemode Supports replace
> --
>
> Key: SPARK-38200
> URL: https://issues.apache.org/jira/browse/SPARK-38200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: melin
>Priority: Major
>
> When writing data into a relational database, data duplication needs to be 
> considered. Both mysql and postgres support upsert syntax.
> {code:java}
> INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) 
> ON CONFLICT (id,name) 
> DO UPDATE SET 
> id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
>  
> replace into t(id, update_time) values(1, now()); {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38200) [SQL] Spark JDBC Savemode Supports replace

2022-02-14 Thread melin (Jira)
melin created SPARK-38200:
-

 Summary: [SQL] Spark JDBC Savemode Supports replace
 Key: SPARK-38200
 URL: https://issues.apache.org/jira/browse/SPARK-38200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: melin


When writing data into a relational database, data duplication needs to be 
considered. Both mysql and postgres support upsert syntax.

```sql

INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) 
ON CONFLICT (id,name) 
DO UPDATE SET 
id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark

 

replace into t(id, update_time) values(1, now());

```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


