[jira] [Assigned] (SPARK-38345) Introduce SQL function ARRAY_SIZE

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38345:


Assignee: (was: Apache Spark)

> Introduce SQL function ARRAY_SIZE
> -
>
> Key: SPARK-38345
> URL: https://issues.apache.org/jira/browse/SPARK-38345
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Counting elements within an array is a common use case. ARRAY_SIZE ensures 
> the input is an array and then returns its size.
> Other DBMSs like Snowflake support that as well: 
> https://docs.snowflake.com/en/sql-reference/functions/array_size.html. 
> Implementing it improves compatibility with those DBMSs and makes migration easier.
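For reference, a minimal sketch of how the proposed function could be used, assuming Snowflake-like semantics (the function had not been merged at the time of this thread, so the name resolution and behavior below are assumptions, not the final implementation):

{code:java}
// Hypothetical usage of the proposed ARRAY_SIZE, e.g. in spark-shell
// where `spark` is the active SparkSession.
spark.sql("SELECT array_size(array(1, 2, 3))").show()
// expected output: 3
// Unlike the existing size(), which also accepts maps, ARRAY_SIZE would
// accept only array input (assumption based on the Snowflake function
// linked above).
{code}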






[jira] [Commented] (SPARK-38345) Introduce SQL function ARRAY_SIZE

2022-02-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498741#comment-17498741
 ] 

Apache Spark commented on SPARK-38345:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/35671

> Introduce SQL function ARRAY_SIZE
> -
>
> Key: SPARK-38345
> URL: https://issues.apache.org/jira/browse/SPARK-38345
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Counting elements within an array is a common use case. ARRAY_SIZE ensures 
> the input is an array and then returns its size.
> Other DBMSs like Snowflake support that as well: 
> https://docs.snowflake.com/en/sql-reference/functions/array_size.html. 
> Implementing it improves compatibility with those DBMSs and makes migration easier.






[jira] [Assigned] (SPARK-38345) Introduce SQL function ARRAY_SIZE

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38345:


Assignee: Apache Spark

> Introduce SQL function ARRAY_SIZE
> -
>
> Key: SPARK-38345
> URL: https://issues.apache.org/jira/browse/SPARK-38345
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Counting elements within an array is a common use case. ARRAY_SIZE ensures 
> the input is an array and then returns its size.
> Other DBMSs like Snowflake support that as well: 
> https://docs.snowflake.com/en/sql-reference/functions/array_size.html. 
> Implementing it improves compatibility with those DBMSs and makes migration easier.






[jira] [Updated] (SPARK-38345) Introduce SQL function ARRAY_SIZE

2022-02-27 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38345:
-
Description: 
Counting elements within an array is a common use case. ARRAY_SIZE ensures the 
input is an array and then returns its size.

Other DBMSs like Snowflake support that as well: 
https://docs.snowflake.com/en/sql-reference/functions/array_size.html. 
Implementing it improves compatibility with those DBMSs and makes migration easier.

  was:
Counting elements within an array is a common use case. Other DBMSs like 
Snowflake support that as well:

https://docs.snowflake.com/en/sql-reference/functions/array_size.html



> Introduce SQL function ARRAY_SIZE
> -
>
> Key: SPARK-38345
> URL: https://issues.apache.org/jira/browse/SPARK-38345
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Counting elements within an array is a common use case. ARRAY_SIZE ensures 
> the input is an array and then returns its size.
> Other DBMSs like Snowflake support that as well: 
> https://docs.snowflake.com/en/sql-reference/functions/array_size.html. 
> Implementing it improves compatibility with those DBMSs and makes migration easier.






[jira] [Created] (SPARK-38345) Introduce SQL function ARRAY_SIZE

2022-02-27 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38345:


 Summary: Introduce SQL function ARRAY_SIZE
 Key: SPARK-38345
 URL: https://issues.apache.org/jira/browse/SPARK-38345
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Counting elements within an array is a common use case. Other DBMSs like 
Snowflake support that as well:

https://docs.snowflake.com/en/sql-reference/functions/array_size.html







[jira] [Commented] (SPARK-38345) Introduce SQL function ARRAY_SIZE

2022-02-27 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498729#comment-17498729
 ] 

Xinrong Meng commented on SPARK-38345:
--

I am working on that.

> Introduce SQL function ARRAY_SIZE
> -
>
> Key: SPARK-38345
> URL: https://issues.apache.org/jira/browse/SPARK-38345
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Counting elements within an array is a common use case. Other DBMSs like 
> Snowflake support that as well:
> https://docs.snowflake.com/en/sql-reference/functions/array_size.html






[jira] [Commented] (SPARK-38344) Avoid to submit task when there are no requests to push up in push-based shuffle

2022-02-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498727#comment-17498727
 ] 

Apache Spark commented on SPARK-38344:
--

User 'weixiuli' has created a pull request for this issue:
https://github.com/apache/spark/pull/35675

> Avoid to submit task when there are no requests to push up in push-based 
> shuffle
> 
>
> Key: SPARK-38344
> URL: https://issues.apache.org/jira/browse/SPARK-38344
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0, 3.2.1
>Reporter: weixiuli
>Priority: Major
>







[jira] [Commented] (SPARK-38344) Avoid to submit task when there are no requests to push up in push-based shuffle

2022-02-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498726#comment-17498726
 ] 

Apache Spark commented on SPARK-38344:
--

User 'weixiuli' has created a pull request for this issue:
https://github.com/apache/spark/pull/35675

> Avoid to submit task when there are no requests to push up in push-based 
> shuffle
> 
>
> Key: SPARK-38344
> URL: https://issues.apache.org/jira/browse/SPARK-38344
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0, 3.2.1
>Reporter: weixiuli
>Priority: Major
>







[jira] [Assigned] (SPARK-38344) Avoid to submit task when there are no requests to push up in push-based shuffle

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38344:


Assignee: (was: Apache Spark)

> Avoid to submit task when there are no requests to push up in push-based 
> shuffle
> 
>
> Key: SPARK-38344
> URL: https://issues.apache.org/jira/browse/SPARK-38344
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0, 3.2.1
>Reporter: weixiuli
>Priority: Major
>







[jira] [Assigned] (SPARK-38344) Avoid to submit task when there are no requests to push up in push-based shuffle

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38344:


Assignee: Apache Spark

> Avoid to submit task when there are no requests to push up in push-based 
> shuffle
> 
>
> Key: SPARK-38344
> URL: https://issues.apache.org/jira/browse/SPARK-38344
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0, 3.2.1
>Reporter: weixiuli
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-38343) Fix SQLQuerySuite under ANSI mode

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38343:


Assignee: Apache Spark  (was: Gengliang Wang)

> Fix SQLQuerySuite under ANSI mode
> -
>
> Key: SPARK-38343
> URL: https://issues.apache.org/jira/browse/SPARK-38343
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-38343) Fix SQLQuerySuite under ANSI mode

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38343:


Assignee: Gengliang Wang  (was: Apache Spark)

> Fix SQLQuerySuite under ANSI mode
> -
>
> Key: SPARK-38343
> URL: https://issues.apache.org/jira/browse/SPARK-38343
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>







[jira] [Commented] (SPARK-38343) Fix SQLQuerySuite under ANSI mode

2022-02-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498724#comment-17498724
 ] 

Apache Spark commented on SPARK-38343:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35674

> Fix SQLQuerySuite under ANSI mode
> -
>
> Key: SPARK-38343
> URL: https://issues.apache.org/jira/browse/SPARK-38343
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode

2022-02-27 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-38321:
--

Assignee: Xinyi Yu

> Fix BooleanSimplificationSuite under ANSI mode
> --
>
> Key: SPARK-38321
> URL: https://issues.apache.org/jira/browse/SPARK-38321
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
>







[jira] [Resolved] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode

2022-02-27 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-38321.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35654
[https://github.com/apache/spark/pull/35654]

> Fix BooleanSimplificationSuite under ANSI mode
> --
>
> Key: SPARK-38321
> URL: https://issues.apache.org/jira/browse/SPARK-38321
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Created] (SPARK-38344) Avoid to submit task when there are no requests to push up in push-based shuffle

2022-02-27 Thread weixiuli (Jira)
weixiuli created SPARK-38344:


 Summary: Avoid to submit task when there are no requests to push 
up in push-based shuffle
 Key: SPARK-38344
 URL: https://issues.apache.org/jira/browse/SPARK-38344
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 3.2.1, 3.2.0
Reporter: weixiuli









[jira] [Created] (SPARK-38343) Fix SQLQuerySuite under ANSI mode

2022-02-27 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-38343:
--

 Summary: Fix SQLQuerySuite under ANSI mode
 Key: SPARK-38343
 URL: https://issues.apache.org/jira/browse/SPARK-38343
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang









[jira] [Comment Edited] (SPARK-38341) Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date

2022-02-27 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498712#comment-17498712
 ] 

Yang Jie edited comment on SPARK-38341 at 2/28/22, 5:48 AM:


I don't think it's a Spark 3.2.1 bug.

 

last_day('2020-06-30') is '2020-06-30', and the result of 
ADD_MONTHS('2020-06-30', -1) is the same as `java.time.LocalDate.of(2020, 6, 
30).plusMonths(-1)` and `new org.joda.time.LocalDate(2020, 6, 
30).plusMonths(-1)`.

You can use `last_day(ADD_MONTHS('2020-06-30', -1))` instead to get the 
results you want.

 


was (Author: luciferyang):
I don't think it's a Spark 3.2.1 bug.

 

last_day('2020-06-30') is '2020-06-30', and the result of 
ADD_MONTHS('2020-06-30', -1) is the same as that of `java.time.LocalDate.of(2020, 
6, 30).plusMonths(-1)` and `new org.joda.time.LocalDate(2020, 6, 
30).plusMonths(-1)`.

You can use `last_day(ADD_MONTHS('2020-06-30', -1))` instead to get the 
results you want.

 

> Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date
> 
>
> Key: SPARK-38341
> URL: https://issues.apache.org/jira/browse/SPARK-38341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: davon.cao
>Priority: Major
>
> Steps to reproduce:
> Version of Spark SQL: 3.2.1 (latest version in the Maven repository)
> Run SQL:
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31
> actual: 2020-05-30 (x)
>  
> Version of spark sql: 2.4.3 
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31 
> actual: 2020-05-31 (/)
>  






[jira] [Commented] (SPARK-38341) Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date

2022-02-27 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498712#comment-17498712
 ] 

Yang Jie commented on SPARK-38341:
--

I don't think it's a Spark 3.2.1 bug.

 

last_day('2020-06-30') is '2020-06-30', and the result of 
ADD_MONTHS('2020-06-30', -1) is the same as that of `java.time.LocalDate.of(2020, 
6, 30).plusMonths(-1)` and `new org.joda.time.LocalDate(2020, 6, 
30).plusMonths(-1)`.

You can use `last_day(ADD_MONTHS('2020-06-30', -1))` instead to get the 
results you want.
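A minimal sketch reproducing the behavior described above, with the workaround (the result values follow the month arithmetic explained in this comment):

{code:java}
// add_months follows java.time-style month arithmetic: the day-of-month
// is clamped to the last valid day, but "last day of month" is not
// preserved when moving between months of different lengths.
spark.sql("SELECT add_months('2020-06-30', -1)").show()
// 2020-05-30
// Workaround from this comment: re-apply last_day after shifting months.
spark.sql("SELECT last_day(add_months('2020-06-30', -1))").show()
// 2020-05-31
{code}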

 

> Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date
> 
>
> Key: SPARK-38341
> URL: https://issues.apache.org/jira/browse/SPARK-38341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: davon.cao
>Priority: Major
>
> Steps to reproduce:
> Version of Spark SQL: 3.2.1 (latest version in the Maven repository)
> Run SQL:
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31
> actual: 2020-05-30 (x)
>  
> Version of spark sql: 2.4.3 
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31 
> actual: 2020-05-31 (/)
>  






[jira] [Assigned] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38204:


Assignee: (was: Apache Spark)

> All state operators are at a risk of inconsistency between state partitioning 
> and operator partitioning
> ---
>
> Key: SPARK-38204
> URL: https://issues.apache.org/jira/browse/SPARK-38204
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness
>
> Except stream-stream join, all stateful operators use ClusteredDistribution 
> as a requirement of child distribution.
> ClusteredDistribution is a very relaxed one - any output partitioning can 
> satisfy the distribution as long as it ensures that all tuples with the same 
> grouping keys are placed in the same partition.
> To illustrate, suppose we run a streaming aggregation like the code below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> In this code, a streaming aggregation operator will be involved in the physical 
> plan, which would have ClusteredDistribution("group1", "group2", "window").
> The problem is, various output partitionings can satisfy this distribution:
>  * RangePartitioning
>  ** This accepts the exact grouping keys or any subset of them, in any order 
> (combination), with any sort order (asc/desc)
>  * HashPartitioning
>  ** This accepts the exact grouping keys or any subset of them, in any order 
> (combination)
>  * (upcoming Spark 3.3.0+) DataSourcePartitioning
>  ** output partitioning provided by a data source will be able to satisfy 
> ClusteredDistribution, which will make things worse (assuming a data source 
> can provide different output partitionings relatively easily)
> e.g. even if we only consider HashPartitioning, HashPartitioning("group1"), 
> HashPartitioning("group2"), HashPartitioning("group1", "group2"), 
> HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", 
> "window"), etc. all satisfy the distribution.
> The requirement on state partitioning is much stricter, since we must not 
> change the partitioning once the state is partitioned and built. *It should 
> ensure that all tuples with the same grouping keys are placed in the same 
> partition (same partition ID) across the query lifetime.*
> *This mismatch between the ClusteredDistribution requirement and state 
> partitioning silently leads to correctness issues.*
> For example, let's assume we have a streaming query like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group2")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group2") satisfies ClusteredDistribution("group1", "group2", 
> "window"), so Spark won't introduce additional shuffle there, and state 
> partitioning would be HashPartitioning("group2").
> we run this query for a while, and stop the query, and change the manual 
> partitioning like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group1")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group1") also satisfies ClusteredDistribution("group1", 
> "group2", "window"), so Spark won't introduce additional shuffle there. That 
> said, child output partitioning of streaming aggregation operator would be 
> HashPartitioning("group1"), whereas state partitioning is 
> HashPartitioning("group2").
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query]
> The SS guide doc enumerates the unsupported modifications of a query during 
> the lifetime of a streaming query, but this case is not mentioned there.
> Making this worse, Spark doesn't store any information on state partitioning 
> (that said, there is no way to validate it), so *Spark simply allows this 
> change and silently introduces a correctness issue while the streaming query 
> runs as if there were no problem at all.* The only way to notice the issue is 
> from the results of the query.
> We have no idea whether end users already suffer from this in their queries 
> or not. *The only way to look into it is to list all state rows, apply the 
> hash function with the expected grouping keys, and confirm that every row 
> maps to the exact partition ID it resides in.* If it turns out to be broken, 
> we will have to build a tool to “re”partition the state correctly, or in the 
> worst case, ask users to throw out the checkpoint and reprocess.
> {*}This issue has been 

[jira] [Assigned] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38204:


Assignee: Apache Spark

> All state operators are at a risk of inconsistency between state partitioning 
> and operator partitioning
> ---
>
> Key: SPARK-38204
> URL: https://issues.apache.org/jira/browse/SPARK-38204
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
>
> Except stream-stream join, all stateful operators use ClusteredDistribution 
> as a requirement of child distribution.
> ClusteredDistribution is a very relaxed one - any output partitioning can 
> satisfy the distribution as long as it ensures that all tuples with the same 
> grouping keys are placed in the same partition.
> To illustrate, suppose we run a streaming aggregation like the code below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> In this code, a streaming aggregation operator will be involved in the physical 
> plan, which would have ClusteredDistribution("group1", "group2", "window").
> The problem is, various output partitionings can satisfy this distribution:
>  * RangePartitioning
>  ** This accepts the exact grouping keys or any subset of them, in any order 
> (combination), with any sort order (asc/desc)
>  * HashPartitioning
>  ** This accepts the exact grouping keys or any subset of them, in any order 
> (combination)
>  * (upcoming Spark 3.3.0+) DataSourcePartitioning
>  ** output partitioning provided by a data source will be able to satisfy 
> ClusteredDistribution, which will make things worse (assuming a data source 
> can provide different output partitionings relatively easily)
> e.g. even if we only consider HashPartitioning, HashPartitioning("group1"), 
> HashPartitioning("group2"), HashPartitioning("group1", "group2"), 
> HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", 
> "window"), etc. all satisfy the distribution.
> The requirement on state partitioning is much stricter, since we must not 
> change the partitioning once the state is partitioned and built. *It should 
> ensure that all tuples with the same grouping keys are placed in the same 
> partition (same partition ID) across the query lifetime.*
> *This mismatch between the ClusteredDistribution requirement and state 
> partitioning silently leads to correctness issues.*
> For example, let's assume we have a streaming query like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group2")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group2") satisfies ClusteredDistribution("group1", "group2", 
> "window"), so Spark won't introduce additional shuffle there, and state 
> partitioning would be HashPartitioning("group2").
> we run this query for a while, and stop the query, and change the manual 
> partitioning like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group1")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group1") also satisfies ClusteredDistribution("group1", 
> "group2", "window"), so Spark won't introduce additional shuffle there. That 
> said, child output partitioning of streaming aggregation operator would be 
> HashPartitioning("group1"), whereas state partitioning is 
> HashPartitioning("group2").
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query]
> The SS guide doc enumerates the unsupported modifications of a query during 
> the lifetime of a streaming query, but this case is not mentioned there.
> Making this worse, Spark doesn't store any information on state partitioning 
> (that said, there is no way to validate it), so *Spark simply allows this 
> change and silently introduces a correctness issue while the streaming query 
> runs as if there were no problem at all.* The only way to notice the issue is 
> from the results of the query.
> We have no idea whether end users already suffer from this in their queries 
> or not. *The only way to look into it is to list all state rows, apply the 
> hash function with the expected grouping keys, and confirm that every row 
> maps to the exact partition ID it resides in.* If it turns out to be broken, 
> we will have to build a tool to “re”partition the state correctly, or in the 
> worst case, ask users to throw out the checkpoint and reprocess.
> 

[jira] [Assigned] (SPARK-38342) Clean up deprecated api usage of Ivy

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38342:


Assignee: (was: Apache Spark)

> Clean up deprecated api usage of Ivy
> 
>
> Key: SPARK-38342
> URL: https://issues.apache.org/jira/browse/SPARK-38342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
> [WARNING] [Warn] 
> /spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459:
>  [deprecation @ 
> org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | 
> origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy 
> is deprecated {code}






[jira] [Commented] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning

2022-02-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498706#comment-17498706
 ] 

Apache Spark commented on SPARK-38204:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35673

> All state operators are at a risk of inconsistency between state partitioning 
> and operator partitioning
> ---
>
> Key: SPARK-38204
> URL: https://issues.apache.org/jira/browse/SPARK-38204
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness
>
> Except stream-stream join, all stateful operators use ClusteredDistribution 
> as a requirement of child distribution.
> ClusteredDistribution is a very relaxed one - any output partitioning can 
> satisfy the distribution as long as it ensures that all tuples with the same 
> grouping keys are placed in the same partition.
> To illustrate, suppose we run a streaming aggregation like the code below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> In this code, a streaming aggregation operator will be involved in the physical 
> plan, which would have ClusteredDistribution("group1", "group2", "window").
> The problem is, various output partitionings can satisfy this distribution:
>  * RangePartitioning
>  ** This accepts the exact grouping keys or any subset of them, in any order 
> (combination), with any sort order (asc/desc)
>  * HashPartitioning
>  ** This accepts the exact grouping keys or any subset of them, in any order 
> (combination)
>  * (upcoming Spark 3.3.0+) DataSourcePartitioning
>  ** output partitioning provided by a data source will be able to satisfy 
> ClusteredDistribution, which will make things worse (assuming a data source 
> can provide different output partitionings relatively easily)
> e.g. even if we only consider HashPartitioning, HashPartitioning("group1"), 
> HashPartitioning("group2"), HashPartitioning("group1", "group2"), 
> HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", 
> "window"), etc. all satisfy the distribution.
> The requirement on state partitioning is much stricter, since we must not 
> change the partitioning once the state is partitioned and built. *It should 
> ensure that all tuples with the same grouping keys are placed in the same 
> partition (same partition ID) across the query lifetime.*
> *This mismatch between the ClusteredDistribution requirement and state 
> partitioning silently leads to correctness issues.*
> For example, let's assume we have a streaming query like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group2")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group2") satisfies ClusteredDistribution("group1", "group2", 
> "window"), so Spark won't introduce additional shuffle there, and state 
> partitioning would be HashPartitioning("group2").
> we run this query for a while, and stop the query, and change the manual 
> partitioning like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group1")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group1") also satisfies ClusteredDistribution("group1", 
> "group2", "window"), so Spark won't introduce additional shuffle there. That 
> said, child output partitioning of streaming aggregation operator would be 
> HashPartitioning("group1"), whereas state partitioning is 
> HashPartitioning("group2").
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query]
> The SS guide doc enumerates the unsupported modifications of a query during 
> the lifetime of a streaming query, but this case is not mentioned there.
> Making this worse, Spark doesn't store any information on state partitioning 
> (that said, there is no way to validate it), so *Spark simply allows this 
> change and silently introduces a correctness issue while the streaming query 
> runs as if there were no problem at all.* The only way to notice the issue is 
> from the results of the query.
> We have no idea whether end users already suffer from this in their queries 
> or not. *The only way to look into it is to list all state rows, apply the 
> hash function with the expected grouping keys, and confirm that every row 
> maps to the exact partition ID it resides in.* If it turns out to be broken, 
> we will have to build a tool to “re”partition the state 

[jira] [Commented] (SPARK-38342) Clean up deprecated api usage of Ivy

2022-02-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498705#comment-17498705
 ] 

Apache Spark commented on SPARK-38342:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35672

> Clean up deprecated api usage of Ivy
> 
>
> Key: SPARK-38342
> URL: https://issues.apache.org/jira/browse/SPARK-38342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
> [WARNING] [Warn] 
> /spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459:
>  [deprecation @ 
> org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | 
> origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy 
> is deprecated {code}






[jira] [Assigned] (SPARK-38342) Clean up deprecated api usage of Ivy

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38342:


Assignee: Apache Spark

> Clean up deprecated api usage of Ivy
> 
>
> Key: SPARK-38342
> URL: https://issues.apache.org/jira/browse/SPARK-38342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> {code:java}
> [WARNING] [Warn] 
> /spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459:
>  [deprecation @ 
> org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | 
> origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy 
> is deprecated {code}






[jira] [Created] (SPARK-38342) Clean up deprecated api usage of Ivy

2022-02-27 Thread Yang Jie (Jira)
Yang Jie created SPARK-38342:


 Summary: Clean up deprecated api usage of Ivy
 Key: SPARK-38342
 URL: https://issues.apache.org/jira/browse/SPARK-38342
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Yang Jie


{code:java}
[WARNING] [Warn] 
/spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459:
 [deprecation @ 
org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | 
origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy is 
deprecated {code}
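A minimal sketch of the cleanup, assuming Ivy 2.5's replacement API in which the destination pattern moves from a String parameter into RetrieveOptions (this illustrates the direction of the fix, not the merged change):

{code:java}
import org.apache.ivy.Ivy
import org.apache.ivy.core.module.id.ModuleRevisionId
import org.apache.ivy.core.retrieve.RetrieveOptions

// Hypothetical helper showing the non-deprecated overload.
def retrieveModule(ivy: Ivy, mrid: ModuleRevisionId, destPattern: String): Unit = {
  val options = new RetrieveOptions()
  // The pattern that used to be the deprecated String argument is now
  // carried by RetrieveOptions (assumed API).
  options.setDestArtifactPattern(destPattern)
  ivy.retrieve(mrid, options) // non-deprecated retrieve(mrid, options)
}
{code}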






[jira] [Closed] (SPARK-38326) aditya

2022-02-27 Thread Vallepu Durga Aditya (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vallepu Durga Aditya closed SPARK-38326.


final

> aditya
> --
>
> Key: SPARK-38326
> URL: https://issues.apache.org/jira/browse/SPARK-38326
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Vallepu Durga Aditya
>Priority: Major
> Fix For: 3.2.1
>
>







[jira] [Updated] (SPARK-38341) Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date

2022-02-27 Thread davon.cao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

davon.cao updated SPARK-38341:
--
Component/s: SQL
 (was: Spark Submit)

> Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date
> 
>
> Key: SPARK-38341
> URL: https://issues.apache.org/jira/browse/SPARK-38341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: davon.cao
>Priority: Major
>
> Steps to reproduce:
> Version of Spark SQL: 3.2.1 (latest version in the Maven repository)
> Run SQL:
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31
> actual: 2020-05-30 (x)
>  
> Version of spark sql: 2.4.3 
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31 
> actual: 2020-05-31 (/)
>  






[jira] [Updated] (SPARK-38341) Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date

2022-02-27 Thread davon.cao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

davon.cao updated SPARK-38341:
--
Description: 
Steps to reproduce:

Version of Spark SQL: 3.2.1 (latest version in the Maven repository)

Run SQL:

spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()

expect: 2020-05-31

actual: 2020-05-30 (x)

 

Version of spark sql: 2.4.3 

spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()

expect: 2020-05-31 

actual: 2020-05-31 (/)

 

  was:
Steps to reproduce:

Version of Spark SQL: 3.2.1 (latest version in the Maven repository)

Run SQL:

spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()

expect: 2020-05-31

actual: 2020-05-30 

 

Version of spark sql: 2.4.3 (/)

spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()

expect: 2020-05-31 

actual: 2020-05-31 

 


> Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date
> 
>
> Key: SPARK-38341
> URL: https://issues.apache.org/jira/browse/SPARK-38341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.2.1
>Reporter: davon.cao
>Priority: Major
>
> Steps to reproduce:
> Version of Spark SQL: 3.2.1 (latest version in the Maven repository)
> Run SQL:
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31
> actual: 2020-05-30 (x)
>  
> Version of spark sql: 2.4.3 
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31 
> actual: 2020-05-31 (/)
>  






[jira] [Updated] (SPARK-38341) Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date

2022-02-27 Thread davon.cao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

davon.cao updated SPARK-38341:
--
Summary: Spark sql: 3.2.1 - Function of add_ Months returns an incorrect 
date  (was: Spark sql - Function of add_ Months returns an incorrect date)

> Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date
> 
>
> Key: SPARK-38341
> URL: https://issues.apache.org/jira/browse/SPARK-38341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.2.1
>Reporter: davon.cao
>Priority: Major
>
> Steps to reproduce:
> Version of Spark SQL: 3.2.1 (latest version in the Maven repository)
> Run SQL:
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31
> actual: 2020-05-30 
>  
> Version of spark sql: 2.4.3 (/)
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31 
> actual: 2020-05-31 
>  






[jira] [Updated] (SPARK-38341) Spark sql - Function of add_ Months returns an incorrect date

2022-02-27 Thread davon.cao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

davon.cao updated SPARK-38341:
--
Description: 
Steps to reproduce:

Version of Spark SQL: 3.2.1 (latest version in the Maven repository)

Run SQL:

spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()

expect: 2020-05-31

actual: 2020-05-30 

 

Version of spark sql: 2.4.3 (/)

spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()

expect: 2020-05-31 

actual: 2020-05-31 

 

  was:
Steps to reproduce:

Version of Spark SQL: 3.2.1 (latest version in the Maven repository)

Run SQL:

spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()

expect: 2020-05-31

actual: 2020-05-30 

 

Version of spark sql: 2.4.3

spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()

expect: 2020-05-31 

actual: 2020-05-31

 


> Spark sql - Function of add_ Months returns an incorrect date
> -
>
> Key: SPARK-38341
> URL: https://issues.apache.org/jira/browse/SPARK-38341
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.2.1
>Reporter: davon.cao
>Priority: Major
>
> Steps to reproduce:
> Version of Spark SQL: 3.2.1 (latest version in the Maven repository)
> Run SQL:
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31
> actual: 2020-05-30 
>  
> Version of spark sql: 2.4.3 (/)
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expect: 2020-05-31 
> actual: 2020-05-31 
>  






[jira] [Updated] (SPARK-38331) csv parser exception when quote and escape are both double-quote and a value is just "," and column pruning enabled

2022-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38331:
-
Component/s: SQL
 (was: Input/Output)

> csv parser exception when quote and escape are both double-quote and a value 
> is just "," and column pruning enabled
> ---
>
> Key: SPARK-38331
> URL: https://issues.apache.org/jira/browse/SPARK-38331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Christopher Auston
>Priority: Minor
>
> Workaround: disable column pruning.
> Example pyspark code (from Databricks):
> {noformat}
> import pyspark
> print(pyspark.version.__version__)
> # enable column pruning (reset default value)
> spark.conf.set('spark.sql.csv.parser.columnPruning.enabled', 'true')
> dbutils.fs.put(file='/tmp/example.csv', 
> contents='''"col1","b4_comma","comma","col4"
> "","",",","x"
> ''', overwrite=True)
> df = spark.read.csv(
>     path='/tmp/example.csv'
>     ,inferSchema=True
>     ,header=True
>     ,escape='"'
>     ,multiLine=True
>     ,unescapedQuoteHandling='RAISE_ERROR'
>     ,mode='FAILFAST'
>     )
> ex = None
> try:
>     df.select(df.col1,df.comma).take(1)
> except Exception as e:
>     ex = e
>     
> if ex:
>     print('[pruning] Exception is raised if b4_comma is NOT selected')
>     
> df.select(df.b4_comma, df.comma).take(1)
> print('[pruning] No exception if b4_comma is selected')
> ex = None
> try:
>     df.count()
> except Exception as e:
>     ex = e
>     
> if ex:
>     print('[pruning] Exception raised by count')
> print('\ndisabling pruning\n')
>     
>     
> # disable column pruning
> spark.conf.set('spark.sql.csv.parser.columnPruning.enabled', 'false')
> df.select(df.col1,df.comma).take(1)
> print('[no prune] No exception if b4_comma is NOT selected') {noformat}
>  
> Output:
> {noformat}
> 3.1.2
> Wrote 47 bytes.
> [pruning] Exception is raised if b4_comma is NOT selected
> [pruning] No exception if b4_comma is selected
> [pruning] Exception raised by count
> disabling pruning
> [no prune] No exception if b4_comma is NOT selected {noformat}






[jira] [Commented] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS

2022-02-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498678#comment-17498678
 ] 

Hyukjin Kwon commented on SPARK-38329:
--

Spark 2.4.X is EOL. Can you test and see if the issue persists in Spark 3+?

> High I/O wait when Spark Structured Streaming checkpoint changed to EFS
> ---
>
> Key: SPARK-38329
> URL: https://issues.apache.org/jira/browse/SPARK-38329
> Project: Spark
>  Issue Type: Question
>  Components: EC2, Input/Output, PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Neven Jovic
>Priority: Major
> Attachments: Screenshot from 2022-02-25 14-16-11.png
>
>
> I'm currently running a Spark Structured Streaming application written in 
> Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I 
> changed my checkpoint location to Amazon EFS, which is shared across all 
> Spark workers, and after that I got increased I/O wait, averaging 8%.
>  
> !Screenshot from 2022-02-25 14-16-11.png!
> Currently I have 6000 messages coming into Kafka every second, and every 
> once in a while I get a WARN message:
> {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up 
> files for HDFSStateStoreProvider[id = (op=0,part=90),dir = 
> file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For 
> input string: ""
> {quote}
> I'm not quite sure whether that message has anything to do with the high I/O 
> wait, and whether this behavior is expected or something to be concerned about.
>  






[jira] [Commented] (SPARK-38327) JDBC Source with MariaDB connection returns column names as values

2022-02-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498677#comment-17498677
 ] 

Hyukjin Kwon commented on SPARK-38327:
--

I think it needs a MariaDB dialect that implements 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
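A minimal sketch of what such a dialect could look like, modeled on the backtick quoting the built-in MySQL dialect uses (the object name and the set of overrides are assumptions for illustration, not a merged change):

{code:java}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect: claim jdbc:mariadb URLs, which no built-in dialect
// handles, and quote identifiers with backticks so that a projected column
// such as "name" is not parsed by the server as a string literal - the
// likely reason column names come back as values.
case object MariaDBDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:mariadb")

  override def quoteIdentifier(colName: String): String =
    s"`$colName`"
}

// Register the dialect before reading through the JDBC source.
JdbcDialects.registerDialect(MariaDBDialect)
{code}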

> JDBC Source with MariaDB connection returns column names as values
> --
>
> Key: SPARK-38327
> URL: https://issues.apache.org/jira/browse/SPARK-38327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment: MariaDB version 10.3.10
> Running with spark-k8s-operator
>Reporter: Marvin Rösch
>Priority: Minor
>
> Using a JDBC source with the official MariaDB JDBC driver and a JDBC 
> connection URL like the following does not work as expected:
> {noformat}
> jdbc:mariadb://db.example.com:3306/schema {noformat}
> Assume we have a table "values" like the following in MariaDB:
> ||id (binary)||name (varchar)||
> |0xAB|Name 1|
> |0xBC|Name 2|
> We intend to create and display a data frame from it like this:
> {code:scala}
> spark.read
>   .format("jdbc")
>   .option("url", "jdbc:mariadb://db.example.com:3306/schema")
>   .option("dbtable", "values")
>   .load()
>   .show{code}
> *Expected Behavior*
> Using such a connection URL on an arbitrary MariaDB table or query results in 
> a data frame that reflects the table structure and content from MariaDB 
> correctly, with columns having the correct type and values.
> The output of the above should be
> {noformat}
> +----+------+
> |  id|  name|
> +----+------+
> |[AB]|Name 1|
> |[BC]|Name 2|
> +----+------+{noformat}
> *Observed Behavior*
> Result rows contain column names as values, making them effectively useless 
> to work with.
> The actual output is
> {noformat}
> +-------+----+
> |     id|name|
> +-------+----+
> |[69 64]|name|
> |[69 64]|name|
> +-------+----+{noformat}
> *Further information*
> An easy workaround appears to be specifying "mysql" instead of "mariadb" in 
> the connection URL while explicitly specifying the MariaDB driver. I'd expect 
> the mariadb URL to work out of the box, however.
> It looks like this has been an issue since at least 2016 according to a 
> [StackOverflow 
> post|https://stackoverflow.com/questions/38808463/incorrect-data-while-loading-jdbc-table-in-spark-sql].






[jira] [Resolved] (SPARK-38326) aditya

2022-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38326.
--
Resolution: Invalid

> aditya
> --
>
> Key: SPARK-38326
> URL: https://issues.apache.org/jira/browse/SPARK-38326
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Vallepu Durga Aditya
>Priority: Major
> Fix For: 3.2.1
>
>







[jira] [Created] (SPARK-38341) Spark sql - Function of add_ Months returns an incorrect date

2022-02-27 Thread davon.cao (Jira)
davon.cao created SPARK-38341:
-

 Summary: Spark sql - Function of add_ Months returns an incorrect 
date
 Key: SPARK-38341
 URL: https://issues.apache.org/jira/browse/SPARK-38341
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 3.2.1
Reporter: davon.cao


Steps to reproduce:

Version of Spark SQL: 3.2.1 (latest version in the Maven repository)

Run SQL:

spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()

expect: 2020-05-31

actual: 2020-05-30 

 

Version of spark sql: 2.4.3

spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()

expect: 2020-05-31 

actual: 2020-05-31

 






[jira] [Resolved] (SPARK-38337) Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to cleanup deprecated api usage

2022-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38337.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35665
[https://github.com/apache/spark/pull/35665]

> Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to 
> cleanup deprecated api usage
> --
>
> Key: SPARK-38337
> URL: https://issues.apache.org/jira/browse/SPARK-38337
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, MLlib, Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: Screenshot_20220227-050659.png
>
>
> In Scala 2.12, {{IterableLike.toIterator}} is identified as 
> {{@deprecatedOverriding}}:
>  
> {code:java}
> @deprecatedOverriding("toIterator should stay consistent with iterator for 
> all Iterables: override iterator instead.", "2.11.0") 
> override def toIterator: Iterator[A] = iterator {code}
> In Scala 2.13, {{IterableOnce.toIterator}} is identified as {{@deprecated}}:
> {code:java}
> @deprecated("Use .iterator instead of .toIterator", "2.13.0") @`inline` final 
> def toIterator: Iterator[A] = iterator {code}
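The replacement is mechanical and behavior-preserving; a minimal sketch:

{code:java}
val xs = Seq(1, 2, 3)
val deprecatedWay: Iterator[Int] = xs.toIterator // deprecated since Scala 2.13.0
val preferredWay: Iterator[Int] = xs.iterator    // same iterator, non-deprecated
{code}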






[jira] [Assigned] (SPARK-38337) Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to cleanup deprecated api usage

2022-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38337:


Assignee: Yang Jie

> Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to 
> cleanup deprecated api usage
> --
>
> Key: SPARK-38337
> URL: https://issues.apache.org/jira/browse/SPARK-38337
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, MLlib, Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Attachments: Screenshot_20220227-050659.png
>
>
> In Scala 2.12, {{IterableLike.toIterator}} is identified as 
> {{@deprecatedOverriding}}:
>  
> {code:java}
> @deprecatedOverriding("toIterator should stay consistent with iterator for 
> all Iterables: override iterator instead.", "2.11.0") 
> override def toIterator: Iterator[A] = iterator {code}
> In Scala 2.13, {{IterableOnce.toIterator}} is identified as {{@deprecated}}:
> {code:java}
> @deprecated("Use .iterator instead of .toIterator", "2.13.0") @`inline` final 
> def toIterator: Iterator[A] = iterator {code}






[jira] [Resolved] (SPARK-38338) Remove test dependency of hamcrest

2022-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38338.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35666
[https://github.com/apache/spark/pull/35666]

> Remove test dependency of hamcrest
> --
>
> Key: SPARK-38338
> URL: https://issues.apache.org/jira/browse/SPARK-38338
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> SPARK-7081 introduced a test dependency on hamcrest, but the current Spark 
> UTs don't rely much on this library.
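
For reference, a sketch of the kind of test-scoped block such a cleanup removes from the module poms (the exact coordinates and version Spark pinned are an assumption here):

{code:xml}
<!-- hypothetical hamcrest test dependency slated for removal -->
<dependency>
  <groupId>org.hamcrest</groupId>
  <artifactId>hamcrest-core</artifactId>
  <version>1.3</version>
  <scope>test</scope>
</dependency>
{code}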



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38338) Remove test dependency of hamcrest

2022-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38338:


Assignee: Yang Jie

> Remove test dependency of hamcrest
> --
>
> Key: SPARK-38338
> URL: https://issues.apache.org/jira/browse/SPARK-38338
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> SPARK-7081 introduced a test dependency on hamcrest, but the current Spark 
> UTs don't rely much on this library.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25

2022-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38339.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35668
[https://github.com/apache/spark/pull/35668]

> Upgrade RoaringBitmap to 0.9.25
> ---
>
> Key: SPARK-38339
> URL: https://issues.apache.org/jira/browse/SPARK-38339
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25

2022-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38339:


Assignee: Yang Jie

> Upgrade RoaringBitmap to 0.9.25
> ---
>
> Key: SPARK-38339
> URL: https://issues.apache.org/jira/browse/SPARK-38339
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38288) Aggregate push down doesnt work using Spark SQL jdbc datasource with postgresql

2022-02-27 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-38288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498669#comment-17498669
 ] 

Daniel Fernández commented on SPARK-38288:
--

[~andrewfmurphy] However, [33352|https://github.com/apache/spark/pull/33352] 
and [33526|https://github.com/apache/spark/pull/33526] state that a JDBC 
implementation of aggregate pushdown has already been merged, including the 
associated tests for H2.

> Aggregate push down doesnt work using Spark SQL jdbc datasource with 
> postgresql
> ---
>
> Key: SPARK-38288
> URL: https://issues.apache.org/jira/browse/SPARK-38288
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Luis Lozano Coira
>Priority: Major
>  Labels: DataSource, Spark-SQL
>
> I am establishing a connection with postgresql using the Spark SQL jdbc 
> datasource. I have started the spark shell including the postgres driver and 
> I can connect and execute queries without problems. I am using this statement:
> {code:java}
> val df = spark.read.format("jdbc").option("url", 
> "jdbc:postgresql://host:port/").option("driver", 
> "org.postgresql.Driver").option("dbtable", "test").option("user", 
> "postgres").option("password", 
> "***").option("pushDownAggregate",true).load()
> {code}
> I am adding the pushDownAggregate option because I would like the 
> aggregations to be delegated to the source. But for some reason this is not 
> happening.
> Reviewing this pull request, it seems that this feature was merged into 
> 3.2. [https://github.com/apache/spark/pull/29695]
> I am making the aggregations considering the mentioned limitations. An 
> example case where I don't see pushdown being done would be this one:
> {code:java}
> df.groupBy("name").max("age").show()
> {code}
> The results of the queryExecution are shown below:
> {code:java}
> scala> df.groupBy("name").max("age").queryExecution.executedPlan
> res19: org.apache.spark.sql.execution.SparkPlan =
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[name#274], functions=[max(age#246)], output=[name#274, 
> max(age)#544])
>+- Exchange hashpartitioning(name#274, 200), ENSURE_REQUIREMENTS, [id=#205]
>   +- HashAggregate(keys=[name#274], functions=[partial_max(age#246)], 
> output=[name#274, max#548])
>  +- Scan JDBCRelation(test) [numPartitions=1] [age#246,name#274] 
> PushedAggregates: [], PushedFilters: [], PushedGroupby: [], ReadSchema: 
> struct
> scala> dfp.groupBy("name").max("age").queryExecution.toString
> res20: String =
> "== Parsed Logical Plan ==
> Aggregate [name#274], [name#274, max(age#246) AS max(age)#581]
> +- Relation [age#246] JDBCRelation(test) [numPartitions=1]
> == Analyzed Logical Plan ==
> name: string, max(age): int
> Aggregate [name#274], [name#274, max(age#246) AS max(age)#581]
> +- Relation [age#24...
> {code}
> What could be the problem? Should pushDownAggregate work in this case?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38288) Aggregate push down doesnt work using Spark SQL jdbc datasource with postgresql

2022-02-27 Thread Andrew Murphy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498668#comment-17498668
 ] 

Andrew Murphy commented on SPARK-38288:
---

Hi [~llozano], I believe this is because JDBC DataSource V2 has not been fully 
implemented. Even though [29695|https://github.com/apache/spark/pull/29695] has 
been merged, reading from a JDBC connection still defaults to JDBC DataSource 
V1.
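
For anyone hitting this, a sketch of routing reads through the DataSource V2 JDBC path instead, where aggregate pushdown is wired up (the catalog class and option names are assumed from the Spark 3.2 sources; not a verified recipe):

{code:scala}
// Register a V2 JDBC catalog named "pg" (hypothetical catalog name).
spark.conf.set("spark.sql.catalog.pg",
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.pg.url", "jdbc:postgresql://host:port/")
spark.conf.set("spark.sql.catalog.pg.driver", "org.postgresql.Driver")
spark.conf.set("spark.sql.catalog.pg.pushDownAggregate", "true")

// Querying through the catalog uses the V2 scan, so the aggregate can
// show up under PushedAggregates in the physical plan.
spark.sql("SELECT name, MAX(age) FROM pg.test GROUP BY name").explain()
{code}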

> Aggregate push down doesnt work using Spark SQL jdbc datasource with 
> postgresql
> ---
>
> Key: SPARK-38288
> URL: https://issues.apache.org/jira/browse/SPARK-38288
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Luis Lozano Coira
>Priority: Major
>  Labels: DataSource, Spark-SQL
>
> I am establishing a connection with postgresql using the Spark SQL jdbc 
> datasource. I have started the spark shell including the postgres driver and 
> I can connect and execute queries without problems. I am using this statement:
> {code:java}
> val df = spark.read.format("jdbc").option("url", 
> "jdbc:postgresql://host:port/").option("driver", 
> "org.postgresql.Driver").option("dbtable", "test").option("user", 
> "postgres").option("password", 
> "***").option("pushDownAggregate",true).load()
> {code}
> I am adding the pushDownAggregate option because I would like the 
> aggregations to be delegated to the source. But for some reason this is not 
> happening.
> Reviewing this pull request, it seems that this feature was merged into 
> 3.2. [https://github.com/apache/spark/pull/29695]
> I am making the aggregations considering the mentioned limitations. An 
> example case where I don't see pushdown being done would be this one:
> {code:java}
> df.groupBy("name").max("age").show()
> {code}
> The results of the queryExecution are shown below:
> {code:java}
> scala> df.groupBy("name").max("age").queryExecution.executedPlan
> res19: org.apache.spark.sql.execution.SparkPlan =
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[name#274], functions=[max(age#246)], output=[name#274, 
> max(age)#544])
>+- Exchange hashpartitioning(name#274, 200), ENSURE_REQUIREMENTS, [id=#205]
>   +- HashAggregate(keys=[name#274], functions=[partial_max(age#246)], 
> output=[name#274, max#548])
>  +- Scan JDBCRelation(test) [numPartitions=1] [age#246,name#274] 
> PushedAggregates: [], PushedFilters: [], PushedGroupby: [], ReadSchema: 
> struct
> scala> dfp.groupBy("name").max("age").queryExecution.toString
> res20: String =
> "== Parsed Logical Plan ==
> Aggregate [name#274], [name#274, max(age#246) AS max(age)#581]
> +- Relation [age#246] JDBCRelation(test) [numPartitions=1]
> == Analyzed Logical Plan ==
> name: string, max(age): int
> Aggregate [name#274], [name#274, max(age#246) AS max(age)#581]
> +- Relation [age#24...
> {code}
> What could be the problem? Should pushDownAggregate work in this case?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2022-02-27 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498640#comment-17498640
 ] 

Sean R. Owen commented on SPARK-25075:
--

Unknown, though I'd guess not this year. What depends on that though, dropping 
2.12 support?

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Assignee: Sean R. Owen
>Priority: Major
> Fix For: 3.2.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.
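
For reference, the 2.13 build is exercised roughly like this (commands per the Spark build documentation):

{code:bash}
# Rewrite the poms' Scala version, then build with the 2.13 profile.
./dev/change-scala-version.sh 2.13
./build/mvn -Pscala-2.13 -DskipTests clean package
{code}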



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2022-02-27 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498639#comment-17498639
 ] 

Ismael Juma commented on SPARK-25075:
-

Is there a very rough timeline for 4.0 or it completely unknown at this stage?

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Assignee: Sean R. Owen
>Priority: Major
> Fix For: 3.2.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25075) Build and test Spark against Scala 2.13

2022-02-27 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498639#comment-17498639
 ] 

Ismael Juma edited comment on SPARK-25075 at 2/27/22, 5:39 PM:
---

Is there a very rough timeline for 4.0 or is it completely unknown at this 
stage?


was (Author: ijuma):
Is there a very rough timeline for 4.0 or it completely unknown at this stage?

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Assignee: Sean R. Owen
>Priority: Major
> Fix For: 3.2.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2022-02-27 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498622#comment-17498622
 ] 

Sean R. Owen commented on SPARK-25075:
--

I don't think that's the plan. Certainly not to remove 2.12 before 4.0, but 
probably not to change defaults soon either.

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Assignee: Sean R. Owen
>Priority: Major
> Fix For: 3.2.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2022-02-27 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498615#comment-17498615
 ] 

Yang Jie commented on SPARK-25075:
--

Do we plan to make Scala 2.13 the default Scala version in Spark 3.3?

 

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Assignee: Sean R. Owen
>Priority: Major
> Fix For: 3.2.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling

2022-02-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498612#comment-17498612
 ] 

Apache Spark commented on SPARK-38112:
--

User 'ivoson' has created a pull request for this issue:
https://github.com/apache/spark/pull/35670

> Use error classes in the execution errors of date/timestamp handling
> 
>
> Key: SPARK-38112
> URL: https://issues.apache.org/jira/browse/SPARK-38112
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * sparkUpgradeInReadingDatesError
> * sparkUpgradeInWritingDatesError
> * timeZoneIdNotSpecifiedForTimestampTypeError
> * cannotConvertOrcTimestampToTimestampNTZError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryExecutionErrorsSuite.
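
For illustration, a sketch of the shape such a migration usually takes (the error-class name and the constructor signature below are assumptions, not the actual patch):

{code:scala}
import org.apache.spark.SparkUpgradeException

// 1. Register an entry in core/src/main/resources/error/error-classes.json,
//    e.g. "INCONSISTENT_BEHAVIOR_CROSS_VERSION" (name assumed).
// 2. Have the error factory return a SparkThrowable implementation:
def sparkUpgradeInReadingDatesError(
    format: String, config: String, option: String): SparkUpgradeException = {
  new SparkUpgradeException(
    errorClass = "INCONSISTENT_BEHAVIOR_CROSS_VERSION", // assumed entry
    messageParameters = Array(format, config, option),
    cause = null)
}
{code}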



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling

2022-02-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498611#comment-17498611
 ] 

Apache Spark commented on SPARK-38112:
--

User 'ivoson' has created a pull request for this issue:
https://github.com/apache/spark/pull/35670

> Use error classes in the execution errors of date/timestamp handling
> 
>
> Key: SPARK-38112
> URL: https://issues.apache.org/jira/browse/SPARK-38112
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * sparkUpgradeInReadingDatesError
> * sparkUpgradeInWritingDatesError
> * timeZoneIdNotSpecifiedForTimestampTypeError
> * cannotConvertOrcTimestampToTimestampNTZError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38112:


Assignee: (was: Apache Spark)

> Use error classes in the execution errors of date/timestamp handling
> 
>
> Key: SPARK-38112
> URL: https://issues.apache.org/jira/browse/SPARK-38112
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * sparkUpgradeInReadingDatesError
> * sparkUpgradeInWritingDatesError
> * timeZoneIdNotSpecifiedForTimestampTypeError
> * cannotConvertOrcTimestampToTimestampNTZError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38112:


Assignee: Apache Spark

> Use error classes in the execution errors of date/timestamp handling
> 
>
> Key: SPARK-38112
> URL: https://issues.apache.org/jira/browse/SPARK-38112
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * sparkUpgradeInReadingDatesError
> * sparkUpgradeInWritingDatesError
> * timeZoneIdNotSpecifiedForTimestampTypeError
> * cannotConvertOrcTimestampToTimestampNTZError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38041) DataFilter pushed down with PartitionFilter

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38041:


Assignee: (was: Apache Spark)

> DataFilter pushed down with PartitionFilter
> ---
>
> Key: SPARK-38041
> URL: https://issues.apache.org/jira/browse/SPARK-38041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jackey Lee
>Priority: Major
>
> At present, a Filter is split into a DataFilter and a PartitionFilter when 
> it is pushed down, but because the PartitionFilter conditions are stripped 
> out of the DataFilter, the weakened DataFilter is applied across all 
> partitions and may degenerate into a full data scan.
> Here is an example.
> before
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
> Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
> [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
> PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: 
> [], ReadSchema: struct, PushedFilters: 
> [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], 
> PushedGroupBy: [] RuntimeFilters: []
> {code}
> after
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
> Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
> [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
> PushedFilters: 
> [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),Le...,
>  PushedGroupBy: [], ReadSchema: struct, PushedFilters: 
> [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),LessThan(c,3)))],
>  PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] {code}
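
A hypothetical setup that matches the plans above (table and column names inferred from the plan output; partitioned by c):

{code:sql}
CREATE TABLE test_push_down (a BIGINT, b BIGINT, d STRING)
USING parquet PARTITIONED BY (c INT);

-- Mixed partition/data predicate: before the change, only the trivial
-- residue Or(LessThan(a,10),GreaterThanOrEqual(a,10)) reaches PushedFilters.
SELECT * FROM test_push_down
WHERE (a < 10 AND c = 0) OR (a >= 10 AND c >= 1 AND c < 3);
{code}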



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38041) DataFilter pushed down with PartitionFilter

2022-02-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498598#comment-17498598
 ] 

Apache Spark commented on SPARK-38041:
--

User 'stczwd' has created a pull request for this issue:
https://github.com/apache/spark/pull/35669

> DataFilter pushed down with PartitionFilter
> ---
>
> Key: SPARK-38041
> URL: https://issues.apache.org/jira/browse/SPARK-38041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jackey Lee
>Priority: Major
>
> At present, a Filter is split into a DataFilter and a PartitionFilter when 
> it is pushed down, but because the PartitionFilter conditions are stripped 
> out of the DataFilter, the weakened DataFilter is applied across all 
> partitions and may degenerate into a full data scan.
> Here is an example.
> before
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
> Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
> [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
> PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: 
> [], ReadSchema: struct, PushedFilters: 
> [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], 
> PushedGroupBy: [] RuntimeFilters: []
> {code}
> after
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
> Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
> [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
> PushedFilters: 
> [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),Le...,
>  PushedGroupBy: [], ReadSchema: struct, PushedFilters: 
> [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),LessThan(c,3)))],
>  PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38041) DataFilter pushed down with PartitionFilter

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38041:


Assignee: Apache Spark

> DataFilter pushed down with PartitionFilter
> ---
>
> Key: SPARK-38041
> URL: https://issues.apache.org/jira/browse/SPARK-38041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jackey Lee
>Assignee: Apache Spark
>Priority: Major
>
> At present, a Filter is split into a DataFilter and a PartitionFilter when 
> it is pushed down, but because the PartitionFilter conditions are stripped 
> out of the DataFilter, the weakened DataFilter is applied across all 
> partitions and may degenerate into a full data scan.
> Here is an example.
> before
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
> Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
> [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
> PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: 
> [], ReadSchema: struct, PushedFilters: 
> [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], 
> PushedGroupBy: [] RuntimeFilters: []
> {code}
> after
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
> Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
> [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
> PushedFilters: 
> [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),Le...,
>  PushedGroupBy: [], ReadSchema: struct, PushedFilters: 
> [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),LessThan(c,3)))],
>  PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38041) DataFilter pushed down with PartitionFilter

2022-02-27 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-38041:
---
Description: 
At present, a Filter is split into a DataFilter and a PartitionFilter when it 
is pushed down, but because the PartitionFilter conditions are stripped out of 
the DataFilter, the weakened DataFilter is applied across all partitions and 
may degenerate into a full data scan.

Here is an example.

before
{code:java}
== Physical Plan ==
*(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) 
AND (c#42 < 3)))
+- *(1) ColumnarToRow
   +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L < 
10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
[((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: 
[], ReadSchema: struct, PushedFilters: 
[Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], 
PushedGroupBy: [] RuntimeFilters: []
{code}
after
{code:java}
== Physical Plan ==
*(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) 
AND (c#42 < 3)))
+- *(1) ColumnarToRow
   +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L < 
10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
[((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
PushedFilters: 
[Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),Le...,
 PushedGroupBy: [], ReadSchema: struct, PushedFilters: 
[Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),LessThan(c,3)))],
 PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] {code}

  was:
At present, a Filter is split into a DataFilter and a PartitionFilter when it 
is pushed down, but because the PartitionFilter conditions are stripped out of 
the DataFilter, the weakened DataFilter is applied across all partitions and 
may degenerate into a full data scan.

Here is an example.

before
{code:java}
== Physical Plan ==
*(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND 
(c#2 < 3)))
+- *(1) ColumnarToRow
   +- FileScan parquet datasources.test_push_down[a#0,b#1,c#2] Batched: true, 
DataFilters: [((a#0 < 10) OR (a#0 >= 10))], Format: Parquet, Location: 
InMemoryFileIndex(0 paths)[], PartitionFilters: [((c#2 = 0) OR ((c#2 >= 1) AND 
(c#2 < 3)))], PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], 
ReadSchema: struct {code}
after
{code:java}
== Physical Plan ==
*(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND 
(c#2 < 3)))
+- *(1) ColumnarToRow
   +- FileScan parquet datasources.test_push_down[a#0,b#1,c#2] Batched: true, 
DataFilters: [(((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND 
(c#2 < 3)))], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], 
PartitionFilters: [((c#2 = 0) OR ((c#2 >= 1) AND (c#2 < 3)))], PushedFilters: 
[Or(LessThan(a,10),GreaterThanOrEqual(a,10))], ReadSchema: struct  
{code}


> DataFilter pushed down with PartitionFilter
> ---
>
> Key: SPARK-38041
> URL: https://issues.apache.org/jira/browse/SPARK-38041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jackey Lee
>Priority: Major
>
> At present, a Filter is split into a DataFilter and a PartitionFilter when 
> it is pushed down, but because the PartitionFilter conditions are stripped 
> out of the DataFilter, the weakened DataFilter is applied across all 
> partitions and may degenerate into a full data scan.
> Here is an example.
> before
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
> Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
> [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
> PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: 
> [], ReadSchema: struct, PushedFilters: 
> [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], 
> PushedGroupBy: [] RuntimeFilters: []
> {code}
> after
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 

[jira] [Updated] (SPARK-38340) Upgrade protobuf-java from 2.5.0 to 3.16.1

2022-02-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-38340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-38340:

Description: 
 [CVE-2021-22569|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-22569]

To do this upgrade I have done the following:

external/kinesis-asl-assembly/pom.xml: change line 65 to
3.16.1

pom.xml: change line 124 to 3.16.1

run
./dev/test-dependencies.sh --replace-manifest

which changes
dev/deps/spark-deps-hadoop-2-hive-2.3 line 235 to
protobuf-java/3.16.1//protobuf-java-3.16.1.jar

and

dev/deps/spark-deps-hadoop-3-hive-2.3 to
protobuf-java/3.16.1//protobuf-java-3.16.1.jar
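
The same steps as one shell sketch (the line-addressed sed edits assume the cited lines still hold the old version string, and GNU sed; verify before running):

{code:bash}
# Bump protobuf-java from 2.5.0 to 3.16.1 in both poms.
sed -i '65s/2\.5\.0/3.16.1/' external/kinesis-asl-assembly/pom.xml
sed -i '124s/2\.5\.0/3.16.1/' pom.xml

# Regenerate the pinned dependency manifests under dev/deps/.
./dev/test-dependencies.sh --replace-manifest
{code}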

My branch
[protobuf-java-from-2.5.0-to-3.16.1|https://github.com/bjornjorgensen/spark/tree/protobuf-java-from-2.5.0-to-3.16.1]
passes the tests, but when I run

./build/mvn -DskipTests clean package && ./build/mvn -e package
 

I get this error:

01:01:41.381 WARN 
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite: 

= POSSIBLE THREAD LEAK IN SUITE 
o.a.s.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite, threads: 
rpc-boss-3348-1 (daemon=true), shuffle-boss-3351-1 (daemon=true) =

Run completed in 1 hour, 7 minutes, 35 seconds.
Total number of tests run: 11260
Suites: completed 505, aborted 0
Tests: succeeded 11259, failed 1, canceled 5, ignored 57, pending 0
*** 1 TEST FAILED ***
[INFO] 
[INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [  3.396 s]
[INFO] Spark Project Tags . SUCCESS [  7.374 s]
[INFO] Spark Project Sketch ... SUCCESS [  9.324 s]
[INFO] Spark Project Local DB . SUCCESS [  4.097 s]
[INFO] Spark Project Networking ... SUCCESS [ 47.468 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 10.478 s]
[INFO] Spark Project Unsafe ... SUCCESS [  2.425 s]
[INFO] Spark Project Launcher . SUCCESS [  2.767 s]
[INFO] Spark Project Core . SUCCESS [30:56 min]
[INFO] Spark Project ML Local Library . SUCCESS [ 29.105 s]
[INFO] Spark Project GraphX ... SUCCESS [02:09 min]
[INFO] Spark Project Streaming  SUCCESS [05:21 min]
[INFO] Spark Project Catalyst . SUCCESS [08:15 min]
[INFO] Spark Project SQL .. FAILURE [  01:11 h]
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SKIPPED
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED
[INFO] Spark Integration for Kafka 0.10 ... SKIPPED
[INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
[INFO] Spark Avro . SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time:  02:00 h
[INFO] Finished at: 2022-02-27T01:01:44+01:00
[INFO] 
[ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.2:test 
(test) on project spark-sql_2.12: There are test failures -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project 
spark-sql_2.12: There are test failures
at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:215)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:148)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
(LifecycleModuleBuilder.java:117)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
(LifecycleModuleBuilder.java:81)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
 (SingleThreadedBuilder.java:56)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 
(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)

[jira] [Created] (SPARK-38340) Upgrade protobuf-java from 2.5.0 to 3.16.1

2022-02-27 Thread Jira
Bjørn Jørgensen created SPARK-38340:
---

 Summary: Upgrade protobuf-java from 2.5.0 to 3.16.1
 Key: SPARK-38340
 URL: https://issues.apache.org/jira/browse/SPARK-38340
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.3.0
Reporter: Bjørn Jørgensen


 [CVE-2021-22569|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-22569]

To do this upgrade I have done the following:

external/kinesis-asl-assembly/pom.xml: change line 65 to
3.16.1

pom.xml: change line 124 to 3.16.1

run
./dev/test-dependencies.sh --replace-manifest

which changes
dev/deps/spark-deps-hadoop-2-hive-2.3 line 235 to
protobuf-java/3.16.1//protobuf-java-3.16.1.jar

and

dev/deps/spark-deps-hadoop-3-hive-2.3 to
protobuf-java/3.16.1//protobuf-java-3.16.1.jar

My branch
[protobuf-java-from-2.5.0-to-3.16.1|https://github.com/bjornjorgensen/spark/tree/protobuf-java-from-2.5.0-to-3.16.1]
passes the tests, but when I run

./build/mvn -DskipTests clean package && ./build/mvn -e package
 

I get this error:

01:01:41.381 WARN 
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite: 

= POSSIBLE THREAD LEAK IN SUITE 
o.a.s.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite, threads: 
rpc-boss-3348-1 (daemon=true), shuffle-boss-3351-1 (daemon=true) =

Run completed in 1 hour, 7 minutes, 35 seconds.
Total number of tests run: 11260
Suites: completed 505, aborted 0
Tests: succeeded 11259, failed 1, canceled 5, ignored 57, pending 0
*** 1 TEST FAILED ***
[INFO] 
[INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [  3.396 s]
[INFO] Spark Project Tags . SUCCESS [  7.374 s]
[INFO] Spark Project Sketch ... SUCCESS [  9.324 s]
[INFO] Spark Project Local DB . SUCCESS [  4.097 s]
[INFO] Spark Project Networking ... SUCCESS [ 47.468 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 10.478 s]
[INFO] Spark Project Unsafe ... SUCCESS [  2.425 s]
[INFO] Spark Project Launcher . SUCCESS [  2.767 s]
[INFO] Spark Project Core . SUCCESS [30:56 min]
[INFO] Spark Project ML Local Library . SUCCESS [ 29.105 s]
[INFO] Spark Project GraphX ... SUCCESS [02:09 min]
[INFO] Spark Project Streaming  SUCCESS [05:21 min]
[INFO] Spark Project Catalyst . SUCCESS [08:15 min]
[INFO] Spark Project SQL .. FAILURE [  01:11 h]
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SKIPPED
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED
[INFO] Spark Integration for Kafka 0.10 ... SKIPPED
[INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
[INFO] Spark Avro . SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time:  02:00 h
[INFO] Finished at: 2022-02-27T01:01:44+01:00
[INFO] 
[ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.2:test 
(test) on project spark-sql_2.12: There are test failures -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project 
spark-sql_2.12: There are test failures
at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:215)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:148)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
(LifecycleModuleBuilder.java:117)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
(LifecycleModuleBuilder.java:81)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
 (SingleThreadedBuilder.java:56)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 

[jira] [Assigned] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38339:


Assignee: (was: Apache Spark)

> Upgrade RoaringBitmap to 0.9.25
> ---
>
> Key: SPARK-38339
> URL: https://issues.apache.org/jira/browse/SPARK-38339
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25

2022-02-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38339:


Assignee: Apache Spark

> Upgrade RoaringBitmap to 0.9.25
> ---
>
> Key: SPARK-38339
> URL: https://issues.apache.org/jira/browse/SPARK-38339
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25

2022-02-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498577#comment-17498577
 ] 

Apache Spark commented on SPARK-38339:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35668

> Upgrade RoaringBitmap to 0.9.25
> ---
>
> Key: SPARK-38339
> URL: https://issues.apache.org/jira/browse/SPARK-38339
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25

2022-02-27 Thread Yang Jie (Jira)
Yang Jie created SPARK-38339:


 Summary: Upgrade RoaringBitmap to 0.9.25
 Key: SPARK-38339
 URL: https://issues.apache.org/jira/browse/SPARK-38339
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38337) Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to cleanup deprecated api usage

2022-02-27 Thread Jj (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jj updated SPARK-38337:
---
Attachment: Screenshot_20220227-050659.png

> Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to 
> cleanup deprecated api usage
> --
>
> Key: SPARK-38337
> URL: https://issues.apache.org/jira/browse/SPARK-38337
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, MLlib, Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: Screenshot_20220227-050659.png
>
>
> In Scala 2.12, {{IterableLike.toIterator}} is identified as 
> {{@deprecatedOverriding}}:
>  
> {code:java}
> @deprecatedOverriding("toIterator should stay consistent with iterator for 
> all Iterables: override iterator instead.", "2.11.0") 
> override def toIterator: Iterator[A] = iterator {code}
> In Scala 2.13, {{IterableOnce.toIterator}} is identified as {{@deprecated}}:
> {code:java}
> @deprecated("Use .iterator instead of .toIterator", "2.13.0") @`inline` final 
> def toIterator: Iterator[A] = iterator {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org