[jira] [Assigned] (SPARK-38345) Introduce SQL function ARRAY_SIZE
[ https://issues.apache.org/jira/browse/SPARK-38345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38345: Assignee: (was: Apache Spark) > Introduce SQL function ARRAY_SIZE > - > > Key: SPARK-38345 > URL: https://issues.apache.org/jira/browse/SPARK-38345 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Counting elements within an array is a common use case. ARRAY_SIZE ensures > that the input is an array and then returns its size. > Other DBMSs like Snowflake support it as well: > https://docs.snowflake.com/en/sql-reference/functions/array_size.html. > Implementing it improves compatibility with other DBMSs and makes migration easier. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38345) Introduce SQL function ARRAY_SIZE
[ https://issues.apache.org/jira/browse/SPARK-38345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498741#comment-17498741 ] Apache Spark commented on SPARK-38345: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/35671 > Introduce SQL function ARRAY_SIZE > - > > Key: SPARK-38345 > URL: https://issues.apache.org/jira/browse/SPARK-38345 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Counting elements within an array is a common use case. ARRAY_SIZE ensures > that the input is an array and then returns its size. > Other DBMSs like Snowflake support it as well: > https://docs.snowflake.com/en/sql-reference/functions/array_size.html. > Implementing it improves compatibility with other DBMSs and makes migration easier. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38345) Introduce SQL function ARRAY_SIZE
[ https://issues.apache.org/jira/browse/SPARK-38345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38345: Assignee: Apache Spark > Introduce SQL function ARRAY_SIZE > - > > Key: SPARK-38345 > URL: https://issues.apache.org/jira/browse/SPARK-38345 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Counting elements within an array is a common use case. ARRAY_SIZE ensures > that the input is an array and then returns its size. > Other DBMSs like Snowflake support it as well: > https://docs.snowflake.com/en/sql-reference/functions/array_size.html. > Implementing it improves compatibility with other DBMSs and makes migration easier. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38345) Introduce SQL function ARRAY_SIZE
[ https://issues.apache.org/jira/browse/SPARK-38345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38345: - Description: Counting elements within an array is a common use case. ARRAY_SIZE ensures that the input is an array and then returns its size. Other DBMSs like Snowflake support it as well: https://docs.snowflake.com/en/sql-reference/functions/array_size.html. Implementing it improves compatibility with other DBMSs and makes migration easier. was: Counting elements within an array is a common use case. Other DBMSs like Snowflake support it as well: https://docs.snowflake.com/en/sql-reference/functions/array_size.html > Introduce SQL function ARRAY_SIZE > - > > Key: SPARK-38345 > URL: https://issues.apache.org/jira/browse/SPARK-38345 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Counting elements within an array is a common use case. ARRAY_SIZE ensures > that the input is an array and then returns its size. > Other DBMSs like Snowflake support it as well: > https://docs.snowflake.com/en/sql-reference/functions/array_size.html. > Implementing it improves compatibility with other DBMSs and makes migration easier. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38345) Introduce SQL function ARRAY_SIZE
Xinrong Meng created SPARK-38345: Summary: Introduce SQL function ARRAY_SIZE Key: SPARK-38345 URL: https://issues.apache.org/jira/browse/SPARK-38345 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Xinrong Meng Counting elements within an array is a common use case. Other DBMSs like Snowflake support it as well: https://docs.snowflake.com/en/sql-reference/functions/array_size.html -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38345) Introduce SQL function ARRAY_SIZE
[ https://issues.apache.org/jira/browse/SPARK-38345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498729#comment-17498729 ] Xinrong Meng commented on SPARK-38345: -- I am working on it. > Introduce SQL function ARRAY_SIZE > - > > Key: SPARK-38345 > URL: https://issues.apache.org/jira/browse/SPARK-38345 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Counting elements within an array is a common use case. Other DBMSs like > Snowflake support it as well: > https://docs.snowflake.com/en/sql-reference/functions/array_size.html -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
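The semantics proposed in the description (ensure the input is an array, then return its size) can be sketched in plain Python. This is an illustrative model of the described behavior, not Spark's actual implementation; the function name simply mirrors the proposed SQL function.

```python
def array_size(value):
    # Model of the proposed ARRAY_SIZE semantics: reject non-array input
    # instead of silently coercing it, then return the element count.
    if not isinstance(value, list):
        raise TypeError(f"array_size expects an array, got {type(value).__name__}")
    return len(value)

print(array_size([1, 2, 3]))  # 3
```

In SQL the call would presumably look like `SELECT array_size(array(1, 2, 3))`, matching the Snowflake function of the same name linked in the description.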
[jira] [Commented] (SPARK-38344) Avoid to submit task when there are no requests to push up in push-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-38344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498727#comment-17498727 ] Apache Spark commented on SPARK-38344: -- User 'weixiuli' has created a pull request for this issue: https://github.com/apache/spark/pull/35675 > Avoid to submit task when there are no requests to push up in push-based > shuffle > > > Key: SPARK-38344 > URL: https://issues.apache.org/jira/browse/SPARK-38344 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 3.2.0, 3.2.1 >Reporter: weixiuli >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38344) Avoid to submit task when there are no requests to push up in push-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-38344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498726#comment-17498726 ] Apache Spark commented on SPARK-38344: -- User 'weixiuli' has created a pull request for this issue: https://github.com/apache/spark/pull/35675 > Avoid to submit task when there are no requests to push up in push-based > shuffle > > > Key: SPARK-38344 > URL: https://issues.apache.org/jira/browse/SPARK-38344 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 3.2.0, 3.2.1 >Reporter: weixiuli >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38344) Avoid to submit task when there are no requests to push up in push-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-38344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38344: Assignee: (was: Apache Spark) > Avoid to submit task when there are no requests to push up in push-based > shuffle > > > Key: SPARK-38344 > URL: https://issues.apache.org/jira/browse/SPARK-38344 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 3.2.0, 3.2.1 >Reporter: weixiuli >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38344) Avoid to submit task when there are no requests to push up in push-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-38344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38344: Assignee: Apache Spark > Avoid to submit task when there are no requests to push up in push-based > shuffle > > > Key: SPARK-38344 > URL: https://issues.apache.org/jira/browse/SPARK-38344 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 3.2.0, 3.2.1 >Reporter: weixiuli >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38343) Fix SQLQuerySuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38343: Assignee: Apache Spark (was: Gengliang Wang) > Fix SQLQuerySuite under ANSI mode > - > > Key: SPARK-38343 > URL: https://issues.apache.org/jira/browse/SPARK-38343 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38343) Fix SQLQuerySuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38343: Assignee: Gengliang Wang (was: Apache Spark) > Fix SQLQuerySuite under ANSI mode > - > > Key: SPARK-38343 > URL: https://issues.apache.org/jira/browse/SPARK-38343 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38343) Fix SQLQuerySuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498724#comment-17498724 ] Apache Spark commented on SPARK-38343: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/35674 > Fix SQLQuerySuite under ANSI mode > - > > Key: SPARK-38343 > URL: https://issues.apache.org/jira/browse/SPARK-38343 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-38321: -- Assignee: Xinyi Yu > Fix BooleanSimplificationSuite under ANSI mode > -- > > Key: SPARK-38321 > URL: https://issues.apache.org/jira/browse/SPARK-38321 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Assignee: Xinyi Yu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-38321. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35654 [https://github.com/apache/spark/pull/35654] > Fix BooleanSimplificationSuite under ANSI mode > -- > > Key: SPARK-38321 > URL: https://issues.apache.org/jira/browse/SPARK-38321 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Assignee: Xinyi Yu >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38344) Avoid to submit task when there are no requests to push up in push-based shuffle
weixiuli created SPARK-38344: Summary: Avoid to submit task when there are no requests to push up in push-based shuffle Key: SPARK-38344 URL: https://issues.apache.org/jira/browse/SPARK-38344 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 3.2.1, 3.2.0 Reporter: weixiuli -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38343) Fix SQLQuerySuite under ANSI mode
Gengliang Wang created SPARK-38343: -- Summary: Fix SQLQuerySuite under ANSI mode Key: SPARK-38343 URL: https://issues.apache.org/jira/browse/SPARK-38343 Project: Spark Issue Type: Sub-task Components: SQL, Tests Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38341) Spark sql: 3.2.1 - Function of add_months returns an incorrect date
[ https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498712#comment-17498712 ] Yang Jie edited comment on SPARK-38341 at 2/28/22, 5:48 AM: I don't think it's a Spark 3.2.1 bug. last_day('2020-06-30') is '2020-06-30', and the result of ADD_MONTHS('2020-06-30', -1) is the same as `java.time.LocalDate.of(2020, 6, 30).plusMonths(-1)` and `new org.joda.time.LocalDate(2020, 6, 30).plusMonths(-1)`. You can use `last_day(ADD_MONTHS('2020-06-30', -1))` instead to get the result you want. was (Author: luciferyang): I don't think it's a Spark 3.2.1 bug. last_day('2020-06-30') is '2020-06-30', and the result of ADD_MONTHS('2020-06-30', -1) is the same as that of `java.time.LocalDate.of(2020, 6, 30).plusMonths(-1)` and `new org.joda.time.LocalDate(2020, 6, 30).plusMonths(-1)`. You can use `last_day(ADD_MONTHS('2020-06-30', -1))` instead to get the result you want. > Spark sql: 3.2.1 - Function of add_months returns an incorrect date > > > Key: SPARK-38341 > URL: https://issues.apache.org/jira/browse/SPARK-38341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: davon.cao >Priority: Major > > Steps to reproduce: > Version of Spark SQL: 3.2.1 (latest version in the Maven repository) > Run sql: > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-30 (x) > > Version of Spark SQL: 2.4.3 > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-31 (/) > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38341) Spark sql: 3.2.1 - Function of add_months returns an incorrect date
[ https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498712#comment-17498712 ] Yang Jie commented on SPARK-38341: -- I don't think it's a Spark 3.2.1 bug. last_day('2020-06-30') is '2020-06-30', and the result of ADD_MONTHS('2020-06-30', -1) is the same as that of `java.time.LocalDate.of(2020, 6, 30).plusMonths(-1)` and `new org.joda.time.LocalDate(2020, 6, 30).plusMonths(-1)`. You can use `last_day(ADD_MONTHS('2020-06-30', -1))` instead to get the result you want. > Spark sql: 3.2.1 - Function of add_months returns an incorrect date > > > Key: SPARK-38341 > URL: https://issues.apache.org/jira/browse/SPARK-38341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: davon.cao >Priority: Major > > Steps to reproduce: > Version of Spark SQL: 3.2.1 (latest version in the Maven repository) > Run sql: > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-30 (x) > > Version of Spark SQL: 2.4.3 > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-31 (/) > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
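The behavior Yang Jie describes (the day-of-month is preserved and only clamped when the target month is shorter, as in `java.time.LocalDate.plusMonths`) can be sketched in Python. `add_months` and `last_day` below are illustrative stand-ins for the SQL functions, not Spark's code:

```python
from datetime import date
import calendar

def add_months(d, n):
    # java.time-style plusMonths: shift the month, then clamp the
    # day-of-month to the last valid day of the target month.
    year, month0 = divmod(d.year * 12 + (d.month - 1) + n, 12)
    month = month0 + 1
    last = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last))

def last_day(d):
    # Last calendar day of d's month.
    return date(d.year, d.month, calendar.monthrange(d.year, d.month)[1])

# Day 30 is valid in May, so nothing is clamped and the "last day"
# property of the input is lost:
print(add_months(date(2020, 6, 30), -1))             # 2020-05-30
# The suggested workaround pins the result back to the month's last day:
print(last_day(add_months(date(2020, 6, 30), -1)))   # 2020-05-31
```

This illustrates why the 3.2.1 result is arguably correct rather than a bug: under these semantics `2020-06-30` minus one month keeps day 30, and only an explicit `last_day` restores end-of-month alignment.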
[jira] [Assigned] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning
[ https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38204: Assignee: (was: Apache Spark) > All state operators are at a risk of inconsistency between state partitioning > and operator partitioning > --- > > Key: SPARK-38204 > URL: https://issues.apache.org/jira/browse/SPARK-38204 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0 >Reporter: Jungtaek Lim >Priority: Blocker > Labels: correctness > > Except stream-stream join, all stateful operators use ClusteredDistribution > as a requirement of child distribution. > ClusteredDistribution is a very relaxed one - any output partitioning can > satisfy the distribution if the partitioning ensures that all tuples having > the same grouping keys are placed in the same partition. > To illustrate with an example, suppose we do streaming aggregation like the code below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > In this code, the streaming aggregation operator will be involved in the physical > plan, which would have ClusteredDistribution("group1", "group2", "window"). > The problem is, various output partitionings can satisfy this distribution: > * RangePartitioning > ** This accepts the exact grouping keys or a subset of them, with any order of keys > (combination), with any sort order (asc/desc) > * HashPartitioning > ** This accepts the exact grouping keys or a subset of them, with any order of keys > (combination) > * (upcoming Spark 3.3.0+) DataSourcePartitioning > ** output partitioning provided by a data source will be able to satisfy > ClusteredDistribution, which will make things worse (assuming a data source can > provide different output partitionings relatively easily) > e.g.
even if we only consider HashPartitioning: HashPartitioning("group1"), > HashPartitioning("group2"), HashPartitioning("group1", "group2"), > HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", > "window"), etc. > The requirement of state partitioning is much stricter, since we should > not change the partitioning once it is partitioned and built. *It should > ensure that all tuples having the same grouping keys are placed in the same partition > (same partition ID) across the query lifetime.* > *The mismatch in distribution requirements between ClusteredDistribution and > state partitioning silently leads to correctness issues.* > For example, let's assume we have a streaming query like below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .repartition("group2") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > repartition("group2") satisfies ClusteredDistribution("group1", "group2", > "window"), so Spark won't introduce an additional shuffle there, and the state > partitioning would be HashPartitioning("group2"). > We run this query for a while, stop the query, and change the manual > partitioning like below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .repartition("group1") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > repartition("group1") also satisfies ClusteredDistribution("group1", > "group2", "window"), so Spark won't introduce an additional shuffle there. That > said, the child output partitioning of the streaming aggregation operator would be > HashPartitioning("group1"), whereas the state partitioning is > HashPartitioning("group2"). > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query] > In the SS guide doc we enumerate the unsupported modifications of a query > during the lifetime of a streaming query, but there is no notion of this.
> Making this worse, Spark doesn't store any information on state partitioning > (that said, there is no way to validate it), so *Spark simply allows this change > and brings up a correctness issue while the streaming query runs as if there were no > problem at all.* The only indication of the problem is the result > of the query. > We have no idea whether end users already suffer from this in their queries > or not. *The only way to look into it is to list all state rows, apply the > hash function with the expected grouping keys, and confirm that every row is in the > exact partition it should be in.* If it turns out to be broken, we will > have to have a tool to “re”partition the state correctly, or in the worst case, > have to ask users to throw out the checkpoint and reprocess. > {*}This issue has been
[jira] [Assigned] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning
[ https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38204: Assignee: Apache Spark > All state operators are at a risk of inconsistency between state partitioning > and operator partitioning > --- > > Key: SPARK-38204 > URL: https://issues.apache.org/jira/browse/SPARK-38204 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Blocker > Labels: correctness > > Except stream-stream join, all stateful operators use ClusteredDistribution > as a requirement of child distribution. > ClusteredDistribution is a very relaxed one - any output partitioning can > satisfy the distribution if the partitioning ensures that all tuples having > the same grouping keys are placed in the same partition. > To illustrate with an example, suppose we do streaming aggregation like the code below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > In this code, the streaming aggregation operator will be involved in the physical > plan, which would have ClusteredDistribution("group1", "group2", "window"). > The problem is, various output partitionings can satisfy this distribution: > * RangePartitioning > ** This accepts the exact grouping keys or a subset of them, with any order of keys > (combination), with any sort order (asc/desc) > * HashPartitioning > ** This accepts the exact grouping keys or a subset of them, with any order of keys > (combination) > * (upcoming Spark 3.3.0+) DataSourcePartitioning > ** output partitioning provided by a data source will be able to satisfy > ClusteredDistribution, which will make things worse (assuming a data source can > provide different output partitionings relatively easily) > e.g.
even if we only consider HashPartitioning: HashPartitioning("group1"), > HashPartitioning("group2"), HashPartitioning("group1", "group2"), > HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", > "window"), etc. > The requirement of state partitioning is much stricter, since we should > not change the partitioning once it is partitioned and built. *It should > ensure that all tuples having the same grouping keys are placed in the same partition > (same partition ID) across the query lifetime.* > *The mismatch in distribution requirements between ClusteredDistribution and > state partitioning silently leads to correctness issues.* > For example, let's assume we have a streaming query like below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .repartition("group2") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > repartition("group2") satisfies ClusteredDistribution("group1", "group2", > "window"), so Spark won't introduce an additional shuffle there, and the state > partitioning would be HashPartitioning("group2"). > We run this query for a while, stop the query, and change the manual > partitioning like below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .repartition("group1") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > repartition("group1") also satisfies ClusteredDistribution("group1", > "group2", "window"), so Spark won't introduce an additional shuffle there. That > said, the child output partitioning of the streaming aggregation operator would be > HashPartitioning("group1"), whereas the state partitioning is > HashPartitioning("group2"). > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query] > In the SS guide doc we enumerate the unsupported modifications of a query > during the lifetime of a streaming query, but there is no notion of this.
> Making this worse, Spark doesn't store any information on state partitioning > (that said, there is no way to validate it), so *Spark simply allows this change > and brings up a correctness issue while the streaming query runs as if there were no > problem at all.* The only indication of the problem is the result > of the query. > We have no idea whether end users already suffer from this in their queries > or not. *The only way to look into it is to list all state rows, apply the > hash function with the expected grouping keys, and confirm that every row is in the > exact partition it should be in.* If it turns out to be broken, we will > have to have a tool to “re”partition the state correctly, or in the worst case, > have to ask users to throw out the checkpoint and reprocess. >
[jira] [Assigned] (SPARK-38342) Clean up deprecated api usage of Ivy
[ https://issues.apache.org/jira/browse/SPARK-38342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38342: Assignee: (was: Apache Spark) > Clean up deprecated api usage of Ivy > > > Key: SPARK-38342 > URL: https://issues.apache.org/jira/browse/SPARK-38342 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > [WARNING] [Warn] > /spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459: > [deprecation @ > org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | > origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy > is deprecated {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning
[ https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498706#comment-17498706 ] Apache Spark commented on SPARK-38204: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/35673 > All state operators are at a risk of inconsistency between state partitioning > and operator partitioning > --- > > Key: SPARK-38204 > URL: https://issues.apache.org/jira/browse/SPARK-38204 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0 >Reporter: Jungtaek Lim >Priority: Blocker > Labels: correctness > > Except stream-stream join, all stateful operators use ClusteredDistribution > as a requirement of child distribution. > ClusteredDistribution is a very relaxed one - any output partitioning can > satisfy the distribution if the partitioning ensures that all tuples having > the same grouping keys are placed in the same partition. > To illustrate with an example, suppose we do streaming aggregation like the code below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > In this code, the streaming aggregation operator will be involved in the physical > plan, which would have ClusteredDistribution("group1", "group2", "window").
> The problem is, various output partitionings can satisfy this distribution: > * RangePartitioning > ** This accepts the exact grouping keys or a subset of them, with any order of keys > (combination), with any sort order (asc/desc) > * HashPartitioning > ** This accepts the exact grouping keys or a subset of them, with any order of keys > (combination) > * (upcoming Spark 3.3.0+) DataSourcePartitioning > ** output partitioning provided by a data source will be able to satisfy > ClusteredDistribution, which will make things worse (assuming a data source can > provide different output partitionings relatively easily) > e.g. even if we only consider HashPartitioning: HashPartitioning("group1"), > HashPartitioning("group2"), HashPartitioning("group1", "group2"), > HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", > "window"), etc. > The requirement of state partitioning is much stricter, since we should > not change the partitioning once it is partitioned and built. *It should > ensure that all tuples having the same grouping keys are placed in the same partition > (same partition ID) across the query lifetime.* > *The mismatch in distribution requirements between ClusteredDistribution and > state partitioning silently leads to correctness issues.* > For example, let's assume we have a streaming query like below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .repartition("group2") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > repartition("group2") satisfies ClusteredDistribution("group1", "group2", > "window"), so Spark won't introduce an additional shuffle there, and the state > partitioning would be HashPartitioning("group2").
> We run this query for a while, stop the query, and change the manual > partitioning like below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .repartition("group1") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > repartition("group1") also satisfies ClusteredDistribution("group1", > "group2", "window"), so Spark won't introduce an additional shuffle there. That > said, the child output partitioning of the streaming aggregation operator would be > HashPartitioning("group1"), whereas the state partitioning is > HashPartitioning("group2"). > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query] > In the SS guide doc we enumerate the unsupported modifications of a query > during the lifetime of a streaming query, but there is no notion of this. > Making this worse, Spark doesn't store any information on state partitioning > (that said, there is no way to validate it), so *Spark simply allows this change > and brings up a correctness issue while the streaming query runs as if there were no > problem at all.* The only indication of the problem is the result > of the query. > We have no idea whether end users already suffer from this in their queries > or not. *The only way to look into it is to list all state rows, apply the > hash function with the expected grouping keys, and confirm that every row is in the > exact partition it should be in.* If it turns out to be broken, we will > have to have a tool to “re”partition the state
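The hazard described in SPARK-38204 can be sketched with a toy hash partitioner: both `repartition("group2")` and `repartition("group1")` satisfy ClusteredDistribution("group1", "group2", "window"), yet they generally route the same logical row to different partition IDs, so state written under one layout is looked up in the wrong partition under the other. This is an illustration only; `crc32` stands in for Spark's Murmur3-based HashPartitioning, and the partition count of 200 is an arbitrary assumption.

```python
import zlib

def partition_id(key_tuple, num_partitions):
    # Toy deterministic hash partitioner over a grouping-key tuple.
    # (Spark's HashPartitioning uses Murmur3; crc32 is illustrative.)
    return zlib.crc32(repr(key_tuple).encode()) % num_partitions

rows = [("a", "x"), ("b", "y"), ("c", "z"), ("d", "w"),
        ("e", "v"), ("f", "u"), ("g", "t"), ("h", "s")]

# First run: repartition("group2") -> state files laid out by group2's hash.
state_layout = {r: partition_id((r[1],), 200) for r in rows}

# After a restart with repartition("group1"), the operator derives
# partition IDs from group1 instead, so lookups miss the stored state.
new_layout = {r: partition_id((r[0],), 200) for r in rows}

mismatched = [r for r in rows if state_layout[r] != new_layout[r]]
print(f"{len(mismatched)} of {len(rows)} rows now map to a different partition")
```

This also mirrors the audit the reporter describes: listing state rows, re-hashing them with the expected grouping keys, and checking that each row sits in the partition its hash dictates.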
[jira] [Commented] (SPARK-38342) Clean up deprecated api usage of Ivy
[ https://issues.apache.org/jira/browse/SPARK-38342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498705#comment-17498705 ] Apache Spark commented on SPARK-38342: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/35672 > Clean up deprecated api usage of Ivy > > > Key: SPARK-38342 > URL: https://issues.apache.org/jira/browse/SPARK-38342 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > [WARNING] [Warn] > /spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459: > [deprecation @ > org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | > origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy > is deprecated {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38342) Clean up deprecated api usage of Ivy
[ https://issues.apache.org/jira/browse/SPARK-38342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38342: Assignee: Apache Spark > Clean up deprecated api usage of Ivy > > > Key: SPARK-38342 > URL: https://issues.apache.org/jira/browse/SPARK-38342 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > {code:java} > [WARNING] [Warn] > /spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459: > [deprecation @ > org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | > origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy > is deprecated {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38342) Clean up deprecated api usage of Ivy
Yang Jie created SPARK-38342: Summary: Clean up deprecated api usage of Ivy Key: SPARK-38342 URL: https://issues.apache.org/jira/browse/SPARK-38342 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Yang Jie {code:java} [WARNING] [Warn] /spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459: [deprecation @ org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy is deprecated {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-38326) aditya
[ https://issues.apache.org/jira/browse/SPARK-38326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vallepu Durga Aditya closed SPARK-38326. final > aditya > -- > > Key: SPARK-38326 > URL: https://issues.apache.org/jira/browse/SPARK-38326 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Vallepu Durga Aditya >Priority: Major > Fix For: 3.2.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38341) Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date
[ https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] davon.cao updated SPARK-38341: -- Component/s: SQL (was: Spark Submit) > Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date > > > Key: SPARK-38341 > URL: https://issues.apache.org/jira/browse/SPARK-38341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: davon.cao >Priority: Major > > Step to reproduce: > Version of spark sql: 3.2.1(latest version in maven repository) > Run sql: > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-30 (x) > > Version of spark sql: 2.4.3 > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-31 (/) > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
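The two results the reporter observes correspond to two different `add_months` semantics: day-of-month clamping (what 3.2.1 reportedly returns) versus last-day-of-month preservation (what 2.4.3 reportedly returns). As a hedged, Spark-free sketch, both can be reproduced with Python's stdlib `calendar` module; this illustrates the two behaviors, not Spark's implementation.

```python
# The two add_months semantics seen in the report, sketched with stdlib
# only. "Clamping" keeps the day-of-month, capped at the target month's
# length. "Last-day-preserving" maps the last day of the input month to
# the last day of the target month.
import calendar
from datetime import date

def _shift_month(d, months):
    y, m = divmod(d.year * 12 + (d.month - 1) + months, 12)
    return y, m + 1

def add_months_clamp(d, months):
    y, m = _shift_month(d, months)
    return date(y, m, min(d.day, calendar.monthrange(y, m)[1]))

def add_months_last_day(d, months):
    y, m = _shift_month(d, months)
    last = calendar.monthrange(y, m)[1]
    if d.day == calendar.monthrange(d.year, d.month)[1]:
        return date(y, m, last)  # preserve "last day of month"
    return date(y, m, min(d.day, last))

d = date(2020, 6, 30)  # last day of June
assert add_months_clamp(d, -1) == date(2020, 5, 30)     # reported 3.2.1 result
assert add_months_last_day(d, -1) == date(2020, 5, 31)  # reported 2.4.3 result
```

Whether the 3.x result is a bug or an intentional behavior change is exactly the question the ticket raises; the sketch only makes the difference between the two conventions explicit.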
[jira] [Updated] (SPARK-38341) Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date
[ https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] davon.cao updated SPARK-38341: -- Description: Step to reproduce: Version of spark sql: 3.2.1(latest version in maven repository) Run sql: spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() expect: 2020-05-31 actual: 2020-05-30 (x) Version of spark sql: 2.4.3 spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() expect: 2020-05-31 actual: 2020-05-31 (/) was: Step to reproduce: Version of spark sql: 3.2.1(latest version in maven repository) Run sql: spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() expect: 2020-05-31 actual: 2020-05-30 Version of spark sql: 2.4.3 (/) spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() expect: 2020-05-31 actual: 2020-05-31 > Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date > > > Key: SPARK-38341 > URL: https://issues.apache.org/jira/browse/SPARK-38341 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 3.2.1 >Reporter: davon.cao >Priority: Major > > Step to reproduce: > Version of spark sql: 3.2.1(latest version in maven repository) > Run sql: > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-30 (x) > > Version of spark sql: 2.4.3 > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-31 (/) > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38341) Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date
[ https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] davon.cao updated SPARK-38341: -- Summary: Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date (was: Spark sql - Function of add_ Months returns an incorrect date) > Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date > > > Key: SPARK-38341 > URL: https://issues.apache.org/jira/browse/SPARK-38341 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 3.2.1 >Reporter: davon.cao >Priority: Major > > Step to reproduce: > Version of spark sql: 3.2.1(latest version in maven repository) > Run sql: > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-30 > > Version of spark sql: 2.4.3 (/) > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-31 > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38341) Spark sql - Function of add_ Months returns an incorrect date
[ https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] davon.cao updated SPARK-38341: -- Description: Step to reproduce: Version of spark sql: 3.2.1(latest version in maven repository) Run sql: spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() expect: 2020-05-31 actual: 2020-05-30 Version of spark sql: 2.4.3 (/) spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() expect: 2020-05-31 actual: 2020-05-31 was: Step to reproduce: Version of spark sql: 3.2.1(latest version in maven repository) Run sql: spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() expect: 2020-05-31 actual: 2020-05-30 Version of spark sql: 2.4.3 spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() expect: 2020-05-31 actual: 2020-05-31 > Spark sql - Function of add_ Months returns an incorrect date > - > > Key: SPARK-38341 > URL: https://issues.apache.org/jira/browse/SPARK-38341 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 3.2.1 >Reporter: davon.cao >Priority: Major > > Step to reproduce: > Version of spark sql: 3.2.1(latest version in maven repository) > Run sql: > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-30 > > Version of spark sql: 2.4.3 (/) > spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() > expect: 2020-05-31 > actual: 2020-05-31 > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38331) csv parser exception when quote and escape are both double-quote and a value is just "," and column pruning enabled
[ https://issues.apache.org/jira/browse/SPARK-38331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38331: - Component/s: SQL (was: Input/Output) > csv parser exception when quote and escape are both double-quote and a value > is just "," and column pruning enabled > --- > > Key: SPARK-38331 > URL: https://issues.apache.org/jira/browse/SPARK-38331 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1 >Reporter: Christopher Auston >Priority: Minor > > Workaround: disable column pruning. > Example pyspark code (from Databricks): > {noformat} > import pyspark > print(pyspark.version.__version__) > # enable column pruning (reset default value) > spark.conf.set('spark.sql.csv.parser.columnPruning.enabled', 'true') > dbutils.fs.put(file='/tmp/example.csv', > contents='''"col1","b4_comma","comma","col4" > "","",",","x" > ''', overwrite=True) > df = spark.read.csv( > path='/tmp/example.csv' > ,inferSchema=True > ,header=True > ,escape='"' > ,multiLine=True > ,unescapedQuoteHandling='RAISE_ERROR' > ,mode='FAILFAST' > ) > ex = None > try: > df.select(df.col1,df.comma).take(1) > except Exception as e: > ex = e > > if ex: > print('[pruning] Exception is raised if b4_comma is NOT selected') > > df.select(df.b4_comma, df.comma).take(1) > print('[pruning] No exception if b4_comma is selected') > ex = None > try: > df.count() > except Exception as e: > ex = e > > if ex: > print('[pruning] Exception raised by count') > print('\ndisabling pruning\n') > > > # disable column pruning > spark.conf.set('spark.sql.csv.parser.columnPruning.enabled', 'false') > df.select(df.col1,df.comma).take(1) > print('[no prune] No exception if b4_comma is NOT selected') {noformat} > > Output: > {noformat} > 3.1.2 > Wrote 47 bytes. 
> [pruning] Exception is raised if b4_comma is NOT selected > [pruning] No exception if b4_comma is selected > [pruning] Exception raised by count > disabling pruning > [no prune] No exception if b4_comma is NOT selected {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
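For reference, the parse the reporter expects — quote and escape both being the double-quote character is the standard doubled-quote CSV convention — can be cross-checked with Python's stdlib `csv` module, whose default dialect uses `"` as the quote character with `doublequote=True`. This is a hedged cross-check of the expected row values only; Spark reads CSV through the univocity parser, which is where the column-pruning interaction goes wrong.

```python
# Cross-check the expected parse of the example file with Python's
# stdlib csv module. A correct parse of the second record yields two
# empty strings, a bare comma, and "x" — the values Spark's pruned
# read fails to produce.
import csv
import io

data = '"col1","b4_comma","comma","col4"\n"","",",","x"\n'
rows = list(csv.reader(io.StringIO(data)))

assert rows[0] == ["col1", "b4_comma", "comma", "col4"]
assert rows[1] == ["", "", ",", "x"]
```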
[jira] [Commented] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS
[ https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498678#comment-17498678 ] Hyukjin Kwon commented on SPARK-38329: -- Spark 2.4.X is EOL. Can you test and see if the issue persists in Spark 3+? > High I/O wait when Spark Structured Streaming checkpoint changed to EFS > --- > > Key: SPARK-38329 > URL: https://issues.apache.org/jira/browse/SPARK-38329 > Project: Spark > Issue Type: Question > Components: EC2, Input/Output, PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Neven Jovic >Priority: Major > Attachments: Screenshot from 2022-02-25 14-16-11.png > > > I'm currently running a spark structured streaming application written in > python (pyspark) where my source is a kafka topic and the sink is mongodb. I changed > my checkpoint to Amazon EFS, which is distributed on all spark workers, and > after that I got increased I/O wait, averaging 8% > > !Screenshot from 2022-02-25 14-16-11.png! > Currently I have 6000 messages coming to kafka every second, and every once > in a while I get a WARN message: > {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up > files for HDFSStateStoreProvider[id = (op=0,part=90),dir = > file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For > input string: "" > {quote} > I'm not quite sure whether that message has anything to do with the high I/O wait, and > whether this behavior is expected or something to be concerned about. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38327) JDBC Source with MariaDB connection returns column names as values
[ https://issues.apache.org/jira/browse/SPARK-38327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498677#comment-17498677 ] Hyukjin Kwon commented on SPARK-38327: -- I think it needs a MariaDB dialect that implements https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala > JDBC Source with MariaDB connection returns column names as values > -- > > Key: SPARK-38327 > URL: https://issues.apache.org/jira/browse/SPARK-38327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 > Environment: MariaDB version 10.3.10 > Running with spark-k8s-operator >Reporter: Marvin Rösch >Priority: Minor > > Using a JDBC source with the official MariaDB JDBC driver and a JDBC > connection URL like the following does not work as expected: > {noformat} > jdbc:mariadb://db.example.com:3306/schema {noformat} > Assume we have a table "values" like the following in MariaDB: > ||id (binary)||name (varchar)|| > |0xAB|Name 1| > |0xBC|Name 2| > We intend to create and display a data frame from it like this: > {code:scala} > spark.read > .format("jdbc") > .option("url", "jdbc:mariadb://db.example.com:3306/schema") > .option("dbtable", "values") > .load() > .show{code} > *Expected Behavior* > Using such a connection URL on an arbitrary MariaDB table or query results in > a data frame that reflects the table structure and content from MariaDB > correctly, with columns having the correct type and values. > The output of the above should be > {noformat} > ++--+ > | id| name| > ++--+ > |[AB]|Name 1| > |[BC]|Name 2| > ++--+{noformat} > *Observed Behavior* > Result rows contain column names as values, making them effectively useless > to work with. 
> The actual output is > {noformat} > +---++ > | id|name| > +---++ > |[69 64]|name| > |[69 64]|name| > +---++{noformat} > *Further information* > An easy workaround appears to be specifying "mysql" instead of "mariadb" in > the connection URL while explicitly specifying the MariaDB driver. I'd expect > the mariadb URL to work out of the box, however. > It looks like this has been an issue since at least 2016 according to a > [StackOverflow > post|https://stackoverflow.com/questions/38808463/incorrect-data-while-loading-jdbc-table-in-spark-sql]. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38326) aditya
[ https://issues.apache.org/jira/browse/SPARK-38326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38326. -- Resolution: Invalid > aditya > -- > > Key: SPARK-38326 > URL: https://issues.apache.org/jira/browse/SPARK-38326 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Vallepu Durga Aditya >Priority: Major > Fix For: 3.2.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38341) Spark sql - Function of add_ Months returns an incorrect date
davon.cao created SPARK-38341: - Summary: Spark sql - Function of add_ Months returns an incorrect date Key: SPARK-38341 URL: https://issues.apache.org/jira/browse/SPARK-38341 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 3.2.1 Reporter: davon.cao Step to reproduce: Version of spark sql: 3.2.1(latest version in maven repository) Run sql: spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() expect: 2020-05-31 actual: 2020-05-30 Version of spark sql: 2.4.3 spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas() expect: 2020-05-31 actual: 2020-05-31 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38337) Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to cleanup deprecated api usage
[ https://issues.apache.org/jira/browse/SPARK-38337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38337. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35665 [https://github.com/apache/spark/pull/35665] > Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to > cleanup deprecated api usage > -- > > Key: SPARK-38337 > URL: https://issues.apache.org/jira/browse/SPARK-38337 > Project: Spark > Issue Type: Improvement > Components: DStreams, MLlib, Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > Attachments: Screenshot_20220227-050659.png > > > In Scala 2.12, {{IterableLike.toIterator}} identified as > {{{}@deprecatedOverriding{}}}: > > {code:java} > @deprecatedOverriding("toIterator should stay consistent with iterator for > all Iterables: override iterator instead.", "2.11.0") > override def toIterator: Iterator[A] = iterator {code} > In Scala 2.13, {{IterableOnce.toIterator}} identified as {{{}@deprecated{}}}: > {code:java} > @deprecated("Use .iterator instead of .toIterator", "2.13.0") @`inline` final > def toIterator: Iterator[A] = iterator {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38337) Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to cleanup deprecated api usage
[ https://issues.apache.org/jira/browse/SPARK-38337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38337: Assignee: Yang Jie > Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to > cleanup deprecated api usage > -- > > Key: SPARK-38337 > URL: https://issues.apache.org/jira/browse/SPARK-38337 > Project: Spark > Issue Type: Improvement > Components: DStreams, MLlib, Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Attachments: Screenshot_20220227-050659.png > > > In Scala 2.12, {{IterableLike.toIterator}} identified as > {{{}@deprecatedOverriding{}}}: > > {code:java} > @deprecatedOverriding("toIterator should stay consistent with iterator for > all Iterables: override iterator instead.", "2.11.0") > override def toIterator: Iterator[A] = iterator {code} > In Scala 2.13, {{IterableOnce.toIterator}} identified as {{{}@deprecated{}}}: > {code:java} > @deprecated("Use .iterator instead of .toIterator", "2.13.0") @`inline` final > def toIterator: Iterator[A] = iterator {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38338) Remove test dependency of hamcrest
[ https://issues.apache.org/jira/browse/SPARK-38338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38338. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35666 [https://github.com/apache/spark/pull/35666] > Remove test dependency of hamcrest > -- > > Key: SPARK-38338 > URL: https://issues.apache.org/jira/browse/SPARK-38338 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > > SPARK-7081 introduces test dependency on hamcrest, but the current Spark UTs > doesn't rely too much on this library. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38338) Remove test dependency of hamcrest
[ https://issues.apache.org/jira/browse/SPARK-38338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38338: Assignee: Yang Jie > Remove test dependency of hamcrest > -- > > Key: SPARK-38338 > URL: https://issues.apache.org/jira/browse/SPARK-38338 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > SPARK-7081 introduces test dependency on hamcrest, but the current Spark UTs > doesn't rely too much on this library. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25
[ https://issues.apache.org/jira/browse/SPARK-38339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38339. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35668 [https://github.com/apache/spark/pull/35668] > Upgrade RoaringBitmap to 0.9.25 > --- > > Key: SPARK-38339 > URL: https://issues.apache.org/jira/browse/SPARK-38339 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25
[ https://issues.apache.org/jira/browse/SPARK-38339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38339: Assignee: Yang Jie > Upgrade RoaringBitmap to 0.9.25 > --- > > Key: SPARK-38339 > URL: https://issues.apache.org/jira/browse/SPARK-38339 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38288) Aggregate push down doesnt work using Spark SQL jdbc datasource with postgresql
[ https://issues.apache.org/jira/browse/SPARK-38288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498669#comment-17498669 ] Daniel Fernández commented on SPARK-38288: -- [~andrewfmurphy] however in [33352|https://github.com/apache/spark/pull/33352] and [33526|https://github.com/apache/spark/pull/33526] it is said that a JDBC implementation of aggregate pushdown has already been merged, including associated tests for H2. > Aggregate push down doesnt work using Spark SQL jdbc datasource with > postgresql > --- > > Key: SPARK-38288 > URL: https://issues.apache.org/jira/browse/SPARK-38288 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.2.1 >Reporter: Luis Lozano Coira >Priority: Major > Labels: DataSource, Spark-SQL > > I am establishing a connection with postgresql using the Spark SQL jdbc > datasource. I have started the spark shell including the postgres driver and > I can connect and execute queries without problems. I am using this statement: > {code:java} > val df = spark.read.format("jdbc").option("url", > "jdbc:postgresql://host:port/").option("driver", > "org.postgresql.Driver").option("dbtable", "test").option("user", > "postgres").option("password", > "***").option("pushDownAggregate",true).load() > {code} > I am adding the pushDownAggregate option because I would like the > aggregations are delegated to the source. But for some reason this is not > happening. > Reviewing this pull request, it seems that this feature should be merged into > 3.2. [https://github.com/apache/spark/pull/29695] > I am making the aggregations considering the mentioned limitations. 
> An example case where I don't see pushdown being done would be this one: > {code:java} > df.groupBy("name").max("age").show() > {code} > The results of the queryExecution are shown below: > {code:java} > scala> df.groupBy("name").max("age").queryExecution.executedPlan > res19: org.apache.spark.sql.execution.SparkPlan = > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[name#274], functions=[max(age#246)], output=[name#274, > max(age)#544]) >+- Exchange hashpartitioning(name#274, 200), ENSURE_REQUIREMENTS, [id=#205] > +- HashAggregate(keys=[name#274], functions=[partial_max(age#246)], > output=[name#274, max#548]) > +- Scan JDBCRelation(test) [numPartitions=1] [age#246,name#274] > PushedAggregates: [], PushedFilters: [], PushedGroupby: [], ReadSchema: > struct > scala> dfp.groupBy("name").max("age").queryExecution.toString > res20: String = > "== Parsed Logical Plan == > Aggregate [name#274], [name#274, max(age#246) AS max(age)#581] > +- Relation [age#246] JDBCRelation(test) [numPartitions=1] > == Analyzed Logical Plan == > name: string, max(age): int > Aggregate [name#274], [name#274, max(age#246) AS max(age)#581] > +- Relation [age#24... > {code} > What could be the problem? Should pushDownAggregate work in this case? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38288) Aggregate push down doesnt work using Spark SQL jdbc datasource with postgresql
[ https://issues.apache.org/jira/browse/SPARK-38288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498668#comment-17498668 ] Andrew Murphy commented on SPARK-38288: --- Hi [~llozano] I believe this is because JDBC DataSource V2 has not been fully implemented. Even though [29695|https://github.com/apache/spark/pull/29695] has merged, reading from a JDBC connection still defaults to JDBC DataSource V1. > Aggregate push down doesnt work using Spark SQL jdbc datasource with > postgresql > --- > > Key: SPARK-38288 > URL: https://issues.apache.org/jira/browse/SPARK-38288 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.2.1 >Reporter: Luis Lozano Coira >Priority: Major > Labels: DataSource, Spark-SQL > > I am establishing a connection with postgresql using the Spark SQL jdbc > datasource. I have started the spark shell including the postgres driver and > I can connect and execute queries without problems. I am using this statement: > {code:java} > val df = spark.read.format("jdbc").option("url", > "jdbc:postgresql://host:port/").option("driver", > "org.postgresql.Driver").option("dbtable", "test").option("user", > "postgres").option("password", > "***").option("pushDownAggregate",true).load() > {code} > I am adding the pushDownAggregate option because I would like the > aggregations are delegated to the source. But for some reason this is not > happening. > Reviewing this pull request, it seems that this feature should be merged into > 3.2. [https://github.com/apache/spark/pull/29695] > I am making the aggregations considering the mentioned limitations. 
> An example case where I don't see pushdown being done would be this one: > {code:java} > df.groupBy("name").max("age").show() > {code} > The results of the queryExecution are shown below: > {code:java} > scala> df.groupBy("name").max("age").queryExecution.executedPlan > res19: org.apache.spark.sql.execution.SparkPlan = > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[name#274], functions=[max(age#246)], output=[name#274, > max(age)#544]) >+- Exchange hashpartitioning(name#274, 200), ENSURE_REQUIREMENTS, [id=#205] > +- HashAggregate(keys=[name#274], functions=[partial_max(age#246)], > output=[name#274, max#548]) > +- Scan JDBCRelation(test) [numPartitions=1] [age#246,name#274] > PushedAggregates: [], PushedFilters: [], PushedGroupby: [], ReadSchema: > struct > scala> dfp.groupBy("name").max("age").queryExecution.toString > res20: String = > "== Parsed Logical Plan == > Aggregate [name#274], [name#274, max(age#246) AS max(age)#581] > +- Relation [age#246] JDBCRelation(test) [numPartitions=1] > == Analyzed Logical Plan == > name: string, max(age): int > Aggregate [name#274], [name#274, max(age#246) AS max(age)#581] > +- Relation [age#24... > {code} > What could be the problem? Should pushDownAggregate work in this case? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498640#comment-17498640 ] Sean R. Owen commented on SPARK-25075: -- Unknown, though I'd guess not this year. What depends on that though, dropping 2.12 support? > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.2.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498639#comment-17498639 ] Ismael Juma commented on SPARK-25075: - Is there a very rough timeline for 4.0 or it completely unknown at this stage? > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.2.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498639#comment-17498639 ] Ismael Juma edited comment on SPARK-25075 at 2/27/22, 5:39 PM: --- Is there a very rough timeline for 4.0 or is it completely unknown at this stage? was (Author: ijuma): Is there a very rough timeline for 4.0 or it completely unknown at this stage? > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.2.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498622#comment-17498622 ] Sean R. Owen commented on SPARK-25075: -- I don't think that's the plan. Certainly not to remove 2.12 before 4.0, but probably not to change defaults soon either. > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.2.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498615#comment-17498615 ] Yang Jie commented on SPARK-25075: -- Do we plan to switch Scala 2.13 to the default Scala version in Spark 3.3? > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.2.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling
[ https://issues.apache.org/jira/browse/SPARK-38112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498612#comment-17498612 ] Apache Spark commented on SPARK-38112: -- User 'ivoson' has created a pull request for this issue: https://github.com/apache/spark/pull/35670 > Use error classes in the execution errors of date/timestamp handling > > > Key: SPARK-38112 > URL: https://issues.apache.org/jira/browse/SPARK-38112 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryExecutionErrors: > * sparkUpgradeInReadingDatesError > * sparkUpgradeInWritingDatesError > * timeZoneIdNotSpecifiedForTimestampTypeError > * cannotConvertOrcTimestampToTimestampNTZError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryExecutionErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling
[ https://issues.apache.org/jira/browse/SPARK-38112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498611#comment-17498611 ] Apache Spark commented on SPARK-38112: -- User 'ivoson' has created a pull request for this issue: https://github.com/apache/spark/pull/35670 > Use error classes in the execution errors of date/timestamp handling > > > Key: SPARK-38112 > URL: https://issues.apache.org/jira/browse/SPARK-38112 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryExecutionErrors: > * sparkUpgradeInReadingDatesError > * sparkUpgradeInWritingDatesError > * timeZoneIdNotSpecifiedForTimestampTypeError > * cannotConvertOrcTimestampToTimestampNTZError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryExecutionErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling
[ https://issues.apache.org/jira/browse/SPARK-38112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38112: Assignee: (was: Apache Spark) > Use error classes in the execution errors of date/timestamp handling > > > Key: SPARK-38112 > URL: https://issues.apache.org/jira/browse/SPARK-38112 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryExecutionErrors: > * sparkUpgradeInReadingDatesError > * sparkUpgradeInWritingDatesError > * timeZoneIdNotSpecifiedForTimestampTypeError > * cannotConvertOrcTimestampToTimestampNTZError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryExecutionErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling
[ https://issues.apache.org/jira/browse/SPARK-38112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38112: Assignee: Apache Spark > Use error classes in the execution errors of date/timestamp handling > > > Key: SPARK-38112 > URL: https://issues.apache.org/jira/browse/SPARK-38112 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Migrate the following errors in QueryExecutionErrors: > * sparkUpgradeInReadingDatesError > * sparkUpgradeInWritingDatesError > * timeZoneIdNotSpecifiedForTimestampTypeError > * cannotConvertOrcTimestampToTimestampNTZError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryExecutionErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
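The migration described in SPARK-38112 replaces free-form exception messages with named error classes, so a test can assert on a stable class name and its parameters rather than on message text. A minimal Python sketch of that pattern (the class, function, and error-class names below are illustrative, not Spark's actual SparkThrowable API):

```python
# Error-class pattern: each error carries a stable class name plus message
# parameters. Tests assert on the class, not on a free-form string.
# All names here are hypothetical illustrations, not Spark's real API.
class SparkThrowableLike(Exception):
    def __init__(self, error_class: str, message_parameters: dict):
        self.error_class = error_class
        self.message_parameters = message_parameters
        params = ", ".join(f"{k}={v}" for k, v in message_parameters.items())
        super().__init__(f"[{error_class}] {params}")

def time_zone_id_not_specified_error(tz_config: str):
    # Loosely mirrors timeZoneIdNotSpecifiedForTimestampTypeError above.
    return SparkThrowableLike(
        "TIMEZONE_ID_NOT_SPECIFIED",
        {"config": tz_config},
    )

err = time_zone_id_not_specified_error("spark.sql.session.timeZone")
print(err.error_class)  # TIMEZONE_ID_NOT_SPECIFIED
```

A per-error test, as the ticket asks for, then reduces to constructing the error and asserting on `error_class` and `message_parameters`.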
[jira] [Assigned] (SPARK-38041) DataFilter pushed down with PartitionFilter
[ https://issues.apache.org/jira/browse/SPARK-38041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38041: Assignee: (was: Apache Spark) > DataFilter pushed down with PartitionFilter > --- > > Key: SPARK-38041 > URL: https://issues.apache.org/jira/browse/SPARK-38041 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Jackey Lee >Priority: Major > > At present, the Filter is divided into DataFilter and PartitionFilter when it > is pushed down, but when the Filter removes the PartitionFilter, it means > that all Partitions will scan all DataFilter conditions, which may cause full > data scan. > Here is a example. > before > {code:java} > == Physical Plan == > *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= > 1)) AND (c#42 < 3))) > +- *(1) ColumnarToRow > +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L > < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], > Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: > [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], > PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: > [], ReadSchema: struct, PushedFilters: > [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], > PushedGroupBy: [] RuntimeFilters: [] > {code} > after > {code:java} > == Physical Plan == > *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= > 1)) AND (c#42 < 3))) > +- *(1) ColumnarToRow > +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L > < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], > Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: > [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], > PushedFilters: > 
[Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),Le..., > PushedGroupBy: [], ReadSchema: struct, PushedFilters: > [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),LessThan(c,3)))], > PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38041) DataFilter pushed down with PartitionFilter
[ https://issues.apache.org/jira/browse/SPARK-38041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498598#comment-17498598 ] Apache Spark commented on SPARK-38041: -- User 'stczwd' has created a pull request for this issue: https://github.com/apache/spark/pull/35669 > DataFilter pushed down with PartitionFilter > --- > > Key: SPARK-38041 > URL: https://issues.apache.org/jira/browse/SPARK-38041 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Jackey Lee >Priority: Major > > At present, the Filter is divided into DataFilter and PartitionFilter when it > is pushed down, but when the Filter removes the PartitionFilter, it means > that all Partitions will scan all DataFilter conditions, which may cause full > data scan. > Here is a example. > before > {code:java} > == Physical Plan == > *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= > 1)) AND (c#42 < 3))) > +- *(1) ColumnarToRow > +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L > < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], > Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: > [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], > PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: > [], ReadSchema: struct, PushedFilters: > [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], > PushedGroupBy: [] RuntimeFilters: [] > {code} > after > {code:java} > == Physical Plan == > *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= > 1)) AND (c#42 < 3))) > +- *(1) ColumnarToRow > +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L > < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], > Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: > [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], 
PushedAggregation: [], > PushedFilters: > [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),Le..., > PushedGroupBy: [], ReadSchema: struct, PushedFilters: > [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),LessThan(c,3)))], > PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38041) DataFilter pushed down with PartitionFilter
[ https://issues.apache.org/jira/browse/SPARK-38041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38041: Assignee: Apache Spark > DataFilter pushed down with PartitionFilter > --- > > Key: SPARK-38041 > URL: https://issues.apache.org/jira/browse/SPARK-38041 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Jackey Lee >Assignee: Apache Spark >Priority: Major > > At present, the Filter is divided into DataFilter and PartitionFilter when it > is pushed down, but when the Filter removes the PartitionFilter, it means > that all Partitions will scan all DataFilter conditions, which may cause full > data scan. > Here is a example. > before > {code:java} > == Physical Plan == > *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= > 1)) AND (c#42 < 3))) > +- *(1) ColumnarToRow > +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L > < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], > Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: > [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], > PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: > [], ReadSchema: struct, PushedFilters: > [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], > PushedGroupBy: [] RuntimeFilters: [] > {code} > after > {code:java} > == Physical Plan == > *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= > 1)) AND (c#42 < 3))) > +- *(1) ColumnarToRow > +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L > < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], > Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: > [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], > PushedFilters: > 
[Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),Le..., > PushedGroupBy: [], ReadSchema: struct, PushedFilters: > [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),LessThan(c,3)))], > PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38041) DataFilter pushed down with PartitionFilter
[ https://issues.apache.org/jira/browse/SPARK-38041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jackey Lee updated SPARK-38041: --- Description: At present, the Filter is divided into DataFilter and PartitionFilter when it is pushed down, but when the Filter removes the PartitionFilter, it means that all Partitions will scan all DataFilter conditions, which may cause full data scan. Here is an example. before {code:java} == Physical Plan == *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3))) +- *(1) ColumnarToRow +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: [], ReadSchema: struct, PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] {code} after {code:java} == Physical Plan == *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3))) +- *(1) ColumnarToRow +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], PushedFilters: [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),Le..., PushedGroupBy: [], ReadSchema: struct, PushedFilters: [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),LessThan(c,3)))], PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] {code} was: At present, the Filter is divided into DataFilter and PartitionFilter
when it is pushed down, but when the Filter removes the PartitionFilter, it means that all Partitions will scan all DataFilter conditions, which may cause full data scan. Here is an example. before {code:java} == Physical Plan == *(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND (c#2 < 3))) +- *(1) ColumnarToRow +- FileScan parquet datasources.test_push_down[a#0,b#1,c#2] Batched: true, DataFilters: [((a#0 < 10) OR (a#0 >= 10))], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [((c#2 = 0) OR ((c#2 >= 1) AND (c#2 < 3)))], PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], ReadSchema: struct {code} after {code:java} == Physical Plan == *(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND (c#2 < 3))) +- *(1) ColumnarToRow +- FileScan parquet datasources.test_push_down[a#0,b#1,c#2] Batched: true, DataFilters: [(((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND (c#2 < 3)))], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [((c#2 = 0) OR ((c#2 >= 1) AND (c#2 < 3)))], PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], ReadSchema: struct {code} > DataFilter pushed down with PartitionFilter > --- > > Key: SPARK-38041 > URL: https://issues.apache.org/jira/browse/SPARK-38041 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Jackey Lee >Priority: Major > > At present, the Filter is divided into DataFilter and PartitionFilter when it > is pushed down, but when the Filter removes the PartitionFilter, it means > that all Partitions will scan all DataFilter conditions, which may cause full > data scan. > Here is an example. 
> before > {code:java} > == Physical Plan == > *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= > 1)) AND (c#42 < 3))) > +- *(1) ColumnarToRow > +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L > < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], > Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: > [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], > PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: > [], ReadSchema: struct, PushedFilters: > [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], > PushedGroupBy: [] RuntimeFilters: [] > {code} > after > {code:java} > == Physical Plan == > *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= > 1)) AND (c#42 < 3))) > +- *(1) ColumnarToRow > +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L > < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 <
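The gain from the combined pushed filter is easy to check on concrete rows: the old pushed filter Or(LessThan(a,10),GreaterThanOrEqual(a,10)) is true for every row, so it prunes nothing at the source, while the new filter, which keeps the partition conditions next to the data conditions, can actually skip rows. A plain-Python sketch (made-up rows, not Spark's filter API):

```python
# Rows are (a, c) pairs; a plays the data column, c the partition column.
# The rows and values are hypothetical.
rows = [(5, 0), (5, 2), (20, 0), (20, 2), (20, 5)]

def old_pushed(a, c):
    # Or(LessThan(a,10), GreaterThanOrEqual(a,10)): trivially true for all a.
    return a < 10 or a >= 10

def new_pushed(a, c):
    # Or(And(a<10, c=0), And(And(a>=10, c>=1), c<3)): keeps partition
    # conditions alongside the data conditions, so it can prune rows.
    return (a < 10 and c == 0) or (a >= 10 and 1 <= c < 3)

kept_old = [r for r in rows if old_pushed(*r)]
kept_new = [r for r in rows if new_pushed(*r)]
print(len(kept_old), len(kept_new))  # 5 2
```

With the old form, every row survives the pushed filter and must be re-checked by the Spark-side Filter node; with the combined form, most rows never leave the scan.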
[jira] [Updated] (SPARK-38340) Upgrade protobuf-java from 2.5.0 to 3.16.1
[ https://issues.apache.org/jira/browse/SPARK-38340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-38340: Description: [CVE-2021-22569|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-22569] To do this upgrade I have done external/kinesis-asl-assembly/pom.xml change line 65 to 3.16.1 pom.xml change line 124 to 3.16.1 run ./dev/test-dependencies.sh --replace-manifest which changes dev/deps/spark-deps-hadoop-2-hive-2.3 line 235 to protobuf-java/3.16.1//protobuf-java-3.16.1.jar and dev/deps/spark-deps-hadoop-3-hive-2.3 to protobuf-java/3.16.1//protobuf-java-3.16.1.jar My branch [protobuf-java-from-2.5.0-to-3.16.1|https://github.com/bjornjorgensen/spark/tree/protobuf-java-from-2.5.0-to-3.16.1] is OK with tests, but when I run ./build/mvn -DskipTests clean package && ./build/mvn -e package I get this error: 01:01:41.381 WARN org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite: = POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite, threads: rpc-boss-3348-1 (daemon=true), shuffle-boss-3351-1 (daemon=true) = Run completed in 1 hour, 7 minutes, 35 seconds. Total number of tests run: 11260 Suites: completed 505, aborted 0 Tests: succeeded 11259, failed 1, canceled 5, ignored 57, pending 0 *** 1 TEST FAILED *** [INFO] [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 3.396 s] [INFO] Spark Project Tags . SUCCESS [ 7.374 s] [INFO] Spark Project Sketch ... SUCCESS [ 9.324 s] [INFO] Spark Project Local DB . SUCCESS [ 4.097 s] [INFO] Spark Project Networking ... SUCCESS [ 47.468 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 10.478 s] [INFO] Spark Project Unsafe ... SUCCESS [ 2.425 s] [INFO] Spark Project Launcher . SUCCESS [ 2.767 s] [INFO] Spark Project Core . SUCCESS [30:56 min] [INFO] Spark Project ML Local Library . SUCCESS [ 29.105 s] [INFO] Spark Project GraphX ... 
SUCCESS [02:09 min] [INFO] Spark Project Streaming SUCCESS [05:21 min] [INFO] Spark Project Catalyst . SUCCESS [08:15 min] [INFO] Spark Project SQL .. FAILURE [ 01:11 h] [INFO] Spark Project ML Library ... SKIPPED [INFO] Spark Project Tools SKIPPED [INFO] Spark Project Hive . SKIPPED [INFO] Spark Project REPL . SKIPPED [INFO] Spark Project Assembly . SKIPPED [INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED [INFO] Spark Integration for Kafka 0.10 ... SKIPPED [INFO] Kafka 0.10+ Source for Structured Streaming SKIPPED [INFO] Spark Project Examples . SKIPPED [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED [INFO] Spark Avro . SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 02:00 h [INFO] Finished at: 2022-02-27T01:01:44+01:00 [INFO] [ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project spark-sql_2.12: There are test failures -> [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project spark-sql_2.12: There are test failures at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81) at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128) at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305) at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
[jira] [Created] (SPARK-38340) Upgrade protobuf-java from 2.5.0 to 3.16.1
Bjørn Jørgensen created SPARK-38340: --- Summary: Upgrade protobuf-java from 2.5.0 to 3.16.1 Key: SPARK-38340 URL: https://issues.apache.org/jira/browse/SPARK-38340 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.3.0 Reporter: Bjørn Jørgensen [CVE-2021-22569|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-22569] To do this upgrade I have done external/kinesis-asl-assembly/pom.xml change line 65 to 3.16.1 pom.xml change line 124 to 3.16.1 run ./dev/test-dependencies.sh --replace-manifest which changes dev/deps/spark-deps-hadoop-2-hive-2.3 line 235 to protobuf-java/3.16.1//protobuf-java-3.16.1.jar and dev/deps/spark-deps-hadoop-3-hive-2.3 to protobuf-java/3.16.1//protobuf-java-3.16.1.jar My branch [protobuf-java-from-2.5.0-to-3.16.1|https://github.com/bjornjorgensen/spark/tree/protobuf-java-from-2.5.0-to-3.16.1] is OK with tests, but when I run ./build/mvn -DskipTests clean package && ./build/mvn -e package I get this error: 01:01:41.381 WARN org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite: = POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite, threads: rpc-boss-3348-1 (daemon=true), shuffle-boss-3351-1 (daemon=true) = Run completed in 1 hour, 7 minutes, 35 seconds. Total number of tests run: 11260 Suites: completed 505, aborted 0 Tests: succeeded 11259, failed 1, canceled 5, ignored 57, pending 0 *** 1 TEST FAILED *** [INFO] [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 3.396 s] [INFO] Spark Project Tags . SUCCESS [ 7.374 s] [INFO] Spark Project Sketch ... SUCCESS [ 9.324 s] [INFO] Spark Project Local DB . SUCCESS [ 4.097 s] [INFO] Spark Project Networking ... SUCCESS [ 47.468 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 10.478 s] [INFO] Spark Project Unsafe ... SUCCESS [ 2.425 s] [INFO] Spark Project Launcher . SUCCESS [ 2.767 s] [INFO] Spark Project Core . 
SUCCESS [30:56 min] [INFO] Spark Project ML Local Library . SUCCESS [ 29.105 s] [INFO] Spark Project GraphX ... SUCCESS [02:09 min] [INFO] Spark Project Streaming SUCCESS [05:21 min] [INFO] Spark Project Catalyst . SUCCESS [08:15 min] [INFO] Spark Project SQL .. FAILURE [ 01:11 h] [INFO] Spark Project ML Library ... SKIPPED [INFO] Spark Project Tools SKIPPED [INFO] Spark Project Hive . SKIPPED [INFO] Spark Project REPL . SKIPPED [INFO] Spark Project Assembly . SKIPPED [INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED [INFO] Spark Integration for Kafka 0.10 ... SKIPPED [INFO] Kafka 0.10+ Source for Structured Streaming SKIPPED [INFO] Spark Project Examples . SKIPPED [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED [INFO] Spark Avro . SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 02:00 h [INFO] Finished at: 2022-02-27T01:01:44+01:00 [INFO] [ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project spark-sql_2.12: There are test failures -> [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project spark-sql_2.12: There are test failures at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81) at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute
[jira] [Assigned] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25
[ https://issues.apache.org/jira/browse/SPARK-38339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38339: Assignee: (was: Apache Spark) > Upgrade RoaringBitmap to 0.9.25 > --- > > Key: SPARK-38339 > URL: https://issues.apache.org/jira/browse/SPARK-38339 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25
[ https://issues.apache.org/jira/browse/SPARK-38339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38339: Assignee: Apache Spark > Upgrade RoaringBitmap to 0.9.25 > --- > > Key: SPARK-38339 > URL: https://issues.apache.org/jira/browse/SPARK-38339 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25
[ https://issues.apache.org/jira/browse/SPARK-38339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498577#comment-17498577 ] Apache Spark commented on SPARK-38339: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/35668 > Upgrade RoaringBitmap to 0.9.25 > --- > > Key: SPARK-38339 > URL: https://issues.apache.org/jira/browse/SPARK-38339 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38339) Upgrade RoaringBitmap to 0.9.25
Yang Jie created SPARK-38339: Summary: Upgrade RoaringBitmap to 0.9.25 Key: SPARK-38339 URL: https://issues.apache.org/jira/browse/SPARK-38339 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38337) Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to cleanup deprecated api usage
[ https://issues.apache.org/jira/browse/SPARK-38337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jj updated SPARK-38337: --- Attachment: Screenshot_20220227-050659.png > Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to > cleanup deprecated api usage > -- > > Key: SPARK-38337 > URL: https://issues.apache.org/jira/browse/SPARK-38337 > Project: Spark > Issue Type: Improvement > Components: DStreams, MLlib, Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > Attachments: Screenshot_20220227-050659.png > > > In Scala 2.12, {{IterableLike.toIterator}} identified as > {{{}@deprecatedOverriding{}}}: > > {code:java} > @deprecatedOverriding("toIterator should stay consistent with iterator for > all Iterables: override iterator instead.", "2.11.0") > override def toIterator: Iterator[A] = iterator {code} > In Scala 2.13, {{IterableOnce.toIterator}} identified as {{{}@deprecated{}}}: > {code:java} > @deprecated("Use .iterator instead of .toIterator", "2.13.0") @`inline` final > def toIterator: Iterator[A] = iterator {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org