[jira] [Assigned] (SPARK-35440) Add language type to `ExpressionInfo` for UDF
[ https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-35440:
-----------------------------------
    Assignee: Linhong Liu

> Add language type to `ExpressionInfo` for UDF
> ---------------------------------------------
>
> Key: SPARK-35440
> URL: https://issues.apache.org/jira/browse/SPARK-35440
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Linhong Liu
> Assignee: Linhong Liu
> Priority: Major
>
> add "scala", "java", "python", "hive", "built-in"
[jira] [Resolved] (SPARK-35440) Add language type to `ExpressionInfo` for UDF
[ https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-35440.
---------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 32587
[https://github.com/apache/spark/pull/32587]

> Add language type to `ExpressionInfo` for UDF
> ---------------------------------------------
>
> Key: SPARK-35440
> URL: https://issues.apache.org/jira/browse/SPARK-35440
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Linhong Liu
> Assignee: Linhong Liu
> Priority: Major
> Fix For: 3.2.0
>
> add "scala", "java", "python", "hive", "built-in"
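A rough sketch of what the ticket is asking for, on the consumer side. The `UdfInfo` case class and its `language` field below are hypothetical illustrations, not the actual `ExpressionInfo` API added by the pull request:

{code:scala}
// Hypothetical sketch only: a registry entry carrying the language a function
// was defined in, mirroring the idea of adding a language type to ExpressionInfo.
final case class UdfInfo(name: String, className: String, language: String)

val entries = Seq(
  UdfInfo("abs", "org.apache.spark.sql.catalyst.expressions.Abs", "built-in"),
  UdfInfo("my_udf", "com.example.MyUdf", "scala"),
  UdfInfo("py_udf", "example.py_udf", "python"))

// Only the five values listed in the ticket would be allowed.
val allowed = Set("scala", "java", "python", "hive", "built-in")
assert(entries.forall(e => allowed.contains(e.language)))
{code}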
[jira] [Assigned] (SPARK-35527) Fix HiveExternalCatalogVersionsSuite to pass with Java 11
[ https://issues.apache.org/jira/browse/SPARK-35527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35527:
------------------------------------
    Assignee: Kousuke Saruta  (was: Apache Spark)

> Fix HiveExternalCatalogVersionsSuite to pass with Java 11
> ----------------------------------------------------------
>
> Key: SPARK-35527
> URL: https://issues.apache.org/jira/browse/SPARK-35527
> Project: Spark
> Issue Type: Bug
> Components: SQL, Tests
> Affects Versions: 3.2.0
> Reporter: Kousuke Saruta
> Assignee: Kousuke Saruta
> Priority: Minor
>
> I'm personally checking whether all the tests pass with Java 11 on the current master, and I found that HiveExternalCatalogVersionsSuite fails.
> The reason is that Spark 3.0.2 and 3.1.1 don't accept 2.3.8 as a Hive metastore version.
> HiveExternalCatalogVersionsSuite downloads Spark releases from https://dist.apache.org/repos/dist/release/spark/ and runs the test for each release. The Spark releases are 3.0.2 and 3.1.1 for now.
> With Java 11, the suite runs with the Hive metastore version that corresponds to the built-in Hive version, which is 2.3.8 for the current master. But because branch-3.0 and branch-3.1 don't accept 2.3.8, the suite fails with Java 11.
[jira] [Commented] (SPARK-35527) Fix HiveExternalCatalogVersionsSuite to pass with Java 11
[ https://issues.apache.org/jira/browse/SPARK-35527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351482#comment-17351482 ]

Apache Spark commented on SPARK-35527:
--------------------------------------

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32670

> Fix HiveExternalCatalogVersionsSuite to pass with Java 11
> ----------------------------------------------------------
[jira] [Assigned] (SPARK-35527) Fix HiveExternalCatalogVersionsSuite to pass with Java 11
[ https://issues.apache.org/jira/browse/SPARK-35527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35527:
------------------------------------
    Assignee: Apache Spark  (was: Kousuke Saruta)

> Fix HiveExternalCatalogVersionsSuite to pass with Java 11
> ----------------------------------------------------------
[jira] [Created] (SPARK-35527) Fix HiveExternalCatalogVersionsSuite to pass with Java 11
Kousuke Saruta created SPARK-35527:
-----------------------------------

Summary: Fix HiveExternalCatalogVersionsSuite to pass with Java 11
Key: SPARK-35527
URL: https://issues.apache.org/jira/browse/SPARK-35527
Project: Spark
Issue Type: Bug
Components: SQL, Tests
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta

I'm personally checking whether all the tests pass with Java 11 on the current master, and I found that HiveExternalCatalogVersionsSuite fails.
The reason is that Spark 3.0.2 and 3.1.1 don't accept 2.3.8 as a Hive metastore version.
HiveExternalCatalogVersionsSuite downloads Spark releases from https://dist.apache.org/repos/dist/release/spark/ and runs the test for each release. The Spark releases are 3.0.2 and 3.1.1 for now.
With Java 11, the suite runs with the Hive metastore version that corresponds to the built-in Hive version, which is 2.3.8 for the current master. But because branch-3.0 and branch-3.1 don't accept 2.3.8, the suite fails with Java 11.
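For context, one way to sidestep the mismatch is to pin the metastore version when building the session for the downloaded releases. This is only a sketch of the idea, not the actual suite change; the pinned value "2.3.7" is illustrative:

{code:scala}
// Sketch: pin the Hive metastore version explicitly so that the older 3.0.x/3.1.x
// branches are not asked to use the 2.3.8 default of the current master.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-catalog-version-check")
  .config("spark.sql.hive.metastore.version", "2.3.7") // example value accepted by branch-3.0/3.1
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
{code}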
[jira] [Commented] (SPARK-35526) Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-35526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351481#comment-17351481 ]

Apache Spark commented on SPARK-35526:
--------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32669

> Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
> ------------------------------------------------------------------------------
[jira] [Assigned] (SPARK-35526) Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-35526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35526:
------------------------------------
    Assignee:  (was: Apache Spark)

> Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
> ------------------------------------------------------------------------------
[jira] [Assigned] (SPARK-35526) Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-35526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35526:
------------------------------------
    Assignee: Apache Spark

> Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
> ------------------------------------------------------------------------------
[jira] [Commented] (SPARK-35526) Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-35526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351480#comment-17351480 ]

Apache Spark commented on SPARK-35526:
--------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32669

> Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
> ------------------------------------------------------------------------------
[jira] [Commented] (SPARK-34271) Use majorMinorPatchVersion for Hive version parsing
[ https://issues.apache.org/jira/browse/SPARK-34271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351478#comment-17351478 ]

Apache Spark commented on SPARK-34271:
--------------------------------------

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32668

> Use majorMinorPatchVersion for Hive version parsing
> ----------------------------------------------------
>
> Key: SPARK-34271
> URL: https://issues.apache.org/jira/browse/SPARK-34271
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Minor
> Fix For: 3.2.0
>
> Currently {{IsolatedClientLoader}} needs to enumerate all Hive patch versions. Therefore, whenever we upgrade the Hive version we have to remember to update the method. It would be better if we just checked the major & minor version.
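A minimal sketch of the idea (not the `IsolatedClientLoader` code itself): resolve the Hive client by major.minor only, so a new patch release needs no enumeration:

{code:scala}
// Parse only the major and minor components of a Hive version string.
def majorMinor(version: String): (Int, Int) = version.split("\\.") match {
  case Array(major, minor, _*) => (major.toInt, minor.toInt)
  case _ => throw new IllegalArgumentException(s"Unparseable Hive version: $version")
}

assert(majorMinor("2.3.8") == (2, 3))
assert(majorMinor("2.3.9") == majorMinor("2.3.8")) // same client for any 2.3.x patch
{code}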
[jira] [Created] (SPARK-35526) Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
Yang Jie created SPARK-35526:
-----------------------------

Summary: Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
Key: SPARK-35526
URL: https://issues.apache.org/jira/browse/SPARK-35526
Project: Spark
Issue Type: Sub-task
Components: Spark Core, SQL
Affects Versions: 3.2.0
Reporter: Yang Jie

Similar to SPARK-29291 and SPARK-33352, just to track Spark 3.2.0.

There are still some compilation warnings about `procedure syntax is deprecated`:

{code:java}
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723: [deprecation @ | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return type
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:748: [deprecation @ | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `unregisterMergeResult`'s return type
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala:223: [deprecation @ | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `testSimpleSpillingForAllCodecs`'s return type
[WARNING] [Warn] /spark/mllib-local/src/test/scala/org/apache/spark/ml/linalg/BLASBenchmark.scala:53: [deprecation @ | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `runBLASBenchmark`'s return type
[WARNING] [Warn] /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:110: [deprecation @ | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `assertEmptyRootPath`'s return type
[WARNING] [Warn] /spark/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:602: [deprecation @ | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `executeCTASWithNonEmptyLocation`'s return type
{code}
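For reference, the warning refers to Scala's old "procedure syntax" and the cleanup is purely syntactic. A sketch (the method name is taken from the first warning, the signature and body are placeholders):

{code:scala}
// Deprecated procedure syntax in Scala 2.13 (no "=", no declared result type):
//   def registerMergeResult(shuffleId: Int, reduceId: Int) { ... }
// The cleaned-up form with an explicit Unit return type:
def registerMergeResult(shuffleId: Int, reduceId: Int): Unit = {
  println(s"register merge result for shuffle $shuffleId, reduce $reduceId") // placeholder body
}
{code}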
[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index
[ https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-34859:
-----------------------------
    Priority: Critical  (was: Major)

> Vectorized parquet reader needs synchronization among pages for column index
> -----------------------------------------------------------------------------
>
> Key: SPARK-34859
> URL: https://issues.apache.org/jira/browse/SPARK-34859
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Li Xian
> Priority: Critical
> Labels: correctness
> Attachments: part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet
>
> The current implementation has a problem: the pages returned by `readNextFilteredRowGroup` may not be aligned, so some columns may have more rows than others.
> Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` with `rowIndexes` to make sure that rows are aligned.
> Currently `VectorizedParquetRecordReader` doesn't have such synchronization among pages from different columns, so using `readNextFilteredRowGroup` may produce incorrect results.
>
> I have attached an example parquet file. The file is generated with `spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and its layout is listed below.
>
> row group 0
> _1: INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED [more]...
> _2: INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 ENC:PLAIN,BIT_PACKED [more]...
>
> _1 TV=2000 RL=0 DL=0
>   page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500
>   page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500
>   page 2: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500
>   page 3: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500
>
> _2 TV=2000 RL=0 DL=0
>   page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:1000
>   page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:1000
>
> As you can see, in row group 0 column _1 has 4 data pages with 500 values each and column _2 has 2 data pages with 1000 values each.
> If we want to filter the rows by _1 = 510 using the column index, Parquet will return page 1 of column _1 and page 0 of column _2. Page 1 of column _1 starts at row 500 and page 0 of column _2 starts at row 0, so it is incorrect to simply read the two values as one row.
>
> For example, if you filter with _1 = 510 with the column index on in the current version, it will give you the wrong result:
> +---+---+
> |_1 |_2 |
> +---+---+
> |510|10 |
> +---+---+
> And if you turn the column index off, you get the correct result:
> +---+---+
> |_1 |_2 |
> +---+---+
> |510|510|
> +---+---+
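The report can be reproduced from spark-shell roughly as follows. The path is illustrative, and whether the wrong row appears depends on Parquet column index filtering being active in the build under test:

{code:scala}
// Reproduction sketch based on the description above.
import spark.implicits._

spark.range(0, 2000).map(i => (i.toLong, i.toInt))
  .write.mode("overwrite").parquet("/tmp/spark-34859")

// With column index filtering in effect, the vectorized reader can pair page 1
// of _1 with page 0 of _2; the report shows |510|10| instead of |510|510|.
spark.read.parquet("/tmp/spark-34859").filter($"_1" === 510).show(false)
{code}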
[jira] [Commented] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351473#comment-17351473 ]

Yang Jie commented on SPARK-35496:
----------------------------------

ok [~dongjoon]

> Upgrade Scala 2.13 to 2.13.7
> ----------------------------
>
> Key: SPARK-35496
> URL: https://issues.apache.org/jira/browse/SPARK-35496
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Yang Jie
> Priority: Major
>
> This issue aims to upgrade to Scala 2.13.7.
> Scala 2.13.6 has been released (https://github.com/scala/scala/releases/tag/v2.13.6). However, we skip 2.13.6 because there is a breaking behavior change in 2.13.6 that differs from both Scala 2.13.5 and Scala 3.
> - https://github.com/scala/bug/issues/12403
> {code}
> scala3-3.0.0:$ bin/scala
> scala> Array.empty[Double].intersect(Array(0.0))
> val res0: Array[Double] = Array()
>
> scala-2.13.6:$ bin/scala
> Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
> Type in expressions for evaluation. Or try :help.
> scala> Array.empty[Double].intersect(Array(0.0))
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D
>   ... 32 elided
> {code}
[jira] [Updated] (SPARK-35378) Eagerly execute non-root Command
[ https://issues.apache.org/jira/browse/SPARK-35378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jiaan.geng updated SPARK-35378:
-------------------------------
    Summary: Eagerly execute non-root Command  (was: Eagerly execute Command)

> Eagerly execute non-root Command
> ---------------------------------
>
> Key: SPARK-35378
> URL: https://issues.apache.org/jira/browse/SPARK-35378
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: jiaan.geng
> Priority: Major
>
> Currently, Spark doesn't support a LeafRunnableCommand as a subquery.
> Because a LeafRunnableCommand always outputs GenericInternalRow, and some nodes (e.g. SortExec, AdaptiveExecutionExec, WholeCodegenExec) convert GenericInternalRow to UnsafeRow, this causes errors such as:
> {code:java}
> java.lang.ClassCastException
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> {code}
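An illustration of the two row representations involved (not the change proposed in this ticket): a `GenericInternalRow` can be converted to an `UnsafeRow` via an `UnsafeProjection`, whereas the plain cast in the stack trace above fails. Schema and values below are made up for the example:

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

val schema = StructType(Seq(StructField("tableName", StringType)))
// A command-style row: backed by a plain object array, not Tungsten memory.
val generic: InternalRow = new GenericInternalRow(Array[Any](UTF8String.fromString("t1")))
// Explicit conversion instead of a cast; this is what downstream operators need.
val unsafe = UnsafeProjection.create(schema).apply(generic)
{code}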
[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index
[ https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-34859:
-----------------------------
    Labels: correctness  (was: )

> Vectorized parquet reader needs synchronization among pages for column index
> -----------------------------------------------------------------------------
[jira] [Comment Edited] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351464#comment-17351464 ]

dc-heros edited comment on SPARK-30696 at 5/26/21, 3:54 AM:
------------------------------------------------------------

fromUTCtime and toUTCtime produce wrong results on days with Daylight Saving Time changes.
For example, in LA in 1960 the timezone switched from UTC-7h to UTC-8h at 2 AM on 1960-09-25, but the previous version has the cutoff at 8 AM.
Because of this, 1960-09-25 01:30:00 in LA can correspond to both 1960-09-25 08:30:00 and 1960-09-25 09:30:00 UTC, and fromUTCtime just picks one of them, so these functions are only wrong around the cutoff time.
Could you edit the description [~maxgekk]

was (Author: dc-heros):
fromUTCtime and toUTCtime produce wrong results on days with Daylight Saving Time changes.
For example, in LA in 1960 the timezone switched from UTC-7h to UTC-8h at 2 AM on 1960-09-25, but the previous version has the cutoff at 8 AM.
Because of this, 1960-09-25 01:30:00 in LA can correspond to both 1960-09-25 08:30:00 and 1960-09-25 09:30:00 UTC, so these functions are only wrong around the cutoff time.
Could you edit the description [~maxgekk]

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> ---------------------------------------------------------------------------
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4, 3.0.0
> Reporter: Max Gekk
> Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return the original timestamp in the same time zone. In the range of 100 years, the combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala> val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}
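A sketch of the overlap described in the comment, runnable in spark-shell. Both UTC instants map to local 01:30 in America/Los_Angeles, so the inverse conversion can recover only one of them and the other necessarily comes back shifted by an hour:

{code:scala}
import spark.implicits._

val tz = "America/Los_Angeles"
Seq("1960-09-25 08:30:00", "1960-09-25 09:30:00").toDF("utc")
  .selectExpr(
    "utc",
    s"to_utc_timestamp(from_utc_timestamp(cast(utc as timestamp), '$tz'), '$tz') AS roundtrip")
  .show(false)
// Only one of the two rows round-trips to its original value.
{code}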
[jira] [Commented] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351464#comment-17351464 ]

dc-heros commented on SPARK-30696:
----------------------------------

fromUTCtime and toUTCtime produce wrong results on days with Daylight Saving Time changes.
For example, in LA in 1960 the timezone switched from UTC-7h to UTC-8h at 2 AM on 1960-09-25, but the previous version has the cutoff at 8 AM.
Because of this, 1960-09-25 01:30:00 in LA can correspond to both 1960-09-25 08:30:00 and 1960-09-25 09:30:00 UTC, so these functions are only wrong around the cutoff time.
Could you edit the description [~maxgekk]

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> ---------------------------------------------------------------------------
[jira] [Assigned] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-30696:
------------------------------------
    Assignee:  (was: Apache Spark)

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> ---------------------------------------------------------------------------
[jira] [Assigned] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-30696:
------------------------------------
    Assignee: Apache Spark

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> ---------------------------------------------------------------------------
[jira] [Commented] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351463#comment-17351463 ]

Apache Spark commented on SPARK-30696:
--------------------------------------

User 'dgd-contributor' has created a pull request for this issue:
https://github.com/apache/spark/pull/32666

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> ---------------------------------------------------------------------------
[jira] [Assigned] (SPARK-35512) pyspark partitionBy may encounter 'OverflowError: cannot convert float infinity to integer'
[ https://issues.apache.org/jira/browse/SPARK-35512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35512:
------------------------------------
    Assignee: Apache Spark

> pyspark partitionBy may encounter 'OverflowError: cannot convert float infinity to integer'
> --------------------------------------------------------------------------------------------
>
> Key: SPARK-35512
> URL: https://issues.apache.org/jira/browse/SPARK-35512
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.2
> Reporter: nolan liu
> Assignee: Apache Spark
> Priority: Major
>
> h2. Code sample
> {code:python}
> # pyspark
> rdd = ...
> new_rdd = rdd.partitionBy(64)
> {code}
> An OverflowError is raised when the input file is big and the executor memory is not big enough.
> h2. Error information
> {code:java}
> TaskSetManager: Lost task 312.0 in stage 1.0 (TID 748, 11.4.137.5, executor 83): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/opt/spark3/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
>     process()
>   File "/opt/spark3/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
>     serializer.dump_stream(out_iter, outfile)
>   File "/opt/spark3/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
>     for obj in iterator:
>   File "/opt/spark3/python/lib/pyspark.zip/pyspark/rdd.py", line 1899, in add_shuffle_key
> OverflowError: cannot convert float infinity to integer
>     at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
>     at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)
>     at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
>     at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
>     at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>     at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1209)
>     at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1215)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:156)
>     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>     at org.apache.spark.scheduler.Task.run(Task.scala:130)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1420)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> h2. Spark code
> [https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L2072]
> {code:python}
> for k, v in iterator:
>     buckets[partitionFunc(k) % numPartitions].append((k, v))
>     c += 1
>
>     # check used memory and avg size of chunk of objects
>     if (c % 1000 == 0 and get_used_memory() > limit
>             or c > batch):
>         n, size = len(buckets), 0
>         for split in list(buckets.keys()):
>             yield pack_long(split)
>             d = outputSerializer.dumps(buckets[split])
>             del buckets[split]
>             yield d
>             size += len(d)
>
>         avg = int(size / n) >> 20
>         # let 1M < avg < 10M
>         if avg < 1:
>             batch *= 1.5
>         elif avg > 10:
>             batch = max(int(batch / 1.5), 1)
>         c = 0
> {code}
> h2. Explanation
> `batch` may grow to infinity while `get_used_memory() > limit` keeps being true, and then `int(batch / 1.5)` overflows in `max(int(batch / 1.5), 1)`.
[jira] [Assigned] (SPARK-35512) pyspark partitionBy may encounter 'OverflowError: cannot convert float infinity to integer'
[ https://issues.apache.org/jira/browse/SPARK-35512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35512:
------------------------------------
    Assignee:  (was: Apache Spark)

> pyspark partitionBy may encounter 'OverflowError: cannot convert float infinity to integer'
> --------------------------------------------------------------------------------------------
[jira] [Commented] (SPARK-35512) pyspark partitionBy may encounter 'OverflowError: cannot convert float infinity to integer'
[ https://issues.apache.org/jira/browse/SPARK-35512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351461#comment-17351461 ]

Apache Spark commented on SPARK-35512:
--------------------------------------

User 'nolanliou' has created a pull request for this issue:
https://github.com/apache/spark/pull/32667

> pyspark partitionBy may encounter 'OverflowError: cannot convert float infinity to integer'
> --------------------------------------------------------------------------------------------
[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index
[ https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Xian updated SPARK-34859:
----------------------------
    Description: updated (the full new text is the one quoted in the entries above)

was:
The current implementation has a problem: the pages returned by `readNextFilteredRowGroup` may not be aligned, so some columns may have more rows than others.
Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` with `rowIndexes` to make sure that rows are aligned.
Currently `VectorizedParquetRecordReader` doesn't have such synchronization among pages from different columns. Using `readNextFilteredRowGroup` may produce incorrect results.

> Vectorized parquet reader needs synchronization among pages for column index
> -----------------------------------------------------------------------------
[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index
[ https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Xian updated SPARK-34859: Attachment: part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet > Vectorized parquet reader needs synchronization among pages for column index > > > Key: SPARK-34859 > URL: https://issues.apache.org/jira/browse/SPARK-34859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Li Xian >Priority: Major > Attachments: > part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet > > > the current implementation has a problem. the pages returned by > `readNextFilteredRowGroup` may not be aligned, some columns may have more > rows than others. > Parquet is using `org.apache.parquet.column.impl.SynchronizingColumnReader` > with `rowIndexes` to make sure that rows are aligned. > Currently `VectorizedParquetRecordReader` doesn't have such synchronizing > among pages from different columns. Using `readNextFilteredRowGroup` may > result in incorrect result. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32195) Standardize warning types and messages
[ https://issues.apache.org/jira/browse/SPARK-32195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32195. -- Resolution: Done > Standardize warning types and messages > -- > > Key: SPARK-32195 > URL: https://issues.apache.org/jira/browse/SPARK-32195 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > > Currently PySpark uses a somewhat inconsistent warning type and message such > as UserWarning. We should standardize it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32194) Standardize exceptions in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32194. -- Fix Version/s: 3.2.0 Assignee: Hyukjin Kwon Resolution: Fixed Fixed in https://github.com/apache/spark/pull/32650 > Standardize exceptions in PySpark > - > > Key: SPARK-32194 > URL: https://issues.apache.org/jira/browse/SPARK-32194 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.2.0 > > > Currently, PySpark throws {{Exception}} or just {{RuntimeException}} in many > cases. We should standardize them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35496: Assignee: Apache Spark > Upgrade Scala 2.13 to 2.13.7 > > > Key: SPARK-35496 > URL: https://issues.apache.org/jira/browse/SPARK-35496 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > This issue aims to upgrade to Scala 2.13.7. > Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6). > However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 > which is different from both Scala 2.13.5 and Scala 3. > - https://github.com/scala/bug/issues/12403 > {code} > scala3-3.0.0:$ bin/scala > scala> Array.empty[Double].intersect(Array(0.0)) > val res0: Array[Double] = Array() > scala-2.13.6:$ bin/scala > Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292). > Type in expressions for evaluation. Or try :help. > scala> Array.empty[Double].intersect(Array(0.0)) > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D > ... 32 elided > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35496: Assignee: (was: Apache Spark) > Upgrade Scala 2.13 to 2.13.7 > > > Key: SPARK-35496 > URL: https://issues.apache.org/jira/browse/SPARK-35496 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Major > > This issue aims to upgrade to Scala 2.13.7. > Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6). > However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 > which is different from both Scala 2.13.5 and Scala 3. > - https://github.com/scala/bug/issues/12403 > {code} > scala3-3.0.0:$ bin/scala > scala> Array.empty[Double].intersect(Array(0.0)) > val res0: Array[Double] = Array() > scala-2.13.6:$ bin/scala > Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292). > Type in expressions for evaluation. Or try :help. > scala> Array.empty[Double].intersect(Array(0.0)) > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D > ... 32 elided > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35525) Define UDTs in schemas using string format
Julian Shalaby created SPARK-35525: -- Summary: Define UDTs in schemas using string format Key: SPARK-35525 URL: https://issues.apache.org/jira/browse/SPARK-35525 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.1 Reporter: Julian Shalaby In PySpark, where UDTs are public in 3.1.1 for example, you can define a schema using UDTs in the form: schema = StructType([StructField("Stuff", MyUDT())]), but the form schema = "Stuff MyUDT" does not work. UDTs are officially being made public again in 3.2.0 for Scala, so this issue is pretty important now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
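To make the two schema styles above concrete, here is a small, hedged sketch; the toy MyUDT below is only a stand-in so the snippet is self-contained, not the reporter's actual type.
{code:python}
from pyspark.sql.types import (DoubleType, StructField, StructType,
                               UserDefinedType)

class MyUDT(UserDefinedType):
    """Toy UDT wrapping a single double; real UDTs carry richer logic."""
    @classmethod
    def sqlType(cls):
        return DoubleType()

    @classmethod
    def module(cls):
        return "__main__"

    def serialize(self, obj):
        return float(obj)

    def deserialize(self, datum):
        return float(datum)

# Supported today: build the schema programmatically with a UDT instance.
schema = StructType([StructField("Stuff", MyUDT())])

# What this issue asks for: the DDL-string shorthand, which currently only
# resolves built-in type names and therefore cannot express a UDT.
# schema = "Stuff MyUDT"   # not accepted at the time of this report
{code}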
[jira] [Updated] (SPARK-35524) Pass objects as parameters to SparkSQL UDFs
[ https://issues.apache.org/jira/browse/SPARK-35524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julian Shalaby updated SPARK-35524: --- Shepherd: (was: Sean R. Owen) > Pass objects as parameters to SparkSQL UDFs > --- > > Key: SPARK-35524 > URL: https://issues.apache.org/jira/browse/SPARK-35524 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core, SQL >Affects Versions: 3.1.1 >Reporter: Julian Shalaby >Priority: Major > Labels: UDF, UDT, spark, spark-sql > > You can pass class objects directly to UDFs using the UDF format: > df.select("*").filter(myFunc(classObj)(col("colName"))) > but the format: > """SELECT * FROM view WHERE myFunc(classObj, "colName")""" > or > """SELECT * FROM view WHERE myFunc(classObj)("colName")""" > does not work. This would be a very useful feature to have, especially being > that UDTs are being made public again in 3.2.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35524) Pass objects as parameters to SparkSQL UDFs
Julian Shalaby created SPARK-35524: -- Summary: Pass objects as parameters to SparkSQL UDFs Key: SPARK-35524 URL: https://issues.apache.org/jira/browse/SPARK-35524 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Affects Versions: 3.1.1 Reporter: Julian Shalaby You can pass class objects directly to UDFs using the UDF format: df.select("*").filter(myFunc(classObj)(col("colName"))) but the format: """SELECT * FROM view WHERE myFunc(classObj, "colName")""" or """SELECT * FROM view WHERE myFunc(classObj)("colName")""" does not work. This would be a very useful feature to have, especially given that UDTs are being made public again in 3.2.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
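The working DataFrame-API form quoted above relies on a function that closes over the object and returns a UDF; the sketch below illustrates that pattern under assumed names (Threshold, my_func), while the SQL-string forms remain unsupported as the report says.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

class Threshold:
    """Illustrative class object the caller wants to hand to a UDF."""
    def __init__(self, limit):
        self.limit = limit

def my_func(obj):
    # The returned UDF captures `obj` in its Python closure, so the object
    # never has to pass through the SQL parser.
    return udf(lambda x: x > obj.limit, BooleanType())

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1,), (5,), (9,)], ["colName"])

# DataFrame API form from the report: works today.
df.select("*").filter(my_func(Threshold(4))(col("colName"))).show()

# SQL form from the report: a registered UDF only receives column values and
# literals, so there is currently no way to hand it the object itself.
# spark.sql('SELECT * FROM view WHERE myFunc(classObj, colName)')
{code}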
[jira] [Updated] (SPARK-35455) Enhance EliminateUnnecessaryJoin
[ https://issues.apache.org/jira/browse/SPARK-35455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-35455: -- Priority: Major (was: Minor) > Enhance EliminateUnnecessaryJoin > > > Key: SPARK-35455 > URL: https://issues.apache.org/jira/browse/SPARK-35455 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.2.0 > > > Make EliminateUnnecessaryJoin support to eliminate outer join and multi-join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35455) Enhance EliminateUnnecessaryJoin
[ https://issues.apache.org/jira/browse/SPARK-35455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-35455: -- Parent: SPARK-33828 Issue Type: Sub-task (was: Improvement) > Enhance EliminateUnnecessaryJoin > > > Key: SPARK-35455 > URL: https://issues.apache.org/jira/browse/SPARK-35455 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Fix For: 3.2.0 > > > Make EliminateUnnecessaryJoin support to eliminate outer join and multi-join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351426#comment-17351426 ] Xianghao Lu commented on SPARK-35332: - Great, thank you very much for your work [~ulysses] > Not Coalesce shuffle partitions when cache table > > > Key: SPARK-35332 > URL: https://issues.apache.org/jira/browse/SPARK-35332 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.1, 3.1.0, 3.1.1 > Environment: latest spark version >Reporter: Xianghao Lu >Assignee: XiDuo You >Priority: Major > Fix For: 3.2.0 > > Attachments: cacheTable.png > > > *How to reproduce the problem* > _linux shell command to prepare data:_ > for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > > data.text > _sql to reproduce the problem:_ > * create table data_table(id int, str string, num int) row format delimited > fields terminated by ','; > * load data local inpath '/path/to/data.text' into table data_table; > * CACHE TABLE test_cache_table AS > SELECT str > FROM > (SELECT id,str FROM data_table > )group by str; > Finally you will see a stage with 200 tasks and not coalesce shuffle > partitions, the problem will waste resource when data size is small. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351425#comment-17351425 ] Hyukjin Kwon commented on SPARK-35504: -- Thanks for confirmation and investigation man > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. > {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. > > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) 
> {code:sql} > select 'The first query' description, count(distinct *) unique_amt from > storage_datamart.olympiads > union all > select 'The second query', count(*) from (select distinct * from > storage_datamart.olympiads) > ; > {code} > > The result of the above query is the following: > {code:java} > The first query13437678 > The second query 36901430 > {code} > > I can easily calculate the unique number of rows in the table: > {code:sql} > select count(*) from ( > select student_id, olympiad_id, tour, grade > from storage_datamart.olympiads >group by student_id, olympiad_id, tour, grade > having count(*) = 1 > ) > ; -- Rows: 36901365 > {code} > > The table DDL is the following: > {code:sql} > CREATE TABLE `storage_datamart`.`olympiads` ( > `ptn_date` DATE, > `student_id` BIGINT, > `olympiad_id` STRING, > `grade` BIGINT, > `grade_type` STRING, > `tour` STRING, > `created_at` TIMESTAMP, > `created_at_local` TIMESTAMP, > `olympiad_num` BIGINT, > `olympiad_name` STRING, > `subject` STRING,
[jira] [Created] (SPARK-35523) Fix the default value properly in Data Source Options page
Haejoon Lee created SPARK-35523: --- Summary: Fix the default value properly in Data Source Options page Key: SPARK-35523 URL: https://issues.apache.org/jira/browse/SPARK-35523 Project: Spark Issue Type: Sub-task Components: docs Affects Versions: 3.2.0 Reporter: Haejoon Lee The default values in the Data Source Options page currently follow the Python API documentation, but we'd better follow the Scaladoc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35521) List Python 3.8 installed libraries in build_and_test workflow
[ https://issues.apache.org/jira/browse/SPARK-35521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35521. -- Resolution: Duplicate > List Python 3.8 installed libraries in build_and_test workflow > -- > > Key: SPARK-35521 > URL: https://issues.apache.org/jira/browse/SPARK-35521 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > In the build_and_test workflow, tests are ran against both Python 3.6 and > Python 3.8. However, only libraries installed in Python 3.6 are listed. We > should list Python 3.8's installed libraries as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35506) Run tests with Python 3.9 in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-35506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35506. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32657 [https://github.com/apache/spark/pull/32657] > Run tests with Python 3.9 in GitHub Actions > --- > > Key: SPARK-35506 > URL: https://issues.apache.org/jira/browse/SPARK-35506 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.2.0 > > > We're currently running PySpark tests with Python 3.8. We should run it with > Python 3.9 to verify the latest python support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35522) Introduce BinaryOps for BinaryType
[ https://issues.apache.org/jira/browse/SPARK-35522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35522: Assignee: (was: Apache Spark) > Introduce BinaryOps for BinaryType > -- > > Key: SPARK-35522 > URL: https://issues.apache.org/jira/browse/SPARK-35522 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > BinaryType, which represents byte sequence values in Spark, doesn't support > data-type-based operations yet. We are going to introduce BinaryOps for it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35522) Introduce BinaryOps for BinaryType
[ https://issues.apache.org/jira/browse/SPARK-35522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35522: Assignee: Apache Spark > Introduce BinaryOps for BinaryType > -- > > Key: SPARK-35522 > URL: https://issues.apache.org/jira/browse/SPARK-35522 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > BinaryType, which represents byte sequence values in Spark, doesn't support > data-type-based operations yet. We are going to introduce BinaryOps for it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35522) Introduce BinaryOps for BinaryType
[ https://issues.apache.org/jira/browse/SPARK-35522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351373#comment-17351373 ] Apache Spark commented on SPARK-35522: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/32665 > Introduce BinaryOps for BinaryType > -- > > Key: SPARK-35522 > URL: https://issues.apache.org/jira/browse/SPARK-35522 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > BinaryType, which represents byte sequence values in Spark, doesn't support > data-type-based operations yet. We are going to introduce BinaryOps for it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35522) Introduce BinaryOps for BinaryType
Xinrong Meng created SPARK-35522: Summary: Introduce BinaryOps for BinaryType Key: SPARK-35522 URL: https://issues.apache.org/jira/browse/SPARK-35522 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Xinrong Meng BinaryType, which represents byte sequence values in Spark, doesn't support data-type-based operations yet. We are going to introduce BinaryOps for it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35516) Storage UI tab Storage Level tool tip correction
[ https://issues.apache.org/jira/browse/SPARK-35516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351346#comment-17351346 ] Lidiya Nixon commented on SPARK-35516: -- I have raised a fix for this https://github.com/apache/spark/pull/32664 > Storage UI tab Storage Level tool tip correction > > > Key: SPARK-35516 > URL: https://issues.apache.org/jira/browse/SPARK-35516 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.1 >Reporter: jobit mathew >Priority: Trivial > > Storage UI tab Storage Level tool tip correction required. > || > | || > | || > | |Storage Level| > | || > please change *andreplication * to *and replication* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35516) Storage UI tab Storage Level tool tip correction
[ https://issues.apache.org/jira/browse/SPARK-35516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351344#comment-17351344 ] Apache Spark commented on SPARK-35516: -- User 'lidiyag' has created a pull request for this issue: https://github.com/apache/spark/pull/32664 > Storage UI tab Storage Level tool tip correction > > > Key: SPARK-35516 > URL: https://issues.apache.org/jira/browse/SPARK-35516 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.1 >Reporter: jobit mathew >Priority: Trivial > > Storage UI tab Storage Level tool tip correction required. > || > | || > | || > | |Storage Level| > | || > please change *andreplication * to *and replication* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35516) Storage UI tab Storage Level tool tip correction
[ https://issues.apache.org/jira/browse/SPARK-35516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35516: Assignee: Apache Spark > Storage UI tab Storage Level tool tip correction > > > Key: SPARK-35516 > URL: https://issues.apache.org/jira/browse/SPARK-35516 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.1 >Reporter: jobit mathew >Assignee: Apache Spark >Priority: Trivial > > Storage UI tab Storage Level tool tip correction required. > || > | || > | || > | |Storage Level| > | || > please change *andreplication * to *and replication* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35516) Storage UI tab Storage Level tool tip correction
[ https://issues.apache.org/jira/browse/SPARK-35516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35516: Assignee: (was: Apache Spark) > Storage UI tab Storage Level tool tip correction > > > Key: SPARK-35516 > URL: https://issues.apache.org/jira/browse/SPARK-35516 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.1 >Reporter: jobit mathew >Priority: Trivial > > Storage UI tab Storage Level tool tip correction required. > || > | || > | || > | |Storage Level| > | || > please change *andreplication * to *and replication* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35513) Upgrade joda-time to 2.10.10
[ https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35513. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32661 [https://github.com/apache/spark/pull/32661] > Upgrade joda-time to 2.10.10 > > > Key: SPARK-35513 > URL: https://issues.apache.org/jira/browse/SPARK-35513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Vinod KC >Assignee: Vinod KC >Priority: Minor > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35513) Upgrade joda-time to 2.10.10
[ https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-35513: - Assignee: Vinod KC > Upgrade joda-time to 2.10.10 > > > Key: SPARK-35513 > URL: https://issues.apache.org/jira/browse/SPARK-35513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Vinod KC >Assignee: Vinod KC >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35505) Remove APIs that have been deprecated in Koalas.
[ https://issues.apache.org/jira/browse/SPARK-35505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-35505. --- Fix Version/s: 3.2.0 Assignee: Takuya Ueshin Resolution: Fixed Issue resolved by pull request 32656 https://github.com/apache/spark/pull/32656 > Remove APIs that have been deprecated in Koalas. > > > Key: SPARK-35505 > URL: https://issues.apache.org/jira/browse/SPARK-35505 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.2.0 > > > There are some APIs that have been deprecated in Koalas. We shouldn't have > those in pandas APIs on Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh
[ https://issues.apache.org/jira/browse/SPARK-35514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35514: -- Fix Version/s: (was: 3.1.2) 3.1.3 > Automatically update version index of DocSearch via release-tag.sh > -- > > Key: SPARK-35514 > URL: https://issues.apache.org/jira/browse/SPARK-35514 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0, 3.1.3 > > > Automatically update version index of DocSearch via release-tag.sh for > releasing new documentation site, instead of the current manual update. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35521) List Python 3.8 installed libraries in build_and_test workflow
[ https://issues.apache.org/jira/browse/SPARK-35521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351218#comment-17351218 ] Apache Spark commented on SPARK-35521: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/32663 > List Python 3.8 installed libraries in build_and_test workflow > -- > > Key: SPARK-35521 > URL: https://issues.apache.org/jira/browse/SPARK-35521 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > In the build_and_test workflow, tests are ran against both Python 3.6 and > Python 3.8. However, only libraries installed in Python 3.6 are listed. We > should list Python 3.8's installed libraries as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35521) List Python 3.8 installed libraries in build_and_test workflow
[ https://issues.apache.org/jira/browse/SPARK-35521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35521: Assignee: Apache Spark > List Python 3.8 installed libraries in build_and_test workflow > -- > > Key: SPARK-35521 > URL: https://issues.apache.org/jira/browse/SPARK-35521 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > Fix For: 3.2.0 > > > In the build_and_test workflow, tests are ran against both Python 3.6 and > Python 3.8. However, only libraries installed in Python 3.6 are listed. We > should list Python 3.8's installed libraries as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35521) List Python 3.8 installed libraries in build_and_test workflow
[ https://issues.apache.org/jira/browse/SPARK-35521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35521: Assignee: (was: Apache Spark) > List Python 3.8 installed libraries in build_and_test workflow > -- > > Key: SPARK-35521 > URL: https://issues.apache.org/jira/browse/SPARK-35521 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > In the build_and_test workflow, tests are ran against both Python 3.6 and > Python 3.8. However, only libraries installed in Python 3.6 are listed. We > should list Python 3.8's installed libraries as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35341) Introduce ExtentionDtypeOps
[ https://issues.apache.org/jira/browse/SPARK-35341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35341: - Description: {{Now __and__, __or__, __rand__, and __ror__ are not data type based.}} So we would like to introduce these operators to the DataTypeOps class. extension_dtypes process these operators differently from the rest of the types. So we would also introduce ExtentionDtypeOps. ExtentionDtypeOps would be helpful for other data-type-based operations, for example, to/from pandas conversion as well. was: Now __and__, __or__, __rand__, and __ror__ are not data type based. So we would like to introduce __and__, __or__, __rand__, and __ror__ to each DataTypeOps subclass. extension_dtypes process __and__, __or__, __rand__, and __ror__ differently from the rest of types. So we would also introduce ExtentionDtypeOps. > Introduce ExtentionDtypeOps > --- > > Key: SPARK-35341 > URL: https://issues.apache.org/jira/browse/SPARK-35341 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > {{Now __and__, __or__, __rand__, and __ror__ are not data type > based.}} > So we would like to introduce these operators to the DataTypeOps class. > extension_dtypes process these operators differently from the rest of the > types. > So we would also introduce ExtentionDtypeOps. > ExtentionDtypeOps would be helpful for other data-type-based operations, for > example, to/from pandas conversion as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35521) List Python 3.8 installed libraries in build_and_test workflow
Xinrong Meng created SPARK-35521: Summary: List Python 3.8 installed libraries in build_and_test workflow Key: SPARK-35521 URL: https://issues.apache.org/jira/browse/SPARK-35521 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 3.2.0 Reporter: Xinrong Meng Fix For: 3.2.0 In the build_and_test workflow, tests are run against both Python 3.6 and Python 3.8. However, only libraries installed in Python 3.6 are listed. We should list Python 3.8's installed libraries as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
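As an illustration of the requested change (the real workflow does this with its own shell steps; the interpreter names below are assumptions about the CI image), listing installed packages per interpreter could look like:
{code:python}
# Dump installed packages for each Python the build_and_test job runs against.
import subprocess

for python in ("python3.6", "python3.8"):
    print(f"=== packages installed for {python} ===")
    try:
        subprocess.run([python, "-m", "pip", "list"], check=True)
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        print(f"could not list packages for {python}: {error}")
{code}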
[jira] [Updated] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35496: -- Description: This issue aims to upgrade to Scala 2.13.7. Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6). However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 which is different from both Scala 2.13.5 and Scala 3. - https://github.com/scala/bug/issues/12403 {code} scala3-3.0.0:$ bin/scala scala> Array.empty[Double].intersect(Array(0.0)) val res0: Array[Double] = Array() scala-2.13.6:$ bin/scala Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292). Type in expressions for evaluation. Or try :help. scala> Array.empty[Double].intersect(Array(0.0)) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D ... 32 elided {code} was: Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6) However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 which is different from both Scala 2.13.5 and Scala 3. - https://github.com/scala/bug/issues/12403 {code} scala3-3.0.0:$ bin/scala scala> Array.empty[Double].intersect(Array(0.0)) val res0: Array[Double] = Array() scala-2.13.6:$ bin/scala Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292). Type in expressions for evaluation. Or try :help. scala> Array.empty[Double].intersect(Array(0.0)) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D ... 32 elided {code} > Upgrade Scala 2.13 to 2.13.7 > > > Key: SPARK-35496 > URL: https://issues.apache.org/jira/browse/SPARK-35496 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Major > > This issue aims to upgrade to Scala 2.13.7. > Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6). > However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 > which is different from both Scala 2.13.5 and Scala 3. > - https://github.com/scala/bug/issues/12403 > {code} > scala3-3.0.0:$ bin/scala > scala> Array.empty[Double].intersect(Array(0.0)) > val res0: Array[Double] = Array() > scala-2.13.6:$ bin/scala > Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292). > Type in expressions for evaluation. Or try :help. > scala> Array.empty[Double].intersect(Array(0.0)) > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D > ... 32 elided > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35496: -- Description: Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6) However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 which is different from both Scala 2.13.5 and Scala 3. - https://github.com/scala/bug/issues/12403 {code} scala3-3.0.0:$ bin/scala scala> Array.empty[Double].intersect(Array(0.0)) val res0: Array[Double] = Array() scala-2.13.6:$ bin/scala Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292). Type in expressions for evaluation. Or try :help. scala> Array.empty[Double].intersect(Array(0.0)) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D ... 32 elided {code} was:Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6) > Upgrade Scala 2.13 to 2.13.7 > > > Key: SPARK-35496 > URL: https://issues.apache.org/jira/browse/SPARK-35496 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Major > > Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6) > However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 > which is different from both Scala 2.13.5 and Scala 3. > - https://github.com/scala/bug/issues/12403 > {code} > scala3-3.0.0:$ bin/scala > scala> Array.empty[Double].intersect(Array(0.0)) > val res0: Array[Double] = Array() > scala-2.13.6:$ bin/scala > Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292). > Type in expressions for evaluation. Or try :help. > scala> Array.empty[Double].intersect(Array(0.0)) > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D > ... 32 elided > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351187#comment-17351187 ] Dongjoon Hyun commented on SPARK-35496: --- Hi, [~LuciferYang]. Let's reuse this issue for Scala 2.13.7. > Upgrade Scala 2.13 from 2.13.5 to 2.13.6 > > > Key: SPARK-35496 > URL: https://issues.apache.org/jira/browse/SPARK-35496 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-35496: --- Assignee: (was: Apache Spark) > Upgrade Scala 2.13 from 2.13.5 to 2.13.6 > > > Key: SPARK-35496 > URL: https://issues.apache.org/jira/browse/SPARK-35496 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Major > > Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35496: -- Summary: Upgrade Scala 2.13 to 2.13.7 (was: Upgrade Scala 2.13 from 2.13.5 to 2.13.6) > Upgrade Scala 2.13 to 2.13.7 > > > Key: SPARK-35496 > URL: https://issues.apache.org/jira/browse/SPARK-35496 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Major > > Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh
[ https://issues.apache.org/jira/browse/SPARK-35514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-35514. Fix Version/s: 3.1.2 3.2.0 Resolution: Fixed Issue resolved by pull request 32662 [https://github.com/apache/spark/pull/32662] > Automatically update version index of DocSearch via release-tag.sh > -- > > Key: SPARK-35514 > URL: https://issues.apache.org/jira/browse/SPARK-35514 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0, 3.1.2 > > > Automatically update version index of DocSearch via release-tag.sh for > releasing new documentation site, instead of the current manual update. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35520) Spark-SQL test fails on IBM Z for certain config combinations.
[ https://issues.apache.org/jira/browse/SPARK-35520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simrit Kaur updated SPARK-35520: Description: Some queries of SQL related test cases: in-joins.sql, in-order-by.sql, not-in-group-by.sql and SubquerySuite.scala are failing with specific configuration combinations on IBM Z(s390x). For example: sql("select * from l where a = 6 and a not in (select c from r where c is not null)") query from SubquerySuite.scala fails for following config combinations: |enableNAAJ|enableAQE|enableCodegen| |TRUE|FALSE|FALSE| |TRUE|TRUE|FALSE| The above combination is also causing 2 other queries in in-joins.sql and in-order-by.sql failing. Another query: SELECT Count(*) FROM (SELECT * FROM t2 WHERE t2a NOT IN (SELECT t3a FROM t3 WHERE t3h != t2h)) t2 WHERE t2b NOT IN (SELECT Min(t2b) FROM t2 WHERE t2b = t2b GROUP BY t2c); from not-in-group-by.sql is failing for following combinations: |enableAQE|enableCodegen| |FALSE|TRUE| |FALSE|FALSE| These Test cases are not failing for 3.0.1 release and I believe might have been introduced with [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290] . There is another strange behaviour observed, if expected output is 1,3 , I am getting 1, 3, 9. If I update the Golden file to expect 1, 3, 9, the output will be 1, 3. was: Some queries of SQL related test cases: in-joins.sql, in-order-by.sql, not-in-group-by.sql and SubquerySuite.scala are failing with specific configuration combinations on IBM Z(s390x). For example: sql("select * from l where a = 6 and a not in (select c from r where c is not null)") query from SubquerySuite.scala fails for following config combinations: |enableNAAJ|enableAQE|enableCodegen| |TRUE|FALSE|FALSE| |TRUE|TRUE|FALSE| The above combination is also causing 2 other queries in in-joins.sql and in-order-by.sql failing. Another query: SELECT Count(*) FROM (SELECT * FROM t2 WHERE t2a NOT IN (SELECT t3a FROM t3 WHERE t3h != t2h)) t2 WHERE t2b NOT IN (SELECT Min(t2b) FROM t2 WHERE t2b = t2b GROUP BY t2c); from not-in-group-by.sql is failing for following combinations: |enableAQE|enableCodegen| |FALSE|TRUE| |FALSE|FALSE| These Test cases are not failing for 3.0.1 release and I believe might have been introduced with [#SPARK-32290] . There is another strange behaviour observed, if expected output is 1,3 , I am getting 1, 3, 9. If I update the Golden file to expect 1, 3, 9, the output will be 1, 3. > Spark-SQL test fails on IBM Z for certain config combinations. > -- > > Key: SPARK-35520 > URL: https://issues.apache.org/jira/browse/SPARK-35520 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.1.1 >Reporter: Simrit Kaur >Priority: Major > > Some queries of SQL related test cases: in-joins.sql, in-order-by.sql, > not-in-group-by.sql and SubquerySuite.scala are failing with specific > configuration combinations on IBM Z(s390x). > For example: > sql("select * from l where a = 6 and a not in (select c from r where c is not > null)") query from SubquerySuite.scala fails for following config > combinations: > |enableNAAJ|enableAQE|enableCodegen| > |TRUE|FALSE|FALSE| > |TRUE|TRUE|FALSE| > The above combination is also causing 2 other queries in in-joins.sql and > in-order-by.sql failing. 
> Another query: > SELECT Count(*) > FROM (SELECT * > FROM t2 > WHERE t2a NOT IN (SELECT t3a > FROM t3 > WHERE t3h != t2h)) t2 > WHERE t2b NOT IN (SELECT Min(t2b) > FROM t2 > WHERE t2b = t2b > GROUP BY t2c); > from not-in-group-by.sql is failing for following combinations: > |enableAQE|enableCodegen| > |FALSE|TRUE| > |FALSE|FALSE| > > These Test cases are not failing for 3.0.1 release and I believe might have > been introduced with > [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290] . > There is another strange behaviour observed, if expected output is 1,3 , I am > getting 1, 3, 9. If I update the Golden file to expect 1, 3, 9, the output > will be 1, 3. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35520) Spark-SQL test fails on IBM Z for certain config combinations.
Simrit Kaur created SPARK-35520: --- Summary: Spark-SQL test fails on IBM Z for certain config combinations. Key: SPARK-35520 URL: https://issues.apache.org/jira/browse/SPARK-35520 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 3.1.1 Reporter: Simrit Kaur Some queries from SQL-related test cases (in-joins.sql, in-order-by.sql, not-in-group-by.sql and SubquerySuite.scala) are failing with specific configuration combinations on IBM Z (s390x). For example, the sql("select * from l where a = 6 and a not in (select c from r where c is not null)") query from SubquerySuite.scala fails for the following config combinations: |enableNAAJ|enableAQE|enableCodegen| |TRUE|FALSE|FALSE| |TRUE|TRUE|FALSE| The same combinations also cause 2 other queries in in-joins.sql and in-order-by.sql to fail. Another query: SELECT Count(*) FROM (SELECT * FROM t2 WHERE t2a NOT IN (SELECT t3a FROM t3 WHERE t3h != t2h)) t2 WHERE t2b NOT IN (SELECT Min(t2b) FROM t2 WHERE t2b = t2b GROUP BY t2c); from not-in-group-by.sql is failing for the following combinations: |enableAQE|enableCodegen| |FALSE|TRUE| |FALSE|FALSE| These test cases are not failing for the 3.0.1 release, and I believe the regression might have been introduced with [#SPARK-32290]. There is another strange behaviour observed: if the expected output is 1, 3, I am getting 1, 3, 9; if I update the golden file to expect 1, 3, 9, the output will be 1, 3. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
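For anyone trying to reproduce the combinations above locally, the sketch below sweeps the three dimensions as session configs. The mapping of the report's column names onto SQL config keys (enableNAAJ -> spark.sql.optimizeNullAwareAntiJoin, enableAQE -> spark.sql.adaptive.enabled, enableCodegen -> spark.sql.codegen.wholeStage) is an assumption, not something stated in the report, and the tables l/r and t2/t3 must exist as they do in the referenced suites.
{code:python}
from itertools import product
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Assumed config keys for the three dimensions in the tables above.
CONFS = {
    "enableNAAJ": "spark.sql.optimizeNullAwareAntiJoin",
    "enableAQE": "spark.sql.adaptive.enabled",
    "enableCodegen": "spark.sql.codegen.wholeStage",
}

def run_with(naaj, aqe, codegen, sql):
    spark.conf.set(CONFS["enableNAAJ"], str(naaj).lower())
    spark.conf.set(CONFS["enableAQE"], str(aqe).lower())
    spark.conf.set(CONFS["enableCodegen"], str(codegen).lower())
    return spark.sql(sql).collect()

# Example sweep against the SubquerySuite statement quoted above.
query = ("select * from l where a = 6 and a not in "
         "(select c from r where c is not null)")
# for naaj, aqe, codegen in product([True, False], repeat=3):
#     print(naaj, aqe, codegen, run_with(naaj, aqe, codegen, query))
{code}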
[jira] [Commented] (SPARK-35517) Critical Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar
[ https://issues.apache.org/jira/browse/SPARK-35517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351134#comment-17351134 ] Vinod KC commented on SPARK-35517: -- [~ldeflandre], in Spark 3.2.0, SPARK-34784 upgraded the jackson-databind version to 2.12.2. > Critical Vulnerabilities: jackson-databind 2.4.0 shipped with > htrace-core4-4.1.0-incubating.jar > --- > > Key: SPARK-35517 > URL: https://issues.apache.org/jira/browse/SPARK-35517 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.2 >Reporter: Louis DEFLANDRE >Priority: Major > > Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in > {{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{jackson-databind}} > {{2.4.0}} : > * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489] > * > [CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718] > This package is shipped within {{jars/htrace-core4-4.1.0-incubating.jar}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35517) Critical Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar
[ https://issues.apache.org/jira/browse/SPARK-35517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Louis DEFLANDRE updated SPARK-35517: Description: Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in {{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{jackson-databind}} {{2.4.0}} : * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489] * [CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718] This package is shipped within {{jars/htrace-core4-4.1.0-incubating.jar}} was: Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in `spark-3.0.2-bin-hadoop3.2` coming from obsolete `jackson-databind` 2.4.0 : * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489] * [CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718] This package is shipped within `jars/htrace-core4-4.1.0-incubating.jar` > Critical Vulnerabilities: jackson-databind 2.4.0 shipped with > htrace-core4-4.1.0-incubating.jar > --- > > Key: SPARK-35517 > URL: https://issues.apache.org/jira/browse/SPARK-35517 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.2 >Reporter: Louis DEFLANDRE >Priority: Major > > Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in > {{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{jackson-databind}} > {{2.4.0}} : > * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489] > * > [CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718] > This package is shipped within {{jars/htrace-core4-4.1.0-incubating.jar}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35518) Critical Vulnerabilities: log4j_log4j 1.2.17 shipped
[ https://issues.apache.org/jira/browse/SPARK-35518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Louis DEFLANDRE updated SPARK-35518: Description: Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in {{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{log4j_log4j}} {{1.2.17}} : * [CVE-2019-17571|https://nvd.nist.gov/vuln/detail/CVE-2019-17571] This package is shipped within {{jars/log4j-1.2.17.jar}} was: Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in `spark-3.0.2-bin-hadoop3.2` coming from obsolete `log4j_log4j` `1.2.17` : * [CVE-2019-17571|https://nvd.nist.gov/vuln/detail/CVE-2019-17571] This package is shipped within `jars/log4j-1.2.17.jar` > Critical Vulnerabilities: log4j_log4j 1.2.17 shipped > > > Key: SPARK-35518 > URL: https://issues.apache.org/jira/browse/SPARK-35518 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.2 >Reporter: Louis DEFLANDRE >Priority: Major > > Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in > {{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{log4j_log4j}} {{1.2.17}} > : > * [CVE-2019-17571|https://nvd.nist.gov/vuln/detail/CVE-2019-17571] > This package is shipped within {{jars/log4j-1.2.17.jar}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35519) Critical Vulnerabilities: nimbusds_nimbus-jose-jwt 4.41.1 shipped
[ https://issues.apache.org/jira/browse/SPARK-35519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Louis DEFLANDRE updated SPARK-35519: Description: Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in {{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{nimbus-jose-jwt}} {{4.41.1}} : * [CVE-2019-17195|https://nvd.nist.gov/vuln/detail/CVE-2019-17195] This package is shipped within {{jars/nimbus-jose-jwt-4.41.1.jar}} was: Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in `spark-3.0.2-bin-hadoop3.2` coming from obsolete `nimbus-jose-jwt` `4.41.1` : * [CVE-2019-17195|https://nvd.nist.gov/vuln/detail/CVE-2019-17195] This package is shipped within `jars/nimbus-jose-jwt-4.41.1.jar` > Critical Vulnerabilities: nimbusds_nimbus-jose-jwt 4.41.1 shipped > - > > Key: SPARK-35519 > URL: https://issues.apache.org/jira/browse/SPARK-35519 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.2 >Reporter: Louis DEFLANDRE >Priority: Major > > Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in > {{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{nimbus-jose-jwt}} > {{4.41.1}} : > * [CVE-2019-17195|https://nvd.nist.gov/vuln/detail/CVE-2019-17195] > This package is shipped within {{jars/nimbus-jose-jwt-4.41.1.jar}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35519) Critical Vulnerabilities: nimbusds_nimbus-jose-jwt 4.41.1 shipped
Louis DEFLANDRE created SPARK-35519: --- Summary: Critical Vulnerabilities: nimbusds_nimbus-jose-jwt 4.41.1 shipped Key: SPARK-35519 URL: https://issues.apache.org/jira/browse/SPARK-35519 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.2 Reporter: Louis DEFLANDRE Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in `spark-3.0.2-bin-hadoop3.2` coming from obsolete `nimbus-jose-jwt` `4.41.1` : * [CVE-2019-17195|https://nvd.nist.gov/vuln/detail/CVE-2019-17195] This package is shipped within `jars/nimbus-jose-jwt-4.41.1.jar` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-10388) Public dataset loader interface
[ https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaurav Kumar updated SPARK-10388: - Comment: was deleted (was: I want to work on this issue [~mengxr], yet I am new to opensource. I would love to hear from you.) > Public dataset loader interface > --- > > Key: SPARK-10388 > URL: https://issues.apache.org/jira/browse/SPARK-10388 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Priority: Major > Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf > > > It is very useful to have a public dataset loader to fetch ML datasets from > popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, > requirements, and initial implementation. > {code} > val loader = new DatasetLoader(sqlContext) > val df = loader.get("libsvm", "rcv1_train.binary") > {code} > User should be able to list (or preview) datasets, e.g. > {code} > val datasets = loader.ls("libsvm") // returns a local DataFrame > datasets.show() // list all datasets under libsvm repo > {code} > It would be nice to allow 3rd-party packages to register new repos. Both the > API and implementation are pending discussion. Note that this requires http > and https support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35517) Critical Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar
[ https://issues.apache.org/jira/browse/SPARK-35517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Louis DEFLANDRE updated SPARK-35517: Summary: Critical Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar (was: Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar) > Critical Vulnerabilities: jackson-databind 2.4.0 shipped with > htrace-core4-4.1.0-incubating.jar > --- > > Key: SPARK-35517 > URL: https://issues.apache.org/jira/browse/SPARK-35517 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.2 >Reporter: Louis DEFLANDRE >Priority: Major > > Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in > `spark-3.0.2-bin-hadoop3.2` coming from obsolete `jackson-databind` 2.4.0 : > * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489] > * > [CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718] > This package is shipped within `jars/htrace-core4-4.1.0-incubating.jar` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35518) Critical Vulnerabilities: log4j_log4j 1.2.17 shipped
Louis DEFLANDRE created SPARK-35518: --- Summary: Critical Vulnerabilities: log4j_log4j 1.2.17 shipped Key: SPARK-35518 URL: https://issues.apache.org/jira/browse/SPARK-35518 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.2 Reporter: Louis DEFLANDRE The vulnerability scanner is highlighting the following CRITICAL vulnerability in `spark-3.0.2-bin-hadoop3.2`, coming from the obsolete `log4j_log4j` `1.2.17`: * [CVE-2019-17571|https://nvd.nist.gov/vuln/detail/CVE-2019-17571] This package is shipped within `jars/log4j-1.2.17.jar` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10388) Public dataset loader interface
[ https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351097#comment-17351097 ] Gaurav Kumar commented on SPARK-10388: -- I want to work on this issue [~mengxr], yet I am new to opensource. I would love to hear from you. > Public dataset loader interface > --- > > Key: SPARK-10388 > URL: https://issues.apache.org/jira/browse/SPARK-10388 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Priority: Major > Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf > > > It is very useful to have a public dataset loader to fetch ML datasets from > popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, > requirements, and initial implementation. > {code} > val loader = new DatasetLoader(sqlContext) > val df = loader.get("libsvm", "rcv1_train.binary") > {code} > User should be able to list (or preview) datasets, e.g. > {code} > val datasets = loader.ls("libsvm") // returns a local DataFrame > datasets.show() // list all datasets under libsvm repo > {code} > It would be nice to allow 3rd-party packages to register new repos. Both the > API and implementation are pending discussion. Note that this requires http > and https support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35517) Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar
Louis DEFLANDRE created SPARK-35517: --- Summary: Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar Key: SPARK-35517 URL: https://issues.apache.org/jira/browse/SPARK-35517 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.2 Reporter: Louis DEFLANDRE The vulnerability scanner is highlighting the following CRITICAL vulnerabilities in `spark-3.0.2-bin-hadoop3.2`, coming from the obsolete `jackson-databind` 2.4.0: * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489] * [CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718] This package is shipped within `jars/htrace-core4-4.1.0-incubating.jar` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35516) Storage UI tab Storage Level tool tip correction
jobit mathew created SPARK-35516: Summary: Storage UI tab Storage Level tool tip correction Key: SPARK-35516 URL: https://issues.apache.org/jira/browse/SPARK-35516 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.1.1 Reporter: jobit mathew The Storage Level tool tip in the Storage UI tab needs a correction: in the tool tip text, please change *andreplication* to *and replication*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35396) Support manual close/release of entries in MemoryStore and InMemoryRelation instead of relying on GC
[ https://issues.apache.org/jira/browse/SPARK-35396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-35396: - Issue Type: Improvement (was: New Feature) Priority: Minor (was: Major) > Support manual close/release of entries in MemoryStore and InMemoryRelation > instead of relying on GC > - > > Key: SPARK-35396 > URL: https://issues.apache.org/jira/browse/SPARK-35396 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Chendi.Xue >Assignee: Apache Spark >Priority: Minor > Fix For: 3.2.0 > > > This PR proposes an add-on to support manually closing entries in > MemoryStore and InMemoryRelation. > h3. What changes were proposed in this pull request? > Currently: > MemoryStore uses a LinkedHashMap[BlockId, MemoryEntry[_]] to store all OnHeap > or OffHeap entries. > And when memoryStore.remove(blockId) is called, the code simply removes one > entry from the LinkedHashMap and relies on Java GC to do the release work. > This PR: > We are proposing an add-on to manually close any object stored in MemoryStore > and InMemoryRelation if the object extends AutoCloseable. > Verification: > In our own use case, we implemented a user-defined off-heap-hashRelation for > BHJ, and we verified that by adding this manual close, we can make sure our > off-heap-hashRelation is released when evict is called. > Also, we implemented a user-defined cachedBatch that will be released when > InMemoryRelation.clearCache() is called by this PR. > h3. Why are the changes needed? > These changes help to clean up off-heap user-defined objects that may be > cached in InMemoryRelation or MemoryStore. > h3. Does this PR introduce _any_ user-facing change? > NO > h3. How was this patch tested? > WIP > Signed-off-by: Chendi Xue [chendi@intel.com|mailto:chendi@intel.com] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35396) Support manual close/release of entries in MemoryStore and InMemoryRelation instead of relying on GC
[ https://issues.apache.org/jira/browse/SPARK-35396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-35396: Assignee: Apache Spark > Support manual close/release of entries in MemoryStore and InMemoryRelation > instead of relying on GC > - > > Key: SPARK-35396 > URL: https://issues.apache.org/jira/browse/SPARK-35396 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Chendi.Xue >Assignee: Apache Spark >Priority: Major > > This PR proposes an add-on to support manually closing entries in > MemoryStore and InMemoryRelation. > h3. What changes were proposed in this pull request? > Currently: > MemoryStore uses a LinkedHashMap[BlockId, MemoryEntry[_]] to store all OnHeap > or OffHeap entries. > And when memoryStore.remove(blockId) is called, the code simply removes one > entry from the LinkedHashMap and relies on Java GC to do the release work. > This PR: > We are proposing an add-on to manually close any object stored in MemoryStore > and InMemoryRelation if the object extends AutoCloseable. > Verification: > In our own use case, we implemented a user-defined off-heap-hashRelation for > BHJ, and we verified that by adding this manual close, we can make sure our > off-heap-hashRelation is released when evict is called. > Also, we implemented a user-defined cachedBatch that will be released when > InMemoryRelation.clearCache() is called by this PR. > h3. Why are the changes needed? > These changes help to clean up off-heap user-defined objects that may be > cached in InMemoryRelation or MemoryStore. > h3. Does this PR introduce _any_ user-facing change? > NO > h3. How was this patch tested? > WIP > Signed-off-by: Chendi Xue [chendi@intel.com|mailto:chendi@intel.com] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35396) Support manual close/release of entries in MemoryStore and InMemoryRelation instead of relying on GC
[ https://issues.apache.org/jira/browse/SPARK-35396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-35396. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32534 [https://github.com/apache/spark/pull/32534] > Support manual close/release of entries in MemoryStore and InMemoryRelation > instead of relying on GC > - > > Key: SPARK-35396 > URL: https://issues.apache.org/jira/browse/SPARK-35396 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Chendi.Xue >Assignee: Apache Spark >Priority: Major > Fix For: 3.2.0 > > > This PR proposes an add-on to support manually closing entries in > MemoryStore and InMemoryRelation. > h3. What changes were proposed in this pull request? > Currently: > MemoryStore uses a LinkedHashMap[BlockId, MemoryEntry[_]] to store all OnHeap > or OffHeap entries. > And when memoryStore.remove(blockId) is called, the code simply removes one > entry from the LinkedHashMap and relies on Java GC to do the release work. > This PR: > We are proposing an add-on to manually close any object stored in MemoryStore > and InMemoryRelation if the object extends AutoCloseable. > Verification: > In our own use case, we implemented a user-defined off-heap-hashRelation for > BHJ, and we verified that by adding this manual close, we can make sure our > off-heap-hashRelation is released when evict is called. > Also, we implemented a user-defined cachedBatch that will be released when > InMemoryRelation.clearCache() is called by this PR. > h3. Why are the changes needed? > These changes help to clean up off-heap user-defined objects that may be > cached in InMemoryRelation or MemoryStore. > h3. Does this PR introduce _any_ user-facing change? > NO > h3. How was this patch tested? > WIP > Signed-off-by: Chendi Xue [chendi@intel.com|mailto:chendi@intel.com] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
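To make the proposal in SPARK-35396 concrete, here is a minimal Scala sketch, not the actual Spark implementation, of a store that eagerly closes evicted values when they implement AutoCloseable instead of relying on GC. The names ManualCloseStore, OffHeapRelation and evict are hypothetical and used only for illustration.
{code:scala}
import scala.collection.mutable

// Hypothetical illustration of the idea in SPARK-35396: close entries
// explicitly on eviction when they own off-heap resources.
class ManualCloseStore[K, V] {
  private val entries = mutable.LinkedHashMap.empty[K, V]

  def put(key: K, value: V): Unit = entries.put(key, value)

  // Remove the entry and, if it owns native/off-heap resources,
  // close it eagerly instead of waiting for the garbage collector.
  def evict(key: K): Unit = entries.remove(key).foreach {
    case c: AutoCloseable => c.close()
    case _ => // plain on-heap value: nothing to do, GC will reclaim it
  }

  // Close everything, roughly what InMemoryRelation.clearCache() would trigger.
  def clear(): Unit = {
    entries.values.foreach {
      case c: AutoCloseable => c.close()
      case _ =>
    }
    entries.clear()
  }
}

// Example: an off-heap-backed value (e.g. a hash relation for BHJ)
// that must be released explicitly.
class OffHeapRelation extends AutoCloseable {
  override def close(): Unit = println("releasing off-heap memory")
}

object ManualCloseDemo extends App {
  val store = new ManualCloseStore[String, AnyRef]
  store.put("bhj-relation", new OffHeapRelation)
  store.evict("bhj-relation") // prints "releasing off-heap memory"
}
{code}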
[jira] [Assigned] (SPARK-35447) optimize skew join before coalescing shuffle partitions
[ https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35447: --- Assignee: Wenchen Fan > optimize skew join before coalescing shuffle partitions > --- > > Key: SPARK-35447 > URL: https://issues.apache.org/jira/browse/SPARK-35447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35447) optimize skew join before coalescing shuffle partitions
[ https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35447. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32594 [https://github.com/apache/spark/pull/32594] > optimize skew join before coalescing shuffle partitions > --- > > Key: SPARK-35447 > URL: https://issues.apache.org/jira/browse/SPARK-35447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29223) Kafka source: offset by timestamp - allow specifying timestamp for "all partitions"
[ https://issues.apache.org/jira/browse/SPARK-29223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-29223. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32609 [https://github.com/apache/spark/pull/32609] > Kafka source: offset by timestamp - allow specifying timestamp for "all > partitions" > --- > > Key: SPARK-29223 > URL: https://issues.apache.org/jira/browse/SPARK-29223 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.2.0 > > > This issue is a follow-up of SPARK-26848. > In SPARK-26848, we decided to open up the possibility of letting end users set > an individual timestamp per partition. But in many cases, specifying a timestamp > represents the intention to go back to a specific timestamp > and reprocess records, which should be applied to all topics and partitions. > Given the format of > `startingOffsetsByTimestamp`/`endingOffsetsByTimestamp`, while it's not > intuitive to provide an option to set a global timestamp across topics, it is > still intuitive to provide an option to set a global timestamp across all > partitions in a topic. > This issue tracks the effort to deal with this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29223) Kafka source: offset by timestamp - allow specifying timestamp for "all partitions"
[ https://issues.apache.org/jira/browse/SPARK-29223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-29223: Assignee: Jungtaek Lim > Kafka source: offset by timestamp - allow specifying timestamp for "all > partitions" > --- > > Key: SPARK-29223 > URL: https://issues.apache.org/jira/browse/SPARK-29223 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > > This issue is a follow-up of SPARK-26848. > In SPARK-26848, we decided to open up the possibility of letting end users set > an individual timestamp per partition. But in many cases, specifying a timestamp > represents the intention to go back to a specific timestamp > and reprocess records, which should be applied to all topics and partitions. > Given the format of > `startingOffsetsByTimestamp`/`endingOffsetsByTimestamp`, while it's not > intuitive to provide an option to set a global timestamp across topics, it is > still intuitive to provide an option to set a global timestamp across all > partitions in a topic. > This issue tracks the effort to deal with this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
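For context on SPARK-29223, the sketch below contrasts the existing per-partition form of `startingOffsetsByTimestamp` with the kind of single global timestamp the ticket asks for. It assumes the Kafka connector is on the classpath; the broker address, topic name, and timestamps are placeholders, and the global option name `startingTimestamp` is only illustrative of the proposal; the final option name is defined by the linked pull request.
{code:scala}
import org.apache.spark.sql.SparkSession

object KafkaTimestampOffsets {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-ts-offsets").getOrCreate()

    // Existing behaviour (SPARK-26848): one timestamp per topic/partition,
    // encoded as a JSON string; every partition must be listed explicitly.
    val perPartition = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")            // placeholder broker
      .option("subscribe", "events")                                // placeholder topic
      .option("startingOffsetsByTimestamp",
        """{"events": {"0": 1621900000000, "1": 1621900000000}}""")
      .load()

    // What this ticket proposes: one timestamp applied to all partitions.
    // The option name below is illustrative only.
    val allPartitions = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("startingTimestamp", "1621900000000")
      .load()

    perPartition.printSchema()
    allPartitions.printSchema()
  }
}
{code}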
[jira] [Resolved] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolay Sokolov resolved SPARK-35504. - Resolution: Fixed I could not fully comprehend what was written in the documentation. Helped to figure it out. > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. > {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. > > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) 
> {code:sql} > select 'The first query' description, count(distinct *) unique_amt from > storage_datamart.olympiads > union all > select 'The second query', count(*) from (select distinct * from > storage_datamart.olympiads) > ; > {code} > > The result of the above query is the following: > {code:java} > The first query13437678 > The second query 36901430 > {code} > > I can easily calculate the unique number of rows in the table: > {code:sql} > select count(*) from ( > select student_id, olympiad_id, tour, grade > from storage_datamart.olympiads >group by student_id, olympiad_id, tour, grade > having count(*) = 1 > ) > ; -- Rows: 36901365 > {code} > > The table DDL is the following: > {code:sql} > CREATE TABLE `storage_datamart`.`olympiads` ( > `ptn_date` DATE, > `student_id` BIGINT, > `olympiad_id` STRING, > `grade` BIGINT, > `grade_type` STRING, > `tour` STRING, > `created_at` TIMESTAMP, > `created_at_local` TIMESTAMP, > `olympiad_num` BIGINT, >
[jira] [Comment Edited] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351018#comment-17351018 ] Nikolay Sokolov edited comment on SPARK-35504 at 5/25/21, 12:35 PM: Subtracted from the number of all rows of the table the number of rows containing NULL values in at least one column and got what I was looking for: {code:sql} select (select count(1) amt from storage_datamart.olympiads) - ( select count(1) from storage_datamart.olympiads where ptn_date is null or student_id is null or olympiad_id is null or grade is null or grade_type is null or tour is null or created_at is null or created_at_local is null or olympiad_num is null or olympiad_name is null or subject is null or started_at is null or ended_at is null or region_id is null or region_name is null or municipality_name is null or school_id is null or school_name is null or school_status is null or oly_n_common is null or num_day is null or award_type is null or new_student_legacy is null or segment is null or total_start is null or total_end is null or year_learn is null or parent_id is null or teacher_id is null or parallel is null or olympiad_type is null ) ; -- 13437678 {code} {code:sql} select amt - 23463820 from ( select count(1) amt from storage_datamart.olympiads ) ; -- 13437678 {code} This is a feature that is documented. I apologize. I'll close this task. Thank you! was (Author: melchizedek13): Subtracted from the number of all rows of the table the number of rows containing NULL values in at least one column and got what I was looking for: {code:sql} select (select count(1) amt from storage_datamart.olympiads) - ( select count(1) from storage_datamart.olympiads where ptn_date is null or student_id is null or olympiad_id is null or grade is null or grade_type is null or tour is null or created_at is null or created_at_local is null or olympiad_num is null or olympiad_name is null or subject is null or started_at is null or ended_at is null or region_id is null or region_name is null or municipality_name is null or school_id is null or school_name is null or school_status is null or oly_n_common is null or num_day is null or award_type is null or new_student_legacy is null or segment is null or total_start is null or total_end is null or year_learn is null or parent_id is null or teacher_id is null or parallel is null or olympiad_type is null ) ; -- 13437678 {code} {code:sql} select amt - 23463820 from ( select count(1) amt from storage_datamart.olympiads ) ; -- 13437678 {code} Apparently this is a feature that is not documented. I'll wait a day and close this task. > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. 
> Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. > {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx
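The behaviour resolved in SPARK-35504 can be reproduced on a toy table. The Scala sketch below (local mode, hypothetical table name `t`) shows why the two counts differ: `count(DISTINCT *)` expands to `count(DISTINCT col1, col2, ...)`, which skips any row where one of the arguments is NULL, while `SELECT DISTINCT *` keeps such rows and `count(*)` then counts them.
{code:scala}
import org.apache.spark.sql.SparkSession

object CountDistinctNulls {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("count-distinct-asterisk")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Two distinct rows; the second one has a NULL column.
    Seq((Option(1), Option("a")), (Option(2), Option.empty[String]))
      .toDF("id", "name")
      .createOrReplaceTempView("t")

    // count(DISTINCT id, name) ignores rows with a NULL argument -> 1
    spark.sql("SELECT count(DISTINCT *) AS unique_amt FROM t").show()

    // DISTINCT keeps the NULL row, and count(*) counts it -> 2
    spark.sql("SELECT count(*) FROM (SELECT DISTINCT * FROM t)").show()
  }
}
{code}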
[jira] [Assigned] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh
[ https://issues.apache.org/jira/browse/SPARK-35514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35514: Assignee: Apache Spark (was: Gengliang Wang) > Automatically update version index of DocSearch via release-tag.sh > -- > > Key: SPARK-35514 > URL: https://issues.apache.org/jira/browse/SPARK-35514 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Automatically update version index of DocSearch via release-tag.sh for > releasing new documentation site, instead of the current manual update. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh
[ https://issues.apache.org/jira/browse/SPARK-35514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351021#comment-17351021 ] Apache Spark commented on SPARK-35514: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32662 > Automatically update version index of DocSearch via release-tag.sh > -- > > Key: SPARK-35514 > URL: https://issues.apache.org/jira/browse/SPARK-35514 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Automatically update version index of DocSearch via release-tag.sh for > releasing new documentation site, instead of the current manual update. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh
[ https://issues.apache.org/jira/browse/SPARK-35514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35514: Assignee: Gengliang Wang (was: Apache Spark) > Automatically update version index of DocSearch via release-tag.sh > -- > > Key: SPARK-35514 > URL: https://issues.apache.org/jira/browse/SPARK-35514 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Automatically update version index of DocSearch via release-tag.sh for > releasing new documentation site, instead of the current manual update. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351018#comment-17351018 ] Nikolay Sokolov commented on SPARK-35504: - Subtracted from the number of all rows of the table the number of rows containing NULL values in at least one column and got what I was looking for: {code:sql} select (select count(1) amt from storage_datamart.olympiads) - ( select count(1) from storage_datamart.olympiads where ptn_date is null or student_id is null or olympiad_id is null or grade is null or grade_type is null or tour is null or created_at is null or created_at_local is null or olympiad_num is null or olympiad_name is null or subject is null or started_at is null or ended_at is null or region_id is null or region_name is null or municipality_name is null or school_id is null or school_name is null or school_status is null or oly_n_common is null or num_day is null or award_type is null or new_student_legacy is null or segment is null or total_start is null or total_end is null or year_learn is null or parent_id is null or teacher_id is null or parallel is null or olympiad_type is null ) ; -- 13437678 {code} {code:sql} select amt - 23463820 from ( select count(1) amt from storage_datamart.olympiads ) ; -- 13437678 {code} Apparently this is a feature that is not documented. I'll wait a day and close this task. > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. 
> {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. > > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows:
[jira] [Commented] (SPARK-35515) TimestampType: OverflowError: mktime argument out of range
[ https://issues.apache.org/jira/browse/SPARK-35515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351015#comment-17351015 ] Martin Studer commented on SPARK-35515: --- I'm happy to provide a PR if this seems like a sensible improvement. > TimestampType: OverflowError: mktime argument out of range > --- > > Key: SPARK-35515 > URL: https://issues.apache.org/jira/browse/SPARK-35515 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.1 >Reporter: Martin Studer >Priority: Minor > > This issue occurs, for example, when trying to create a data frame from > Python {{datetime}} objects that are "out of range" where "out of range" is > platform-dependent due to the use of > [{{time.mktime}}|https://docs.python.org/3/library/time.html#time.mktime] in > {{TimestampType.toInternal}}: > {code} > import datetime > spark_session.createDataFrame([(datetime.datetime(, 12, 31, 0, 0),)]) > {code} > A more direct way to reproduce the issue is by invoking > {{TimestampType.toInternal}} directly: > {code} > import datetime > from pyspark.sql.types import TimestampType > dt = datetime.datetime(, 12, 31, 0, 0) > TimestampType().toInternal(dt) > {code} > The suggested improvement is to avoid using {{time.mktime}} to increase the > range of {{datetime}} values. A possible implementation may look as follows: > {code} > import datetime > import pytz > EPOCH_UTC = datetime.datetime(1970, 1, 1).replace(tzinfo=pytz.utc) > LOCAL_TZ = datetime.datetime.now().astimezone().tzinfo > def toInternal(dt): > if dt is not None: > dt = dt if dt.tzinfo else dt.replace(tzinfo=LOCAL_TZ) > dt_utc = dt.astimezone(pytz.utc) > td = dt_utc - EPOCH_UTC > return (td.days * 86400 + td.seconds) * 10 ** 6 + > td.microseconds > {code} > This relies on the ability to derive the local timezone. Other mechanisms may > be used to what is suggested above. > Test cases include: > {code} > dt1 = datetime.datetime(2021, 5, 25, 12, 23) > dt2 = dt1.replace(tzinfo=pytz.timezone('Europe/Zurich')) > dt3 = datetime.datetime(, 12, 31, 0, 0) > dt4 = dt3.replace(tzinfo=pytz.timezone('Europe/Zurich')) > toInternal(dt1) == TimestampType().toInternal(dt1) > toInternal(dt2) == TimestampType().toInternal(dt2) > toInternal(dt3) # TimestampType().toInternal(dt3) fails > toInternal(dt4) == TimestampType().toInternal(dt4) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33121) Spark Streaming 3.1.1 hangs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351012#comment-17351012 ] Dmitry Tverdokhleb commented on SPARK-33121: L. C. Hsieh, have you tested this case with sending SIGTERM signal when "for each" operation entered in sleeping mode? > Spark Streaming 3.1.1 hangs on shutdown > --- > > Key: SPARK-33121 > URL: https://issues.apache.org/jira/browse/SPARK-33121 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 3.1.1 >Reporter: Dmitry Tverdokhleb >Priority: Major > Labels: Streaming, hang, shutdown > > Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem > in graceful shutdown. > Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". > Here is the code: > {code:java} > inputStream.foreachRDD { > rdd => > rdd.foreachPartition { > Thread.sleep(5000) > } > } > {code} > I send a SIGTERM signal to stop the spark streaming and after sleeping an > exception arises: > {noformat} > streaming-agg-tds-data_1 | java.util.concurrent.RejectedExecutionException: > Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from > java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, > active threads = 0, queued tasks = 0, completed tasks = 1] > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) > streaming-agg-tds-data_1 | at > org.apache.spark.executor.Executor.launchTask(Executor.scala:270) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) > streaming-agg-tds-data_1 | at > scala.collection.Iterator.foreach(Iterator.scala:941) > streaming-agg-tds-data_1 | at > scala.collection.Iterator.foreach$(Iterator.scala:941) > streaming-agg-tds-data_1 | at > scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > streaming-agg-tds-data_1 | at > scala.collection.IterableLike.foreach(IterableLike.scala:74) > streaming-agg-tds-data_1 | at > scala.collection.IterableLike.foreach$(IterableLike.scala:73) > streaming-agg-tds-data_1 | at > scala.collection.AbstractIterable.foreach(Iterable.scala:56) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > 
streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > streaming-agg-tds-data_1 | at java.lang.Thread.run(Thread.java:748) > streaming-agg-tds-data_1 | 2021-04-22 13:33:41 WARN JobGenerator - Timed > out while stopping the job generator (timeout = 1) > streaming-agg-tds-data_1 | 2021-04-22 13:33:41 INFO JobGenerator - Waited > for jobs to be processed and checkpoints to be written > streaming-agg-tds-data_1 | 2021-04-22 13:33:41 INFO JobGenerator - Stopped > JobGenerator{noformat} > After this exception and "JobGenerator - Stopped JobGenerator" log, streaming > freezes, and halts by timeout (Config parameter > "hadoop.service.shutdown.timeout"). > Besides, there is no problem with the graceful shutdown in spark 2.4.5. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
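For readers reproducing SPARK-33121, here is a minimal Scala sketch of the kind of job described in the report; the input source, batch interval, sleep duration, and app name are placeholders. With spark.streaming.stopGracefullyOnShutdown=true, sending SIGTERM is expected to run the shutdown hook, which stops the StreamingContext gracefully and waits for in-flight batches; this is the behaviour the ticket reports as hanging in 3.1.1.
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("graceful-shutdown-demo")
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Stand-in source; in the report this is a real input stream.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      rdd.foreachPartition { _ =>
        // Simulate slow per-partition work, as in the reported job.
        Thread.sleep(5000)
      }
    }

    ssc.start()
    ssc.awaitTermination()
    // On SIGTERM, the shutdown hook should call
    // ssc.stop(stopSparkContext = true, stopGracefully = true)
    // and wait for running batches to complete.
  }
}
{code}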
[jira] [Created] (SPARK-35515) TimestampType: OverflowError: mktime argument out of range
Martin Studer created SPARK-35515: - Summary: TimestampType: OverflowError: mktime argument out of range Key: SPARK-35515 URL: https://issues.apache.org/jira/browse/SPARK-35515 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.1 Reporter: Martin Studer This issue occurs, for example, when trying to create a data frame from Python {{datetime}} objects that are "out of range" where "out of range" is platform-dependent due to the use of [{{time.mktime}}|https://docs.python.org/3/library/time.html#time.mktime] in {{TimestampType.toInternal}}: {code} import datetime spark_session.createDataFrame([(datetime.datetime(, 12, 31, 0, 0),)]) {code} A more direct way to reproduce the issue is by invoking {{TimestampType.toInternal}} directly: {code} import datetime from pyspark.sql.types import TimestampType dt = datetime.datetime(, 12, 31, 0, 0) TimestampType().toInternal(dt) {code} The suggested improvement is to avoid using {{time.mktime}} to increase the range of {{datetime}} values. A possible implementation may look as follows: {code} import datetime import pytz EPOCH_UTC = datetime.datetime(1970, 1, 1).replace(tzinfo=pytz.utc) LOCAL_TZ = datetime.datetime.now().astimezone().tzinfo def toInternal(dt): if dt is not None: dt = dt if dt.tzinfo else dt.replace(tzinfo=LOCAL_TZ) dt_utc = dt.astimezone(pytz.utc) td = dt_utc - EPOCH_UTC return (td.days * 86400 + td.seconds) * 10 ** 6 + td.microseconds {code} This relies on the ability to derive the local timezone. Other mechanisms may be used to what is suggested above. Test cases include: {code} dt1 = datetime.datetime(2021, 5, 25, 12, 23) dt2 = dt1.replace(tzinfo=pytz.timezone('Europe/Zurich')) dt3 = datetime.datetime(, 12, 31, 0, 0) dt4 = dt3.replace(tzinfo=pytz.timezone('Europe/Zurich')) toInternal(dt1) == TimestampType().toInternal(dt1) toInternal(dt2) == TimestampType().toInternal(dt2) toInternal(dt3) # TimestampType().toInternal(dt3) fails toInternal(dt4) == TimestampType().toInternal(dt4) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351010#comment-17351010 ] Nikolay Sokolov commented on SPARK-35504: - It's really close to true: {code:sql} select 36901430 - 23463820 -- 13437610 {code} [~hyukjin.kwon] thank you! > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. > {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. > > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) 
> {code:sql} > select 'The first query' description, count(distinct *) unique_amt from > storage_datamart.olympiads > union all > select 'The second query', count(*) from (select distinct * from > storage_datamart.olympiads) > ; > {code} > > The result of the above query is the following: > {code:java} > The first query13437678 > The second query 36901430 > {code} > > I can easily calculate the unique number of rows in the table: > {code:sql} > select count(*) from ( > select student_id, olympiad_id, tour, grade > from storage_datamart.olympiads >group by student_id, olympiad_id, tour, grade > having count(*) = 1 > ) > ; -- Rows: 36901365 > {code} > > The table DDL is the following: > {code:sql} > CREATE TABLE `storage_datamart`.`olympiads` ( > `ptn_date` DATE, > `student_id` BIGINT, > `olympiad_id` STRING, > `grade` BIGINT, > `grade_type` STRING, > `tour` STRING, > `created_at` TIMESTAMP, > `created_at_local` TIMESTAMP, >
[jira] [Comment Edited] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351008#comment-17351008 ] Nikolay Sokolov edited comment on SPARK-35504 at 5/25/21, 11:53 AM: [~hyukjin.kwon] thanks for the hint! I've just counted any nulls column's value by using the following script: {code:sql} select count(1) from storage_datamart.olympiads where ptn_date is null or student_id is null or olympiad_id is null or grade is null or grade_type is null or tour is null or created_at is null or created_at_local is null or olympiad_num is null or olympiad_name is null or subject is null or started_at is null or ended_at is null or region_id is null or region_name is null or municipality_name is null or school_id is null or school_name is null or school_status is null or oly_n_common is null or num_day is null or award_type is null or new_student_legacy is null or segment is null or total_start is null or total_end is null or year_learn is null or parent_id is null or teacher_id is null or parallel is null or olympiad_type is null ; {code} I've got 23463820 rows. was (Author: melchizedek13): [~hyukjin.kwon] thanks for the hint! I've just counted any nulls column's value by using the following script: {code:sql} select count(1) from storage_datamart.olympiads where ptn_date is null or student_id is null or olympiad_id is null or grade is null or grade_type is null or tour is null or created_at is null or created_at_local is null or olympiad_num is null or olympiad_name is null or subject is null or started_at is null or ended_at is null or region_id is null or region_name is null or municipality_name is null or school_id is null or school_name is null or school_status is null or oly_n_common is null or num_day is null or award_type is null or new_student_legacy is null or segment is null or total_start is null or total_end is null or year_learn is null or parent_id is null or teacher_id is null or parallel is null or olympiad_type is null ; {code} I've got 23463820 rows. This value differs from 13437678 & 36901430. > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. 
> {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments:
[jira] [Created] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh
Gengliang Wang created SPARK-35514: -- Summary: Automatically update version index of DocSearch via release-tag.sh Key: SPARK-35514 URL: https://issues.apache.org/jira/browse/SPARK-35514 Project: Spark Issue Type: Task Components: Project Infra Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Automatically update version index of DocSearch via release-tag.sh for releasing new documentation site, instead of the current manual update. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351008#comment-17351008 ] Nikolay Sokolov commented on SPARK-35504: - [~hyukjin.kwon] thanks for the hint! I've just counted any nulls column's value by using the following script: {code:sql} select count(1) from storage_datamart.olympiads where ptn_date is null or student_id is null or olympiad_id is null or grade is null or grade_type is null or tour is null or created_at is null or created_at_local is null or olympiad_num is null or olympiad_name is null or subject is null or started_at is null or ended_at is null or region_id is null or region_name is null or municipality_name is null or school_id is null or school_name is null or school_status is null or oly_n_common is null or num_day is null or award_type is null or new_student_legacy is null or segment is null or total_start is null or total_end is null or year_learn is null or parent_id is null or teacher_id is null or parallel is null or olympiad_type is null ; {code} I've got 23463820 rows. This value differs from 13437678 & 36901430. > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. > {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! 
> > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. > > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) > {code:sql} > select
[jira] [Commented] (SPARK-35513) Upgrade joda-time to 2.10.10
[ https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351004#comment-17351004 ] Apache Spark commented on SPARK-35513: -- User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/32661 > Upgrade joda-time to 2.10.10 > > > Key: SPARK-35513 > URL: https://issues.apache.org/jira/browse/SPARK-35513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Vinod KC >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35513) Upgrade joda-time to 2.10.10
[ https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35513: Assignee: (was: Apache Spark) > Upgrade joda-time to 2.10.10 > > > Key: SPARK-35513 > URL: https://issues.apache.org/jira/browse/SPARK-35513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Vinod KC >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35513) Upgrade joda-time to 2.10.10
[ https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35513: Assignee: Apache Spark > Upgrade joda-time to 2.10.10 > > > Key: SPARK-35513 > URL: https://issues.apache.org/jira/browse/SPARK-35513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Vinod KC >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35513) Upgrade joda-time to 2.10.10
[ https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351002#comment-17351002 ] Apache Spark commented on SPARK-35513: -- User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/32661 > Upgrade joda-time to 2.10.10 > > > Key: SPARK-35513 > URL: https://issues.apache.org/jira/browse/SPARK-35513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Vinod KC >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org