[jira] [Assigned] (SPARK-35440) Add language type to `ExpressionInfo` for UDF

2021-05-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35440:
---

Assignee: Linhong Liu

> Add language type to `ExpressionInfo` for UDF
> -
>
> Key: SPARK-35440
> URL: https://issues.apache.org/jira/browse/SPARK-35440
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
>
> add "scala", "java", "python", "hive", "built-in"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35440) Add language type to `ExpressionInfo` for UDF

2021-05-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35440.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32587
[https://github.com/apache/spark/pull/32587]

> Add language type to `ExpressionInfo` for UDF
> -
>
> Key: SPARK-35440
> URL: https://issues.apache.org/jira/browse/SPARK-35440
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
> Fix For: 3.2.0
>
>
> add "scala", "java", "python", "hive", "built-in"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35527) Fix HiveExternalCatalogVersionsSuite to pass with Java 11

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35527:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Fix HiveExternalCatalogVersionsSuite to pass with Java 11
> -
>
> Key: SPARK-35527
> URL: https://issues.apache.org/jira/browse/SPARK-35527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> While checking whether all the tests pass with Java 11 on the current master, 
> I found that HiveExternalCatalogVersionsSuite fails.
> The reason is that Spark 3.0.2 and 3.1.1 don't accept 2.3.8 as a Hive 
> metastore version.
> HiveExternalCatalogVersionsSuite downloads Spark releases from 
> https://dist.apache.org/repos/dist/release/spark/ and runs tests against each 
> release; the releases are currently 3.0.2 and 3.1.1.
> With Java 11, the suite runs with the Hive metastore version that corresponds 
> to the built-in Hive version, which is 2.3.8 on the current master.
> Since branch-3.0 and branch-3.1 don't accept 2.3.8, the suite fails with 
> Java 11.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35527) Fix HiveExternalCatalogVersionsSuite to pass with Java 11

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351482#comment-17351482
 ] 

Apache Spark commented on SPARK-35527:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32670

> Fix HiveExternalCatalogVersionsSuite to pass with Java 11
> -
>
> Key: SPARK-35527
> URL: https://issues.apache.org/jira/browse/SPARK-35527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> While checking whether all the tests pass with Java 11 on the current master, 
> I found that HiveExternalCatalogVersionsSuite fails.
> The reason is that Spark 3.0.2 and 3.1.1 don't accept 2.3.8 as a Hive 
> metastore version.
> HiveExternalCatalogVersionsSuite downloads Spark releases from 
> https://dist.apache.org/repos/dist/release/spark/ and runs tests against each 
> release; the releases are currently 3.0.2 and 3.1.1.
> With Java 11, the suite runs with the Hive metastore version that corresponds 
> to the built-in Hive version, which is 2.3.8 on the current master.
> Since branch-3.0 and branch-3.1 don't accept 2.3.8, the suite fails with 
> Java 11.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35527) Fix HiveExternalCatalogVersionsSuite to pass with Java 11

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35527:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Fix HiveExternalCatalogVersionsSuite to pass with Java 11
> -
>
> Key: SPARK-35527
> URL: https://issues.apache.org/jira/browse/SPARK-35527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> While checking whether all the tests pass with Java 11 on the current master, 
> I found that HiveExternalCatalogVersionsSuite fails.
> The reason is that Spark 3.0.2 and 3.1.1 don't accept 2.3.8 as a Hive 
> metastore version.
> HiveExternalCatalogVersionsSuite downloads Spark releases from 
> https://dist.apache.org/repos/dist/release/spark/ and runs tests against each 
> release; the releases are currently 3.0.2 and 3.1.1.
> With Java 11, the suite runs with the Hive metastore version that corresponds 
> to the built-in Hive version, which is 2.3.8 on the current master.
> Since branch-3.0 and branch-3.1 don't accept 2.3.8, the suite fails with 
> Java 11.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35527) Fix HiveExternalCatalogVersionsSuite to pass with Java 11

2021-05-25 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35527:
--

 Summary: Fix HiveExternalCatalogVersionsSuite to pass with Java 11
 Key: SPARK-35527
 URL: https://issues.apache.org/jira/browse/SPARK-35527
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


While checking whether all the tests pass with Java 11 on the current master, I 
found that HiveExternalCatalogVersionsSuite fails.
The reason is that Spark 3.0.2 and 3.1.1 don't accept 2.3.8 as a Hive metastore 
version.

HiveExternalCatalogVersionsSuite downloads Spark releases from 
https://dist.apache.org/repos/dist/release/spark/ and runs tests against each 
release; the releases are currently 3.0.2 and 3.1.1.

With Java 11, the suite runs with the Hive metastore version that corresponds to 
the built-in Hive version, which is 2.3.8 on the current master.

Since branch-3.0 and branch-3.1 don't accept 2.3.8, the suite fails with Java 11.
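
As a rough illustration of the configuration involved (a minimal sketch, not the suite's actual code), the Hive metastore version an older release is asked to use can be pinned explicitly instead of defaulting to the built-in one; the value 2.3.7 below is an assumption chosen as a version that branch-3.0/branch-3.1 accept:

{code:scala}
// Sketch only: pin the Hive metastore version rather than letting it default to the
// built-in Hive version (2.3.8 on the current master), which Spark 3.0.2/3.1.1 reject.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metastore-version-check")
  .config("spark.sql.hive.metastore.version", "2.3.7")  // assumed to be accepted by 3.0/3.1
  .config("spark.sql.hive.metastore.jars", "maven")     // fetch metastore jars matching that version
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}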



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35526) Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351481#comment-17351481
 ] 

Apache Spark commented on SPARK-35526:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32669

> Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
> -
>
> Key: SPARK-35526
> URL: https://issues.apache.org/jira/browse/SPARK-35526
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Trivial
>
> Similar to SPARK-29291 and SPARK-33352, just to track Spark 3.2.0
>  
> There are still some compilation warnings about `procedure syntax is 
> deprecated`:
>  
> {code:java}
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723: 
> [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return 
> type
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:748: 
> [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `unregisterMergeResult`'s 
> return type
> [WARNING] [Warn] 
> /spark/core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala:223:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare 
> `testSimpleSpillingForAllCodecs`'s return type
> [WARNING] [Warn] 
> /spark/mllib-local/src/test/scala/org/apache/spark/ml/linalg/BLASBenchmark.scala:53:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `runBLASBenchmark`'s return type
> [WARNING] [Warn] 
> /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:110:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `assertEmptyRootPath`'s return 
> type
> [WARNING] [Warn] 
> /spark/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:602:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare 
> `executeCTASWithNonEmptyLocation`'s return type
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35526) Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35526:


Assignee: (was: Apache Spark)

> Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
> -
>
> Key: SPARK-35526
> URL: https://issues.apache.org/jira/browse/SPARK-35526
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Trivial
>
> Similar to SPARK-29291 and SPARK-33352, just to track Spark 3.2.0
>  
> There are still some compilation warnings about `procedure syntax is 
> deprecated`:
>  
> {code:java}
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723: 
> [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return 
> type
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:748: 
> [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `unregisterMergeResult`'s 
> return type
> [WARNING] [Warn] 
> /spark/core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala:223:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare 
> `testSimpleSpillingForAllCodecs`'s return type
> [WARNING] [Warn] 
> /spark/mllib-local/src/test/scala/org/apache/spark/ml/linalg/BLASBenchmark.scala:53:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `runBLASBenchmark`'s return type
> [WARNING] [Warn] 
> /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:110:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `assertEmptyRootPath`'s return 
> type
> [WARNING] [Warn] 
> /spark/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:602:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare 
> `executeCTASWithNonEmptyLocation`'s return type
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35526) Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35526:


Assignee: Apache Spark

> Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
> -
>
> Key: SPARK-35526
> URL: https://issues.apache.org/jira/browse/SPARK-35526
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Trivial
>
> Similar to SPARK-29291 and SPARK-33352, just to track Spark 3.2.0
>  
> There are still some compilation warnings about `procedure syntax is 
> deprecated`:
>  
> {code:java}
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723: 
> [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return 
> type
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:748: 
> [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `unregisterMergeResult`'s 
> return type
> [WARNING] [Warn] 
> /spark/core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala:223:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare 
> `testSimpleSpillingForAllCodecs`'s return type
> [WARNING] [Warn] 
> /spark/mllib-local/src/test/scala/org/apache/spark/ml/linalg/BLASBenchmark.scala:53:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `runBLASBenchmark`'s return type
> [WARNING] [Warn] 
> /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:110:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `assertEmptyRootPath`'s return 
> type
> [WARNING] [Warn] 
> /spark/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:602:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare 
> `executeCTASWithNonEmptyLocation`'s return type
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35526) Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351480#comment-17351480
 ] 

Apache Spark commented on SPARK-35526:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32669

> Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13
> -
>
> Key: SPARK-35526
> URL: https://issues.apache.org/jira/browse/SPARK-35526
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Trivial
>
> Similar to SPARK-29291 and SPARK-33352, just to track Spark 3.2.0
>  
> There are still some compilation warnings about `procedure syntax is 
> deprecated`:
>  
> {code:java}
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723: 
> [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return 
> type
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:748: 
> [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `unregisterMergeResult`'s 
> return type
> [WARNING] [Warn] 
> /spark/core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala:223:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare 
> `testSimpleSpillingForAllCodecs`'s return type
> [WARNING] [Warn] 
> /spark/mllib-local/src/test/scala/org/apache/spark/ml/linalg/BLASBenchmark.scala:53:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `runBLASBenchmark`'s return type
> [WARNING] [Warn] 
> /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:110:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare `assertEmptyRootPath`'s return 
> type
> [WARNING] [Warn] 
> /spark/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:602:
>  [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
> instead, add `: Unit =` to explicitly declare 
> `executeCTASWithNonEmptyLocation`'s return type
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34271) Use majorMinorPatchVersion for Hive version parsing

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351478#comment-17351478
 ] 

Apache Spark commented on SPARK-34271:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32668

> Use majorMinorPatchVersion for Hive version parsing
> ---
>
> Key: SPARK-34271
> URL: https://issues.apache.org/jira/browse/SPARK-34271
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.2.0
>
>
> Currently {{IsolatedClientLoader}} needs to enumerate all Hive patch versions. 
> Therefore, whenever we upgrade the Hive version we have to remember to update 
> the method. It would be better to just check the major and minor version.
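
A rough sketch of the idea (not Spark's actual IsolatedClientLoader code): derive the Hive "short" version from the major.minor prefix so that new patch releases need no code change.

{code:scala}
// Sketch only: map a full Hive version string to its major.minor short version
// instead of enumerating every patch release (2.3.0, 2.3.1, ..., 2.3.8, ...).
val MajorMinorPatch = """^(\d+)\.(\d+)(\.\d+.*)?$""".r

def hiveShortVersion(version: String): String = version match {
  case MajorMinorPatch(major, minor, _) => s"$major.$minor"
  case _ => throw new IllegalArgumentException(s"Unparsable Hive version: $version")
}

assert(hiveShortVersion("2.3.8") == "2.3")  // any future 2.3.x maps the same way
assert(hiveShortVersion("3.1.2") == "3.1")
{code}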



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35526) Re-cleanup `procedure syntax is deprecated` compilation warning in Scala 2.13

2021-05-25 Thread Yang Jie (Jira)
Yang Jie created SPARK-35526:


 Summary: Re-cleanup `procedure syntax is deprecated` compilation 
warning in Scala 2.13
 Key: SPARK-35526
 URL: https://issues.apache.org/jira/browse/SPARK-35526
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Affects Versions: 3.2.0
Reporter: Yang Jie


Similar to SPARK-29291 and SPARK-33352, just to track Spark 3.2.0

 

There are still some compilation warnings about `procedure syntax is 
deprecated`:

 
{code:java}
[WARNING] [Warn] 
/spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723: 
[deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return 
type
[WARNING] [Warn] 
/spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:748: 
[deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
instead, add `: Unit =` to explicitly declare `unregisterMergeResult`'s return 
type
[WARNING] [Warn] 
/spark/core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala:223:
 [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
instead, add `: Unit =` to explicitly declare 
`testSimpleSpillingForAllCodecs`'s return type
[WARNING] [Warn] 
/spark/mllib-local/src/test/scala/org/apache/spark/ml/linalg/BLASBenchmark.scala:53:
 [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
instead, add `: Unit =` to explicitly declare `runBLASBenchmark`'s return type
[WARNING] [Warn] 
/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:110:
 [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
instead, add `: Unit =` to explicitly declare `assertEmptyRootPath`'s return 
type
[WARNING] [Warn] 
/spark/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:602:
 [deprecation @  | origin= | version=2.13.0] procedure syntax is deprecated: 
instead, add `: Unit =` to explicitly declare 
`executeCTASWithNonEmptyLocation`'s return type
{code}
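
For reference, the warning refers to Scala's old procedure syntax, and the fix is mechanical. An illustrative before/after (signature shortened, method name borrowed from the first warning above):

{code:scala}
// Before: deprecated procedure syntax (no "=", implicit Unit return type).
// def registerMergeResult(shuffleId: Int, reduceId: Int) {
//   ...
// }

// After: declare ": Unit =" explicitly, as the warning suggests.
def registerMergeResult(shuffleId: Int, reduceId: Int): Unit = {
  // ...
}
{code}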
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index

2021-05-25 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34859:
-
Priority: Critical  (was: Major)

> Vectorized parquet reader needs synchronization among pages for column index
> 
>
> Key: SPARK-34859
> URL: https://issues.apache.org/jira/browse/SPARK-34859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Li Xian
>Priority: Critical
>  Labels: correctness
> Attachments: 
> part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet
>
>
> The current implementation has a problem: the pages returned by 
> `readNextFilteredRowGroup` may not be aligned, so some columns may have more 
> rows than others.
> Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` 
> with `rowIndexes` to make sure that rows are aligned. 
> Currently `VectorizedParquetRecordReader` doesn't do such synchronization 
> among pages from different columns, so using `readNextFilteredRowGroup` may 
> produce incorrect results.
>  
> I have attached an example Parquet file. The file was generated with 
> `spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and its layout is 
> listed below.
> row group 0
> 
> _1:  INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED 
> [more]...
> _2:  INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 
> ENC:PLAIN,BIT_PACKED [more]...
>     _1 TV=2000 RL=0 DL=0
>     
> 
>     page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 2:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 3:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     _2 TV=2000 RL=0 DL=0
>     
> 
>     page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:1000
>     page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:1000
>  
> As you can see, in row group 0 column _1 has 4 data pages with 500 values 
> each and column _2 has 2 data pages with 1000 values each. 
> If we filter the rows with _1 = 510 using the column index, Parquet returns 
> page 1 of column _1 and page 0 of column _2. Page 1 of column _1 starts at 
> row 500 while page 0 of column _2 starts at row 0, so it is incorrect to 
> simply read the two values as one row.
>  
> As an example, if you filter with _1 = 510 with the column index enabled in 
> the current version, you get the wrong result:
> +---+---+
> |_1 |_2 |
> +---+---+
> |510|10 |
> +---+---+
> And if you turn the column index off, you get the correct result:
> +---+---+
> |_1 |_2 |
> +---+---+
> |510|510|
> +---+---+
>  
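
A sketch of reproducing the mismatch described above from spark-shell. The column-index filtering switch is assumed here to be the Parquet option parquet.filter.columnindex.enabled, and the output path is arbitrary:

{code:scala}
// Sketch only: write the two-column file from the description, then filter with the
// column index on and off. With the bug, the first query can return (510, 10).
import spark.implicits._

spark.range(0, 2000).map(i => (i.toLong, i.toInt)).toDF("_1", "_2")
  .coalesce(1)  // single output file, matching the single row group above
  .write.mode("overwrite").parquet("/tmp/spark-34859")

spark.conf.set("parquet.filter.columnindex.enabled", "true")   // assumed config key
spark.read.parquet("/tmp/spark-34859").filter($"_1" === 510).show()

spark.conf.set("parquet.filter.columnindex.enabled", "false")
spark.read.parquet("/tmp/spark-34859").filter($"_1" === 510).show()
{code}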



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7

2021-05-25 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351473#comment-17351473
 ] 

Yang Jie commented on SPARK-35496:
--

ok [~dongjoon]

> Upgrade Scala 2.13 to 2.13.7
> 
>
> Key: SPARK-35496
> URL: https://issues.apache.org/jira/browse/SPARK-35496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Major
>
> This issue aims to upgrade to Scala 2.13.7.
> Scala 2.13.6 has been released 
> (https://github.com/scala/scala/releases/tag/v2.13.6). However, we skip 2.13.6 
> because it introduces a breaking behavior change that differs from both Scala 
> 2.13.5 and Scala 3:
> - https://github.com/scala/bug/issues/12403
> {code}
> scala3-3.0.0:$ bin/scala
> scala> Array.empty[Double].intersect(Array(0.0))
> val res0: Array[Double] = Array()
> scala-2.13.6:$ bin/scala
> Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
> Type in expressions for evaluation. Or try :help.
> scala> Array.empty[Double].intersect(Array(0.0))
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D
>   ... 32 elided
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35378) Eagerly execute non-root Command

2021-05-25 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-35378:
---
Summary: Eagerly execute non-root Command  (was: Eagerly execute Command)

> Eagerly execute non-root Command
> 
>
> Key: SPARK-35378
> URL: https://issues.apache.org/jira/browse/SPARK-35378
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark doesn't support LeafRunnableCommand as a subquery, because 
> LeafRunnableCommand always outputs GenericInternalRow while some nodes 
> (e.g. SortExec, AdaptiveExecutionExec, WholeCodegenExec) cast rows to 
> UnsafeRow, which causes an error as follows:
> {code:java}
> java.lang.ClassCastException
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> {code}
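
For context, a minimal sketch of the row types involved (not the fix itself): a command's GenericInternalRow cannot simply be cast to UnsafeRow, but it can be converted through an UnsafeProjection.

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Commands produce GenericInternalRow values...
val row: InternalRow = new GenericInternalRow(Array[Any](UTF8String.fromString("t1")))

// ...so an operator that blindly casts fails, as in the stack trace above:
// row.asInstanceOf[UnsafeRow]  // ClassCastException

// Converting through an UnsafeProjection yields a proper UnsafeRow.
val toUnsafe = UnsafeProjection.create(Array[DataType](StringType))
val unsafe: UnsafeRow = toUnsafe(row)
{code}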



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index

2021-05-25 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34859:
-
Labels: correctness  (was: )

> Vectorized parquet reader needs synchronization among pages for column index
> 
>
> Key: SPARK-34859
> URL: https://issues.apache.org/jira/browse/SPARK-34859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Li Xian
>Priority: Major
>  Labels: correctness
> Attachments: 
> part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet
>
>
> The current implementation has a problem: the pages returned by 
> `readNextFilteredRowGroup` may not be aligned, so some columns may have more 
> rows than others.
> Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` 
> with `rowIndexes` to make sure that rows are aligned. 
> Currently `VectorizedParquetRecordReader` doesn't do such synchronization 
> among pages from different columns, so using `readNextFilteredRowGroup` may 
> produce incorrect results.
>  
> I have attached an example Parquet file. The file was generated with 
> `spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and its layout is 
> listed below.
> row group 0
> 
> _1:  INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED 
> [more]...
> _2:  INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 
> ENC:PLAIN,BIT_PACKED [more]...
>     _1 TV=2000 RL=0 DL=0
>     
> 
>     page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 2:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 3:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     _2 TV=2000 RL=0 DL=0
>     
> 
>     page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:1000
>     page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:1000
>  
> As you can see, in row group 0 column _1 has 4 data pages with 500 values 
> each and column _2 has 2 data pages with 1000 values each. 
> If we filter the rows with _1 = 510 using the column index, Parquet returns 
> page 1 of column _1 and page 0 of column _2. Page 1 of column _1 starts at 
> row 500 while page 0 of column _2 starts at row 0, so it is incorrect to 
> simply read the two values as one row.
>  
> As an example, if you filter with _1 = 510 with the column index enabled in 
> the current version, you get the wrong result:
> +---+---+
> |_1 |_2 |
> +---+---+
> |510|10 |
> +---+---+
> And if you turn the column index off, you get the correct result:
> +---+---+
> |_1 |_2 |
> +---+---+
> |510|510|
> +---+---+
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2021-05-25 Thread dc-heros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351464#comment-17351464
 ] 

dc-heros edited comment on SPARK-30696 at 5/26/21, 3:54 AM:


fromUTCtime and toUTCtime produce wrong results on days when Daylight Saving Time 
changes.
 For example, in LA in 1960 the timezone switched from UTC-7h to UTC-8h at 2 AM 
on 1960-09-25, but the previous version had the cutoff at 8 AM.

Because of this, for example 1960-09-25 01:30:00 in LA can be equal to both 
1960-09-25 08:30:00 and 1960-09-25 09:30:00 UTC, and fromUTCtime just picks one 
of them, so these functions are only wrong around the cutoff time.

Could you edit the description [~maxgekk]


was (Author: dc-heros):
fromUTCtime and toUTCtime produced wrong result on Daylight Saving Time changes 
days
For example, in LA in 1960, timezone switch from UTC-7h to UTC-8h at 2AM in 
1960-09-25 but previous version have the cutoff at 8AM

Because of this, for example 1960-09-25 1:30:00 in LA can be equal to both 
1960-09-25 08:30:00 and 1960-09-25 09:30:00, so there just wrong on the cutoff 
time from those function

Could you edit the description [~maxgekk]

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Max Gekk
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2021-05-25 Thread dc-heros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351464#comment-17351464
 ] 

dc-heros commented on SPARK-30696:
--

fromUTCtime and toUTCtime produce wrong results on days when Daylight Saving Time 
changes.
For example, in LA in 1960 the timezone switched from UTC-7h to UTC-8h at 2 AM 
on 1960-09-25, but the previous version had the cutoff at 8 AM.

Because of this, for example 1960-09-25 01:30:00 in LA can be equal to both 
1960-09-25 08:30:00 and 1960-09-25 09:30:00 UTC, so these functions are only 
wrong around the cutoff time.

Could you edit the description [~maxgekk]
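
The overlap can be checked directly with java.time, independent of Spark (a quick illustration):

{code:scala}
import java.time.{LocalDateTime, ZoneId}

val la = ZoneId.of("America/Los_Angeles")
val local = LocalDateTime.of(1960, 9, 25, 1, 30)

// On a fall-back day this wall-clock time is valid for two offsets,
// i.e. it corresponds to two distinct UTC instants.
println(la.getRules.getValidOffsets(local))  // expected: [-07:00, -08:00]
{code}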

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Max Gekk
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30696:


Assignee: (was: Apache Spark)

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Max Gekk
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30696:


Assignee: Apache Spark

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351463#comment-17351463
 ] 

Apache Spark commented on SPARK-30696:
--

User 'dgd-contributor' has created a pull request for this issue:
https://github.com/apache/spark/pull/32666

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Max Gekk
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35512) pyspark partitionBy may encounter 'OverflowError: cannot convert float infinity to integer'

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35512:


Assignee: Apache Spark

> pyspark partitionBy may encounter 'OverflowError: cannot convert float 
> infinity to integer'
> ---
>
> Key: SPARK-35512
> URL: https://issues.apache.org/jira/browse/SPARK-35512
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.2
>Reporter: nolan liu
>Assignee: Apache Spark
>Priority: Major
>
> h2. Code sample
> {code:python}
> # pyspark
> rdd = ...
> new_rdd = rdd.partitionBy(64){code}
> An OverflowError is raised when the input file is big and executor memory is 
> not big enough.
> h2. Error information: 
>  
> {code:java}
> TaskSetManager: Lost task 312.0 in stage 1.0 (TID 748, 11.4.137.5, executor 
> 83): org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
> process()
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/worker.py", line 597, in 
> process
> serializer.dump_stream(out_iter, outfile)
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/serializers.py", line 141, 
> in dump_stream
> for obj in iterator:
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/rdd.py", line 1899, in 
> add_shuffle_key
> OverflowError: cannot convert float infinity to integer
> at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
> at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1209)
> at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1215)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:156)
> at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> at org.apache.spark.scheduler.Task.run(Task.scala:130)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1420)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> h2. Spark code
>  [https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L2072]
> {code:python}
> for k, v in iterator:
>     buckets[partitionFunc(k) % numPartitions].append((k, v))
>     c += 1
> 
>     # check used memory and avg size of chunk of objects
>     if (c % 1000 == 0 and get_used_memory() > limit
>             or c > batch):
>         n, size = len(buckets), 0
>         for split in list(buckets.keys()):
>             yield pack_long(split)
>             d = outputSerializer.dumps(buckets[split])
>             del buckets[split]
>             yield d
>             size += len(d)
> 
>         avg = int(size / n) >> 20
>         # let 1M < avg < 10M
>         if avg < 1:
>             batch *= 1.5
>         elif avg > 10:
>             batch = max(int(batch / 1.5), 1)
>         c = 0
> {code}
> h2. Explanation
> `batch` may grow towards infinity while `get_used_memory() > limit` stays 
> true, and then the conversion overflows at `max(int(batch / 1.5), 1)`.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35512) pyspark partitionBy may encounter 'OverflowError: cannot convert float infinity to integer'

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35512:


Assignee: (was: Apache Spark)

> pyspark partitionBy may encounter 'OverflowError: cannot convert float 
> infinity to integer'
> ---
>
> Key: SPARK-35512
> URL: https://issues.apache.org/jira/browse/SPARK-35512
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.2
>Reporter: nolan liu
>Priority: Major
>
> h2. Code sample
> {code:python}
> # pyspark
> rdd = ...
> new_rdd = rdd.partitionBy(64){code}
> An OverflowError is raised when the input file is big and executor memory is 
> not big enough.
> h2. Error information: 
>  
> {code:java}
> TaskSetManager: Lost task 312.0 in stage 1.0 (TID 748, 11.4.137.5, executor 
> 83): org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
> process()
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/worker.py", line 597, in 
> process
> serializer.dump_stream(out_iter, outfile)
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/serializers.py", line 141, 
> in dump_stream
> for obj in iterator:
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/rdd.py", line 1899, in 
> add_shuffle_key
> OverflowError: cannot convert float infinity to integer
> at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
> at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1209)
> at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1215)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:156)
> at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> at org.apache.spark.scheduler.Task.run(Task.scala:130)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1420)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> h2. Spark code
>  [https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L2072]
> {code:python}
> for k, v in iterator:
>     buckets[partitionFunc(k) % numPartitions].append((k, v))
>     c += 1
> 
>     # check used memory and avg size of chunk of objects
>     if (c % 1000 == 0 and get_used_memory() > limit
>             or c > batch):
>         n, size = len(buckets), 0
>         for split in list(buckets.keys()):
>             yield pack_long(split)
>             d = outputSerializer.dumps(buckets[split])
>             del buckets[split]
>             yield d
>             size += len(d)
> 
>         avg = int(size / n) >> 20
>         # let 1M < avg < 10M
>         if avg < 1:
>             batch *= 1.5
>         elif avg > 10:
>             batch = max(int(batch / 1.5), 1)
>         c = 0
> {code}
> h2. Explanation
> `batch` may grow towards infinity while `get_used_memory() > limit` stays 
> true, and then the conversion overflows at `max(int(batch / 1.5), 1)`.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35512) pyspark partitionBy may encounter 'OverflowError: cannot convert float infinity to integer'

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351461#comment-17351461
 ] 

Apache Spark commented on SPARK-35512:
--

User 'nolanliou' has created a pull request for this issue:
https://github.com/apache/spark/pull/32667

> pyspark partitionBy may encounter 'OverflowError: cannot convert float 
> infinity to integer'
> ---
>
> Key: SPARK-35512
> URL: https://issues.apache.org/jira/browse/SPARK-35512
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.2
>Reporter: nolan liu
>Priority: Major
>
> h2. Code sample
> {code:python}
> # pyspark
> rdd = ...
> new_rdd = rdd.partitionBy(64){code}
> An OverflowError is raised when the input file is big and executor memory is 
> not big enough.
> h2. Error information: 
>  
> {code:java}
> TaskSetManager: Lost task 312.0 in stage 1.0 (TID 748, 11.4.137.5, executor 
> 83): org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
> process()
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/worker.py", line 597, in 
> process
> serializer.dump_stream(out_iter, outfile)
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/serializers.py", line 141, 
> in dump_stream
> for obj in iterator:
> File "/opt/spark3/python/lib/pyspark.zip/pyspark/rdd.py", line 1899, in 
> add_shuffle_key
> OverflowError: cannot convert float infinity to integer
> at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
> at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1209)
> at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1215)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:156)
> at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> at org.apache.spark.scheduler.Task.run(Task.scala:130)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1420)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> h2. Spark code
>  [https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L2072]
> {code:python}
> for k, v in iterator:
>     buckets[partitionFunc(k) % numPartitions].append((k, v))
>     c += 1
> 
>     # check used memory and avg size of chunk of objects
>     if (c % 1000 == 0 and get_used_memory() > limit
>             or c > batch):
>         n, size = len(buckets), 0
>         for split in list(buckets.keys()):
>             yield pack_long(split)
>             d = outputSerializer.dumps(buckets[split])
>             del buckets[split]
>             yield d
>             size += len(d)
> 
>         avg = int(size / n) >> 20
>         # let 1M < avg < 10M
>         if avg < 1:
>             batch *= 1.5
>         elif avg > 10:
>             batch = max(int(batch / 1.5), 1)
>         c = 0
> {code}
> h2. Explanation
> `batch` may grow towards infinity while `get_used_memory() > limit` stays 
> true, and then the conversion overflows at `max(int(batch / 1.5), 1)`.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index

2021-05-25 Thread Li Xian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Xian updated SPARK-34859:

Description: 
The current implementation has a problem: the pages returned by 
`readNextFilteredRowGroup` may not be aligned, so some columns may have more rows 
than others.

Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` 
with `rowIndexes` to make sure that rows are aligned. 

Currently `VectorizedParquetRecordReader` doesn't do such synchronization among 
pages from different columns, so using `readNextFilteredRowGroup` may produce 
incorrect results.

I have attached an example Parquet file. The file was generated with 
`spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and its layout is listed 
below.
row group 0

_1:  INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED 
[more]...
_2:  INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 ENC:PLAIN,BIT_PACKED 
[more]...

    _1 TV=2000 RL=0 DL=0
    
    page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
[more]... VC:500
    page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
[more]... VC:500
    page 2:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
[more]... VC:500
    page 3:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
[more]... VC:500

    _2 TV=2000 RL=0 DL=0
    
    page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
[more]... VC:1000
    page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
[more]... VC:1000
 
As you can see, in row group 0 column _1 has 4 data pages with 500 values each 
and column _2 has 2 data pages with 1000 values each. 
If we filter the rows with _1 = 510 using the column index, Parquet returns page 
1 of column _1 and page 0 of column _2. Page 1 of column _1 starts at row 500 
while page 0 of column _2 starts at row 0, so it is incorrect to simply read the 
two values as one row.

As an example, if you filter with _1 = 510 with the column index enabled in the 
current version, you get the wrong result:
+---+---+
|_1 |_2 |
+---+---+
|510|10 |
+---+---+
And if you turn the column index off, you get the correct result:
+---+---+
|_1 |_2 |
+---+---+
|510|510|
+---+---+
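 
A repro sketch in PySpark against the attached file. Two assumptions are made here: 
that Parquet column-index filtering is controlled by the Hadoop key 
`parquet.filter.columnindex.enabled`, and that Spark forwards data source read 
options into the Hadoop configuration seen by the Parquet reader.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet"

# Column-index filtering on: the vectorized reader may pair page 1 of _1 with
# page 0 of _2, producing the misaligned row (510, 10).
spark.read.option("parquet.filter.columnindex.enabled", "true") \
    .parquet(path).where("_1 = 510").show()

# Column-index filtering off: pages stay aligned and the result is (510, 510).
spark.read.option("parquet.filter.columnindex.enabled", "false") \
    .parquet(path).where("_1 = 510").show()
{code}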
 

  was:
the current implementation has a problem. the pages returned by 
`readNextFilteredRowGroup` may not be aligned, some columns may have more rows 
than others.

Parquet is using `org.apache.parquet.column.impl.SynchronizingColumnReader` 
with `rowIndexes` to make sure that rows are aligned. 

Currently `VectorizedParquetRecordReader` doesn't have such synchronizing among 
pages from different columns. Using `readNextFilteredRowGroup` may result in 
incorrect result.


> Vectorized parquet reader needs synchronization among pages for column index
> 
>
> Key: SPARK-34859
> URL: https://issues.apache.org/jira/browse/SPARK-34859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Li Xian
>Priority: Major
> Attachments: 
> part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet
>
>
> The current implementation has a problem: the pages returned by 
> `readNextFilteredRowGroup` may not be aligned, as some columns may have more 
> rows than others.
> Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` 
> with `rowIndexes` to make sure that rows are aligned.
> Currently `VectorizedParquetRecordReader` has no such synchronization among 
> pages from different columns, so using `readNextFilteredRowGroup` may produce 
> incorrect results.
>  
> I have attached an example parquet file. This file is generated with 
> `spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and the layout of this 
> file is listed below.
> row group 0
> 
> _1:  INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED 
> [more]...
> _2:  INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 
> ENC:PLAIN,BIT_PACKED [more]...
>     _1 TV=2000 RL=0 DL=0
>     
> 
>     page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 2:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 3:  DLE:BIT_PACKED RLE:BIT_PACKED 

[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index

2021-05-25 Thread Li Xian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Xian updated SPARK-34859:

Attachment: 
part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet

> Vectorized parquet reader needs synchronization among pages for column index
> 
>
> Key: SPARK-34859
> URL: https://issues.apache.org/jira/browse/SPARK-34859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Li Xian
>Priority: Major
> Attachments: 
> part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet
>
>
> The current implementation has a problem: the pages returned by 
> `readNextFilteredRowGroup` may not be aligned, as some columns may have more 
> rows than others.
> Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` 
> with `rowIndexes` to make sure that rows are aligned.
> Currently `VectorizedParquetRecordReader` has no such synchronization among 
> pages from different columns, so using `readNextFilteredRowGroup` may produce 
> incorrect results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32195) Standardize warning types and messages

2021-05-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32195.
--
Resolution: Done

> Standardize warning types and messages
> --
>
> Key: SPARK-32195
> URL: https://issues.apache.org/jira/browse/SPARK-32195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Currently PySpark uses a somewhat inconsistent warning type and message such 
> as UserWarning. We should standardize it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32194) Standardize exceptions in PySpark

2021-05-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32194.
--
Fix Version/s: 3.2.0
 Assignee: Hyukjin Kwon
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/32650

> Standardize exceptions in PySpark
> -
>
> Key: SPARK-32194
> URL: https://issues.apache.org/jira/browse/SPARK-32194
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, PySpark throws {{Exception}} or just {{RuntimeException}} in many 
> cases. We should standardize them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35496:


Assignee: Apache Spark

> Upgrade Scala 2.13 to 2.13.7
> 
>
> Key: SPARK-35496
> URL: https://issues.apache.org/jira/browse/SPARK-35496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> This issue aims to upgrade to Scala 2.13.7.
> Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6). 
> However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 
> which is different from both Scala 2.13.5 and Scala 3.
> - https://github.com/scala/bug/issues/12403
> {code}
> scala3-3.0.0:$ bin/scala
> scala> Array.empty[Double].intersect(Array(0.0))
> val res0: Array[Double] = Array()
> scala-2.13.6:$ bin/scala
> Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
> Type in expressions for evaluation. Or try :help.
> scala> Array.empty[Double].intersect(Array(0.0))
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D
>   ... 32 elided
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35496:


Assignee: (was: Apache Spark)

> Upgrade Scala 2.13 to 2.13.7
> 
>
> Key: SPARK-35496
> URL: https://issues.apache.org/jira/browse/SPARK-35496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Major
>
> This issue aims to upgrade to Scala 2.13.7.
> Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6). 
> However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 
> which is different from both Scala 2.13.5 and Scala 3.
> - https://github.com/scala/bug/issues/12403
> {code}
> scala3-3.0.0:$ bin/scala
> scala> Array.empty[Double].intersect(Array(0.0))
> val res0: Array[Double] = Array()
> scala-2.13.6:$ bin/scala
> Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
> Type in expressions for evaluation. Or try :help.
> scala> Array.empty[Double].intersect(Array(0.0))
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D
>   ... 32 elided
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35525) Define UDTs in schemas using string format

2021-05-25 Thread Julian Shalaby (Jira)
Julian Shalaby created SPARK-35525:
--

 Summary: Define UDTs in schemas using string format
 Key: SPARK-35525
 URL: https://issues.apache.org/jira/browse/SPARK-35525
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.1
Reporter: Julian Shalaby


In PySpark, for example, where UDTs are public as of 3.1.1, you can define a schema 
using UDTs in the format:

schema = StructType([StructField("Stuff", MyUDT())])

but the format

schema = "Stuff MyUDT"

does not work.

UDTs are officially being made public again in 3.2.0 for Scala, so this issue 
is pretty important now.
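
A minimal sketch of the two forms, using a hypothetical PointUDT defined only for 
illustration (the UDT subclass and its methods are assumptions, not part of the report):
{code:python}
from pyspark.sql.types import (ArrayType, DoubleType, StructField, StructType,
                               UserDefinedType)

class PointUDT(UserDefinedType):          # hypothetical UDT for illustration
    @classmethod
    def sqlType(cls):
        return ArrayType(DoubleType(), False)

    @classmethod
    def module(cls):
        return "__main__"

    def serialize(self, obj):
        return [obj.x, obj.y]

    def deserialize(self, datum):
        return Point(datum[0], datum[1])

class Point:
    __UDT__ = PointUDT()
    def __init__(self, x, y):
        self.x, self.y = x, y

# Works: programmatic schema referencing the UDT instance.
schema = StructType([StructField("Stuff", PointUDT())])

# Requested but not supported: the DDL string form has no way to name the UDT.
# schema = "Stuff PointUDT"   # fails to parse today
{code}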



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35524) Pass objects as parameters to SparkSQL UDFs

2021-05-25 Thread Julian Shalaby (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian Shalaby updated SPARK-35524:
---
Shepherd:   (was: Sean R. Owen)

> Pass objects as parameters to SparkSQL UDFs
> ---
>
> Key: SPARK-35524
> URL: https://issues.apache.org/jira/browse/SPARK-35524
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.1.1
>Reporter: Julian Shalaby
>Priority: Major
>  Labels: UDF, UDT, spark, spark-sql
>
> You can pass class objects directly to UDFs using the UDF format:
> df.select("*").filter(myFunc(classObj)(col("colName")))
> but the format:
> """SELECT * FROM view WHERE myFunc(classObj, "colName")"""
> or
> """SELECT * FROM view WHERE myFunc(classObj)("colName")"""
> does not work. This would be a very useful feature to have, especially given 
> that UDTs are being made public again in 3.2.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35524) Pass objects as parameters to SparkSQL UDFs

2021-05-25 Thread Julian Shalaby (Jira)
Julian Shalaby created SPARK-35524:
--

 Summary: Pass objects as parameters to SparkSQL UDFs
 Key: SPARK-35524
 URL: https://issues.apache.org/jira/browse/SPARK-35524
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core, SQL
Affects Versions: 3.1.1
Reporter: Julian Shalaby


You can pass class objects directly to UDFs using the UDF format:

df.select("*").filter(myFunc(classObj)(col("colName")))

but the format:

"""SELECT * FROM view WHERE myFunc(classObj, "colName")"""

or

"""SELECT * FROM view WHERE myFunc(classObj)("colName")"""

does not work. This would be a very useful feature to have, especially given 
that UDTs are being made public again in 3.2.0.
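
A workaround sketch in PySpark (the names below are hypothetical, not from the 
report): bind the object into a closure when the UDF is registered, since SQL text 
itself cannot take an arbitrary object as an argument today.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

class Matcher:                            # hypothetical class object
    def __init__(self, needle):
        self.needle = needle
    def matches(self, value):
        return value is not None and self.needle in value

def my_func(class_obj):
    # DataFrame-API style: returns a UDF with class_obj captured, as in
    # df.select("*").filter(my_func(class_obj)(col("colName")))
    return udf(lambda v: class_obj.matches(v), BooleanType())

matcher = Matcher("spark")

# SQL style: the object has to be bound at registration time instead of being
# passed as a parameter inside the query text.
spark.udf.register("myFunc_bound", lambda v: matcher.matches(v), BooleanType())
# spark.sql('SELECT * FROM view WHERE myFunc_bound(colName)')
{code}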



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35455) Enhance EliminateUnnecessaryJoin

2021-05-25 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-35455:
--
Priority: Major  (was: Minor)

> Enhance EliminateUnnecessaryJoin
> 
>
> Key: SPARK-35455
> URL: https://issues.apache.org/jira/browse/SPARK-35455
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.2.0
>
>
> Make EliminateUnnecessaryJoin support to eliminate outer join and multi-join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35455) Enhance EliminateUnnecessaryJoin

2021-05-25 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-35455:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Improvement)

> Enhance EliminateUnnecessaryJoin
> 
>
> Key: SPARK-35455
> URL: https://issues.apache.org/jira/browse/SPARK-35455
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
> Fix For: 3.2.0
>
>
> Make EliminateUnnecessaryJoin support to eliminate outer join and multi-join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-25 Thread Xianghao Lu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351426#comment-17351426
 ] 

Xianghao Lu commented on SPARK-35332:
-

Great, thank you very much for your work [~ulysses]

> Not Coalesce shuffle partitions when cache table
> 
>
> Key: SPARK-35332
> URL: https://issues.apache.org/jira/browse/SPARK-35332
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
> Environment: latest spark version
>Reporter: Xianghao Lu
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: cacheTable.png
>
>
> *How to reproduce the problem*
> _linux shell command to prepare data:_
>  for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
> data.text
> _sql to reproduce the problem:_
>  * create table data_table(id int, str string, num int) row format delimited 
> fields terminated by ',';
>  * load data local inpath '/path/to/data.text' into table data_table;
>  * CACHE TABLE test_cache_table AS
>  SELECT str
>  FROM
>  (SELECT id,str FROM data_table
>  )group by str;
> Finally you will see a stage with 200 tasks whose shuffle partitions are not 
> coalesced; this wastes resources when the data size is small.
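> 
> A sketch of how this can be checked from PySpark, assuming the data_table created 
> in the steps above exists; the settings shown are the standard AQE coalescing knobs:
> {code:python}
> from pyspark.sql import SparkSession
> 
> spark = SparkSession.builder.getOrCreate()
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
> 
> spark.sql("""
>   CACHE TABLE test_cache_table AS
>   SELECT str FROM (SELECT id, str FROM data_table) GROUP BY str
> """)
> 
> # Before the fix this still reports the full 200 shuffle partitions even though
> # the data is tiny; with coalescing it should be far smaller.
> print(spark.table("test_cache_table").rdd.getNumPartitions())
> {code}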



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35504) count distinct asterisk

2021-05-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351425#comment-17351425
 ] 

Hyukjin Kwon commented on SPARK-35504:
--

Thanks for confirmation and investigation man 

> count distinct asterisk 
> 
>
> Key: SPARK-35504
> URL: https://issues.apache.org/jira/browse/SPARK-35504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: {code:java}
> uname -a
> Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>  
> {code:java}
> lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 18.04.4 LTS
> Release:  18.04
> Codename: bionic
> {code}
>  
> {code:java}
> /opt/spark/bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0
>   /_/
> Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
> Branch HEAD
> Compiled by user ubuntu on 2020-06-06T13:05:28Z
> Revision 3fdfce3120f307147244e5eaf46d61419a723d50
> Url https://gitbox.apache.org/repos/asf/spark.git
> Type --help for more information.
> {code}
> {code:java}
> lscpu
> Architecture:x86_64
> CPU op-mode(s):  32-bit, 64-bit
> Byte Order:  Little Endian
> CPU(s):  4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):   1
> NUMA node(s):1
> Vendor ID:   GenuineIntel
> CPU family:  6
> Model:   85
> Model name:  Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
> Stepping:7
> CPU MHz: 3602.011
> BogoMIPS:6000.01
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:1024K
> L3 cache:36608K
> NUMA node0 CPU(s):   0-3
> Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm 
> constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf 
> tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe 
> popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 
> 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms 
> invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd 
> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
> {code}
>  
>Reporter: Nikolay Sokolov
>Priority: Minor
> Attachments: SPARK-35504_first_query_plan.log, 
> SPARK-35504_second_query_plan.log
>
>
> Hi everyone,
> I hope you're well!
>  
> Today I came across a very interesting case when the result of the execution 
> of the algorithm for counting unique rows differs depending on the form 
> (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries.
> I still can't figure out on my own if this is a bug or a feature and I would 
> like to share what I found.
>  
> I run Spark SQL queries through the Thrift (and not only) connecting to the 
> Spark cluster. I use the DBeaver app to execute Spark SQL queries.
>  
> So, I have two identical Spark SQL queries from an algorithmic point of view 
> that return different results.
>  
> The first query:
> {code:sql}
> select count(distinct *) unique_amt from storage_datamart.olympiads
> ; -- Rows: 13437678
> {code}
>  
> The second query:
> {code:sql}
> select count(*) from (select distinct * from storage_datamart.olympiads)
> ; -- Rows: 36901430
> {code}
>  
> The result of the two queries is different. (But it must be the same, right!?)
> {code:sql}
> select 'The first query' description, count(distinct *) unique_amt from 
> storage_datamart.olympiads
>  union all
> select 'The second query', count(*) from (select distinct * from 
> storage_datamart.olympiads)
> ;
> {code}
>  
> The result of the above query is the following:
> {code:java}
> The first query13437678
> The second query   36901430
> {code}
>  
>  I can easily calculate the unique number of rows in the table:
> {code:sql}
> select count(*) from (
>   select student_id, olympiad_id, tour, grade
> from storage_datamart.olympiads
>group by student_id, olympiad_id, tour, grade
>   having count(*) = 1
> )
> ; -- Rows: 36901365
> {code}
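> 
> One plausible explanation, if Spark follows the usual multi-column COUNT(DISTINCT ...) 
> semantics: count(distinct *) skips every row in which any column is NULL, while 
> count( * ) over SELECT DISTINCT * keeps such rows. A tiny PySpark sketch with 
> hypothetical data:
> {code:python}
> from pyspark.sql import SparkSession
> 
> spark = SparkSession.builder.getOrCreate()
> spark.createDataFrame([(1, "a"), (1, "a"), (2, None)], ["id", "name"]) \
>     .createOrReplaceTempView("t")
> 
> spark.sql("SELECT count(DISTINCT *) FROM t").show()                  # 1: (2, NULL) is skipped
> spark.sql("SELECT count(*) FROM (SELECT DISTINCT * FROM t)").show()  # 2: (2, NULL) is counted
> {code}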
>  
> The table DDL is the following:
> {code:sql}
> CREATE TABLE `storage_datamart`.`olympiads` (
>   `ptn_date` DATE,
>   `student_id` BIGINT,
>   `olympiad_id` STRING,
>   `grade` BIGINT,
>   `grade_type` STRING,
>   `tour` STRING,
>   `created_at` TIMESTAMP,
>   `created_at_local` TIMESTAMP,
>   `olympiad_num` BIGINT,
>   `olympiad_name` STRING,
>   `subject` STRING,

[jira] [Created] (SPARK-35523) Fix the default value properly in Data Source Options page

2021-05-25 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-35523:
---

 Summary: Fix the default value properly in Data Source Options page
 Key: SPARK-35523
 URL: https://issues.apache.org/jira/browse/SPARK-35523
 Project: Spark
  Issue Type: Sub-task
  Components: docs
Affects Versions: 3.2.0
Reporter: Haejoon Lee


The default values in the Data Source Options page currently follow the Python API 
documents, but we'd better follow the Scaladoc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35521) List Python 3.8 installed libraries in build_and_test workflow

2021-05-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35521.
--
Resolution: Duplicate

> List Python 3.8 installed libraries in build_and_test workflow
> --
>
> Key: SPARK-35521
> URL: https://issues.apache.org/jira/browse/SPARK-35521
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> In the build_and_test workflow, tests are run against both Python 3.6 and 
> Python 3.8. However, only libraries installed in Python 3.6 are listed. We 
> should list Python 3.8's installed libraries as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35506) Run tests with Python 3.9 in GitHub Actions

2021-05-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35506.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32657
[https://github.com/apache/spark/pull/32657]

> Run tests with Python 3.9 in GitHub Actions
> ---
>
> Key: SPARK-35506
> URL: https://issues.apache.org/jira/browse/SPARK-35506
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> We're currently running PySpark tests with Python 3.8. We should run it with 
> Python 3.9 to verify the latest python support.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35522) Introduce BinaryOps for BinaryType

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35522:


Assignee: (was: Apache Spark)

> Introduce BinaryOps for BinaryType
> --
>
> Key: SPARK-35522
> URL: https://issues.apache.org/jira/browse/SPARK-35522
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> BinaryType, which represents byte sequence values in Spark, doesn't support 
> data-type-based operations yet. We are going to introduce BinaryOps for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35522) Introduce BinaryOps for BinaryType

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35522:


Assignee: Apache Spark

> Introduce BinaryOps for BinaryType
> --
>
> Key: SPARK-35522
> URL: https://issues.apache.org/jira/browse/SPARK-35522
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> BinaryType, which represents byte sequence values in Spark, doesn't support 
> data-type-based operations yet. We are going to introduce BinaryOps for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35522) Introduce BinaryOps for BinaryType

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351373#comment-17351373
 ] 

Apache Spark commented on SPARK-35522:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32665

> Introduce BinaryOps for BinaryType
> --
>
> Key: SPARK-35522
> URL: https://issues.apache.org/jira/browse/SPARK-35522
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> BinaryType, which represents byte sequence values in Spark, doesn't support 
> data-type-based operations yet. We are going to introduce BinaryOps for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35522) Introduce BinaryOps for BinaryType

2021-05-25 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-35522:


 Summary: Introduce BinaryOps for BinaryType
 Key: SPARK-35522
 URL: https://issues.apache.org/jira/browse/SPARK-35522
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


BinaryType, which represents byte sequence values in Spark, doesn't support 
data-type-based operations yet. We are going to introduce BinaryOps for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35516) Storage UI tab Storage Level tool tip correction

2021-05-25 Thread Lidiya Nixon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351346#comment-17351346
 ] 

Lidiya Nixon commented on SPARK-35516:
--

I have raised a fix for this

https://github.com/apache/spark/pull/32664

> Storage UI tab Storage Level tool tip correction
> 
>
> Key: SPARK-35516
> URL: https://issues.apache.org/jira/browse/SPARK-35516
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.1
>Reporter: jobit mathew
>Priority: Trivial
>
> Storage UI tab Storage Level tool tip correction required.
> In the "Storage Level" tooltip, please change *andreplication* to *and replication*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35516) Storage UI tab Storage Level tool tip correction

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351344#comment-17351344
 ] 

Apache Spark commented on SPARK-35516:
--

User 'lidiyag' has created a pull request for this issue:
https://github.com/apache/spark/pull/32664

> Storage UI tab Storage Level tool tip correction
> 
>
> Key: SPARK-35516
> URL: https://issues.apache.org/jira/browse/SPARK-35516
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.1
>Reporter: jobit mathew
>Priority: Trivial
>
> Storage UI tab Storage Level tool tip correction required.
> In the "Storage Level" tooltip, please change *andreplication* to *and replication*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35516) Storage UI tab Storage Level tool tip correction

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35516:


Assignee: Apache Spark

> Storage UI tab Storage Level tool tip correction
> 
>
> Key: SPARK-35516
> URL: https://issues.apache.org/jira/browse/SPARK-35516
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.1
>Reporter: jobit mathew
>Assignee: Apache Spark
>Priority: Trivial
>
> Storage UI tab Storage Level tool tip correction required.
> In the "Storage Level" tooltip, please change *andreplication* to *and replication*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35516) Storage UI tab Storage Level tool tip correction

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35516:


Assignee: (was: Apache Spark)

> Storage UI tab Storage Level tool tip correction
> 
>
> Key: SPARK-35516
> URL: https://issues.apache.org/jira/browse/SPARK-35516
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.1
>Reporter: jobit mathew
>Priority: Trivial
>
> Storage UI tab Storage Level tool tip correction required.
> In the "Storage Level" tooltip, please change *andreplication* to *and replication*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35513) Upgrade joda-time to 2.10.10

2021-05-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35513.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32661
[https://github.com/apache/spark/pull/32661]

> Upgrade joda-time to 2.10.10
> 
>
> Key: SPARK-35513
> URL: https://issues.apache.org/jira/browse/SPARK-35513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Minor
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35513) Upgrade joda-time to 2.10.10

2021-05-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35513:
-

Assignee: Vinod KC

> Upgrade joda-time to 2.10.10
> 
>
> Key: SPARK-35513
> URL: https://issues.apache.org/jira/browse/SPARK-35513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35505) Remove APIs that have been deprecated in Koalas.

2021-05-25 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-35505.
---
Fix Version/s: 3.2.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Issue resolved by pull request 32656
https://github.com/apache/spark/pull/32656

> Remove APIs that have been deprecated in Koalas.
> 
>
> Key: SPARK-35505
> URL: https://issues.apache.org/jira/browse/SPARK-35505
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>
> There are some APIs that have been deprecated in Koalas. We shouldn't have 
> those in pandas APIs on Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh

2021-05-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35514:
--
Fix Version/s: (was: 3.1.2)
   3.1.3

> Automatically update version index of DocSearch via release-tag.sh
> --
>
> Key: SPARK-35514
> URL: https://issues.apache.org/jira/browse/SPARK-35514
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0, 3.1.3
>
>
> Automatically update version index of DocSearch via release-tag.sh for 
> releasing new documentation site, instead of the current manual update.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35521) List Python 3.8 installed libraries in build_and_test workflow

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351218#comment-17351218
 ] 

Apache Spark commented on SPARK-35521:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32663

> List Python 3.8 installed libraries in build_and_test workflow
> --
>
> Key: SPARK-35521
> URL: https://issues.apache.org/jira/browse/SPARK-35521
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> In the build_and_test workflow, tests are run against both Python 3.6 and 
> Python 3.8. However, only libraries installed in Python 3.6 are listed. We 
> should list Python 3.8's installed libraries as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35521) List Python 3.8 installed libraries in build_and_test workflow

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35521:


Assignee: Apache Spark

> List Python 3.8 installed libraries in build_and_test workflow
> --
>
> Key: SPARK-35521
> URL: https://issues.apache.org/jira/browse/SPARK-35521
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> In the build_and_test workflow, tests are run against both Python 3.6 and 
> Python 3.8. However, only libraries installed in Python 3.6 are listed. We 
> should list Python 3.8's installed libraries as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35521) List Python 3.8 installed libraries in build_and_test workflow

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35521:


Assignee: (was: Apache Spark)

> List Python 3.8 installed libraries in build_and_test workflow
> --
>
> Key: SPARK-35521
> URL: https://issues.apache.org/jira/browse/SPARK-35521
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> In the build_and_test workflow, tests are run against both Python 3.6 and 
> Python 3.8. However, only libraries installed in Python 3.6 are listed. We 
> should list Python 3.8's installed libraries as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35341) Introduce ExtentionDtypeOps

2021-05-25 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-35341:
-
Description: 
Now {{__and__}}, {{__or__}}, {{__rand__}}, and {{__ror__}} are not data-type-based.

So we would like to introduce these operators to the DataTypeOps class.

extension_dtypes process these operators differently from the rest of the types.

So we would also introduce ExtentionDtypeOps.

ExtentionDtypeOps would be helpful for other data-type-based operations, for 
example, to/from pandas conversion as well.

  was:
Now __and__, __or__,_ _rand__, and __ror__ are not data type based.

So we would like to introduce  __and__, __or__,_ _rand__, and __ror__ to each 
DataTypeOps subclass.

extension_dtypes process __and__, __or__,_ _rand__, and __ror__ differently 
from the rest of types.

So we would also introduce ExtentionDtypeOps.


> Introduce ExtentionDtypeOps
> ---
>
> Key: SPARK-35341
> URL: https://issues.apache.org/jira/browse/SPARK-35341
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Now {{__and__}}, {{__or__}}, {{__rand__}}, and {{__ror__}} are not data-type-based.
> So we would like to introduce these operators to the DataTypeOps class.
> extension_dtypes process these operators differently from the rest of the 
> types.
> So we would also introduce ExtentionDtypeOps.
> ExtentionDtypeOps would be helpful for other data-type-based operations, for 
> example, to/from pandas conversion as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35521) List Python 3.8 installed libraries in build_and_test workflow

2021-05-25 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-35521:


 Summary: List Python 3.8 installed libraries in build_and_test 
workflow
 Key: SPARK-35521
 URL: https://issues.apache.org/jira/browse/SPARK-35521
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Xinrong Meng
 Fix For: 3.2.0


In the build_and_test workflow, tests are run against both Python 3.6 and 
Python 3.8. However, only libraries installed in Python 3.6 are listed. We 
should list Python 3.8's installed libraries as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7

2021-05-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35496:
--
Description: 
This issue aims to upgrade to Scala 2.13.7.

Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6). 
However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 
which is different from both Scala 2.13.5 and Scala 3.
- https://github.com/scala/bug/issues/12403
{code}
scala3-3.0.0:$ bin/scala
scala> Array.empty[Double].intersect(Array(0.0))
val res0: Array[Double] = Array()

scala-2.13.6:$ bin/scala
Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
Type in expressions for evaluation. Or try :help.

scala> Array.empty[Double].intersect(Array(0.0))
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D
  ... 32 elided
{code}

  was:
Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6)

However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 
which is different from both Scala 2.13.5 and Scala 3.
- https://github.com/scala/bug/issues/12403
{code}
scala3-3.0.0:$ bin/scala
scala> Array.empty[Double].intersect(Array(0.0))
val res0: Array[Double] = Array()

scala-2.13.6:$ bin/scala
Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
Type in expressions for evaluation. Or try :help.

scala> Array.empty[Double].intersect(Array(0.0))
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D
  ... 32 elided
{code}


> Upgrade Scala 2.13 to 2.13.7
> 
>
> Key: SPARK-35496
> URL: https://issues.apache.org/jira/browse/SPARK-35496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Major
>
> This issue aims to upgrade to Scala 2.13.7.
> Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6). 
> However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 
> which is different from both Scala 2.13.5 and Scala 3.
> - https://github.com/scala/bug/issues/12403
> {code}
> scala3-3.0.0:$ bin/scala
> scala> Array.empty[Double].intersect(Array(0.0))
> val res0: Array[Double] = Array()
> scala-2.13.6:$ bin/scala
> Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
> Type in expressions for evaluation. Or try :help.
> scala> Array.empty[Double].intersect(Array(0.0))
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D
>   ... 32 elided
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7

2021-05-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35496:
--
Description: 
Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6)

However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 
which is different from both Scala 2.13.5 and Scala 3.
- https://github.com/scala/bug/issues/12403
{code}
scala3-3.0.0:$ bin/scala
scala> Array.empty[Double].intersect(Array(0.0))
val res0: Array[Double] = Array()

scala-2.13.6:$ bin/scala
Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
Type in expressions for evaluation. Or try :help.

scala> Array.empty[Double].intersect(Array(0.0))
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D
  ... 32 elided
{code}

  was:Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6)


> Upgrade Scala 2.13 to 2.13.7
> 
>
> Key: SPARK-35496
> URL: https://issues.apache.org/jira/browse/SPARK-35496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Major
>
> Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6)
> However, we skip 2.13.6 because there is a breaking behavior change at 2.13.6 
> which is different from both Scala 2.13.5 and Scala 3.
> - https://github.com/scala/bug/issues/12403
> {code}
> scala3-3.0.0:$ bin/scala
> scala> Array.empty[Double].intersect(Array(0.0))
> val res0: Array[Double] = Array()
> scala-2.13.6:$ bin/scala
> Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
> Type in expressions for evaluation. Or try :help.
> scala> Array.empty[Double].intersect(Array(0.0))
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D
>   ... 32 elided
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6

2021-05-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351187#comment-17351187
 ] 

Dongjoon Hyun commented on SPARK-35496:
---

Hi, [~LuciferYang]. Let's reuse this issue for Scala 2.13.7.

> Upgrade Scala 2.13 from 2.13.5 to 2.13.6
> 
>
> Key: SPARK-35496
> URL: https://issues.apache.org/jira/browse/SPARK-35496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6

2021-05-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-35496:
---
  Assignee: (was: Apache Spark)

> Upgrade Scala 2.13 from 2.13.5 to 2.13.6
> 
>
> Key: SPARK-35496
> URL: https://issues.apache.org/jira/browse/SPARK-35496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Major
>
> Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7

2021-05-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35496:
--
Summary: Upgrade Scala 2.13 to 2.13.7  (was: Upgrade Scala 2.13 from 2.13.5 
to 2.13.6)

> Upgrade Scala 2.13 to 2.13.7
> 
>
> Key: SPARK-35496
> URL: https://issues.apache.org/jira/browse/SPARK-35496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Major
>
> Scala 2.13.6 released(https://github.com/scala/scala/releases/tag/v2.13.6)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh

2021-05-25 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-35514.

Fix Version/s: 3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 32662
[https://github.com/apache/spark/pull/32662]

> Automatically update version index of DocSearch via release-tag.sh
> --
>
> Key: SPARK-35514
> URL: https://issues.apache.org/jira/browse/SPARK-35514
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> Automatically update version index of DocSearch via release-tag.sh for 
> releasing new documentation site, instead of the current manual update.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35520) Spark-SQL test fails on IBM Z for certain config combinations.

2021-05-25 Thread Simrit Kaur (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simrit Kaur updated SPARK-35520:

Description: 
Some queries of SQL related test cases: in-joins.sql, in-order-by.sql, 
not-in-group-by.sql and SubquerySuite.scala are failing with specific 
configuration combinations on IBM Z(s390x).

For example: 

sql("select * from l where a = 6 and a not in (select c from r where c is not 
null)") query from SubquerySuite.scala fails for following config combinations:
|enableNAAJ|enableAQE|enableCodegen|
|TRUE|FALSE|FALSE|
|TRUE|TRUE|FALSE|

The above combination is also causing 2 other queries in in-joins.sql and 
in-order-by.sql to fail.

Another query: 

SELECT Count(*)
 FROM (SELECT *
 FROM t2
 WHERE t2a NOT IN (SELECT t3a
 FROM t3
 WHERE t3h != t2h)) t2
 WHERE t2b NOT IN (SELECT Min(t2b)
 FROM t2
 WHERE t2b = t2b
 GROUP BY t2c);

from not-in-group-by.sql is failing for following combinations:
|enableAQE|enableCodegen|
|FALSE|TRUE|
|FALSE|FALSE|

 

These test cases are not failing for the 3.0.1 release, and I believe the regression 
might have been introduced with 
[SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290].

There is another strange behaviour observed: if the expected output is 1, 3, I get 
1, 3, 9. If I update the golden file to expect 1, 3, 9, the output becomes 1, 3.

  was:
Some queries of SQL related test cases: in-joins.sql, in-order-by.sql, 
not-in-group-by.sql and SubquerySuite.scala are failing with specific 
configuration combinations on IBM Z(s390x).

For example: 

sql("select * from l where a = 6 and a not in (select c from r where c is not 
null)") query from SubquerySuite.scala fails for following config combinations:
|enableNAAJ|enableAQE|enableCodegen|
|TRUE|FALSE|FALSE|
|TRUE|TRUE|FALSE|

The above combination is also causing 2 other queries in in-joins.sql and 
in-order-by.sql failing.

Another query: 

SELECT Count(*)
FROM (SELECT *
 FROM t2
 WHERE t2a NOT IN (SELECT t3a
 FROM t3
 WHERE t3h != t2h)) t2
WHERE t2b NOT IN (SELECT Min(t2b)
 FROM t2
 WHERE t2b = t2b
 GROUP BY t2c);

from not-in-group-by.sql is failing for following combinations:
|enableAQE|enableCodegen|
|FALSE|TRUE|
|FALSE|FALSE|

 

These Test cases are not failing for 3.0.1 release and I believe might have 
been introduced with [#SPARK-32290] . 

There is another strange behaviour observed, if expected output is 1,3 , I am 
getting 1, 3, 9. If I update the Golden file to expect 1, 3, 9, the output will 
be 1, 3.


> Spark-SQL test fails on IBM Z for certain config combinations.
> --
>
> Key: SPARK-35520
> URL: https://issues.apache.org/jira/browse/SPARK-35520
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.1
>Reporter: Simrit Kaur
>Priority: Major
>
> Some queries of SQL related test cases: in-joins.sql, in-order-by.sql, 
> not-in-group-by.sql and SubquerySuite.scala are failing with specific 
> configuration combinations on IBM Z(s390x).
> For example: 
> sql("select * from l where a = 6 and a not in (select c from r where c is not 
> null)") query from SubquerySuite.scala fails for following config 
> combinations:
> |enableNAAJ|enableAQE|enableCodegen|
> |TRUE|FALSE|FALSE|
> |TRUE|TRUE|FALSE|
> The above combination is also causing 2 other queries in in-joins.sql and 
> in-order-by.sql to fail.
> Another query: 
> SELECT Count(*)
>  FROM (SELECT *
>  FROM t2
>  WHERE t2a NOT IN (SELECT t3a
>  FROM t3
>  WHERE t3h != t2h)) t2
>  WHERE t2b NOT IN (SELECT Min(t2b)
>  FROM t2
>  WHERE t2b = t2b
>  GROUP BY t2c);
> from not-in-group-by.sql is failing for following combinations:
> |enableAQE|enableCodegen|
> |FALSE|TRUE|
> |FALSE|FALSE|
>  
> These test cases are not failing for the 3.0.1 release, and I believe the 
> regression might have been introduced with 
> [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290].
> There is another strange behaviour observed: if the expected output is 1, 3, I 
> get 1, 3, 9. If I update the golden file to expect 1, 3, 9, the output becomes 
> 1, 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35520) Spark-SQL test fails on IBM Z for certain config combinations.

2021-05-25 Thread Simrit Kaur (Jira)
Simrit Kaur created SPARK-35520:
---

 Summary: Spark-SQL test fails on IBM Z for certain config 
combinations.
 Key: SPARK-35520
 URL: https://issues.apache.org/jira/browse/SPARK-35520
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 3.1.1
Reporter: Simrit Kaur


Some queries of SQL related test cases: in-joins.sql, in-order-by.sql, 
not-in-group-by.sql and SubquerySuite.scala are failing with specific 
configuration combinations on IBM Z(s390x).

For example: 

sql("select * from l where a = 6 and a not in (select c from r where c is not 
null)") query from SubquerySuite.scala fails for following config combinations:
|enableNAAJ|enableAQE|enableCodegen|
|TRUE|FALSE|FALSE|
|TRUE|TRUE|FALSE|

The above combination is also causing 2 other queries in in-joins.sql and 
in-order-by.sql to fail.

Another query: 

SELECT Count(*)
FROM (SELECT *
 FROM t2
 WHERE t2a NOT IN (SELECT t3a
 FROM t3
 WHERE t3h != t2h)) t2
WHERE t2b NOT IN (SELECT Min(t2b)
 FROM t2
 WHERE t2b = t2b
 GROUP BY t2c);

from not-in-group-by.sql is failing for following combinations:
|enableAQE|enableCodegen|
|FALSE|TRUE|
|FALSE|FALSE|

 

These test cases are not failing for the 3.0.1 release, and I believe the regression 
might have been introduced with [#SPARK-32290].

There is another strange behaviour observed: if the expected output is 1, 3, I get 
1, 3, 9. If I update the golden file to expect 1, 3, 9, the output becomes 1, 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35517) Critical Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar

2021-05-25 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351134#comment-17351134
 ] 

Vinod KC commented on SPARK-35517:
--

[~ldeflandre], in Spark 3.2.0, SPARK-34784 upgraded the jackson-databind version 
to 2.12.2.

> Critical Vulnerabilities: jackson-databind 2.4.0 shipped with 
> htrace-core4-4.1.0-incubating.jar
> ---
>
> Key: SPARK-35517
> URL: https://issues.apache.org/jira/browse/SPARK-35517
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Louis DEFLANDRE
>Priority: Major
>
> Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
> {{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{jackson-databind}} 
> {{2.4.0}} :
>  * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489]
>  * 
> [CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718]
> This package is shipped within {{jars/htrace-core4-4.1.0-incubating.jar}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35517) Critical Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar

2021-05-25 Thread Louis DEFLANDRE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Louis DEFLANDRE updated SPARK-35517:

Description: 
Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
{{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{jackson-databind}} 
{{2.4.0}} :
 * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489]
 * 
[CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718]

This package is shipped within {{jars/htrace-core4-4.1.0-incubating.jar}}

  was:
Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
`spark-3.0.2-bin-hadoop3.2` coming from obsolete `jackson-databind` 2.4.0 :

* [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489]
* 
[CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718]

This package is shipped within `jars/htrace-core4-4.1.0-incubating.jar`




> Critical Vulnerabilities: jackson-databind 2.4.0 shipped with 
> htrace-core4-4.1.0-incubating.jar
> ---
>
> Key: SPARK-35517
> URL: https://issues.apache.org/jira/browse/SPARK-35517
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Louis DEFLANDRE
>Priority: Major
>
> Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
> {{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{jackson-databind}} 
> {{2.4.0}} :
>  * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489]
>  * 
> [CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718]
> This package is shipped within {{jars/htrace-core4-4.1.0-incubating.jar}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35518) Critical Vulnerabilities: log4j_log4j 1.2.17 shipped

2021-05-25 Thread Louis DEFLANDRE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Louis DEFLANDRE updated SPARK-35518:

Description: 
Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
{{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{log4j_log4j}} {{1.2.17}} :
 * [CVE-2019-17571|https://nvd.nist.gov/vuln/detail/CVE-2019-17571]

This package is shipped within {{jars/log4j-1.2.17.jar}}

  was:
Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
`spark-3.0.2-bin-hadoop3.2` coming from obsolete `log4j_log4j`  `1.2.17` :

* [CVE-2019-17571|https://nvd.nist.gov/vuln/detail/CVE-2019-17571]

This package is shipped within `jars/log4j-1.2.17.jar`



> Critical Vulnerabilities: log4j_log4j 1.2.17 shipped
> 
>
> Key: SPARK-35518
> URL: https://issues.apache.org/jira/browse/SPARK-35518
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Louis DEFLANDRE
>Priority: Major
>
> Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
> {{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{log4j_log4j}} {{1.2.17}} 
> :
>  * [CVE-2019-17571|https://nvd.nist.gov/vuln/detail/CVE-2019-17571]
> This package is shipped within {{jars/log4j-1.2.17.jar}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35519) Critical Vulnerabilities: nimbusds_nimbus-jose-jwt 4.41.1 shipped

2021-05-25 Thread Louis DEFLANDRE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Louis DEFLANDRE updated SPARK-35519:

Description: 
Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
{{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{nimbus-jose-jwt}} 
{{4.41.1}} :

*  [CVE-2019-17195|https://nvd.nist.gov/vuln/detail/CVE-2019-17195]

This package is shipped within {{jars/nimbus-jose-jwt-4.41.1.jar}}


  was:


Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
`spark-3.0.2-bin-hadoop3.2` coming from obsolete `nimbus-jose-jwt` `4.41.1` :

*  [CVE-2019-17195|https://nvd.nist.gov/vuln/detail/CVE-2019-17195]

This package is shipped within `jars/nimbus-jose-jwt-4.41.1.jar`



> Critical Vulnerabilities: nimbusds_nimbus-jose-jwt 4.41.1 shipped
> -
>
> Key: SPARK-35519
> URL: https://issues.apache.org/jira/browse/SPARK-35519
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Louis DEFLANDRE
>Priority: Major
>
> Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
> {{spark-3.0.2-bin-hadoop3.2}} coming from obsolete {{nimbus-jose-jwt}} 
> {{4.41.1}} :
> *  [CVE-2019-17195|https://nvd.nist.gov/vuln/detail/CVE-2019-17195]
> This package is shipped within {{jars/nimbus-jose-jwt-4.41.1.jar}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35519) Critical Vulnerabilities: nimbusds_nimbus-jose-jwt 4.41.1 shipped

2021-05-25 Thread Louis DEFLANDRE (Jira)
Louis DEFLANDRE created SPARK-35519:
---

 Summary: Critical Vulnerabilities: nimbusds_nimbus-jose-jwt 4.41.1 
shipped
 Key: SPARK-35519
 URL: https://issues.apache.org/jira/browse/SPARK-35519
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.2
Reporter: Louis DEFLANDRE




Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
`spark-3.0.2-bin-hadoop3.2` coming from obsolete `nimbus-jose-jwt` `4.41.1` :

*  [CVE-2019-17195|https://nvd.nist.gov/vuln/detail/CVE-2019-17195]

This package is shipped within `jars/nimbus-jose-jwt-4.41.1.jar`




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10388) Public dataset loader interface

2021-05-25 Thread Gaurav Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Kumar updated SPARK-10388:
-
Comment: was deleted

(was: I want to work on this issue [~mengxr], yet I am new to opensource. I 
would love to hear from you.)

> Public dataset loader interface
> ---
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Major
> Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf
>
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> User should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35517) Critical Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar

2021-05-25 Thread Louis DEFLANDRE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Louis DEFLANDRE updated SPARK-35517:

Summary: Critical Vulnerabilities: jackson-databind 2.4.0 shipped with 
htrace-core4-4.1.0-incubating.jar  (was: Vulnerabilities: jackson-databind 
2.4.0 shipped with htrace-core4-4.1.0-incubating.jar)

> Critical Vulnerabilities: jackson-databind 2.4.0 shipped with 
> htrace-core4-4.1.0-incubating.jar
> ---
>
> Key: SPARK-35517
> URL: https://issues.apache.org/jira/browse/SPARK-35517
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Louis DEFLANDRE
>Priority: Major
>
> Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
> `spark-3.0.2-bin-hadoop3.2` coming from obsolete `jackson-databind` 2.4.0 :
> * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489]
> * 
> [CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718]
> This package is shipped within `jars/htrace-core4-4.1.0-incubating.jar`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35518) Critical Vulnerabilities: log4j_log4j 1.2.17 shipped

2021-05-25 Thread Louis DEFLANDRE (Jira)
Louis DEFLANDRE created SPARK-35518:
---

 Summary: Critical Vulnerabilities: log4j_log4j 1.2.17 shipped
 Key: SPARK-35518
 URL: https://issues.apache.org/jira/browse/SPARK-35518
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.2
Reporter: Louis DEFLANDRE


Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
`spark-3.0.2-bin-hadoop3.2` coming from obsolete `log4j_log4j`  `1.2.17` :

* [CVE-2019-17571|https://nvd.nist.gov/vuln/detail/CVE-2019-17571]

This package is shipped within `jars/log4j-1.2.17.jar`




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10388) Public dataset loader interface

2021-05-25 Thread Gaurav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351097#comment-17351097
 ] 

Gaurav Kumar commented on SPARK-10388:
--

I want to work on this issue [~mengxr], yet I am new to open source. I would 
love to hear from you.

> Public dataset loader interface
> ---
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Major
> Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf
>
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> User should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.
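
A purely hypothetical shape for the third-party repo registration mentioned above could look like the following sketch; none of these names exist in Spark, they are only illustrative of one possible design:

{code:scala}
// Hypothetical API sketch only -- the issue leaves the actual design open.
trait DatasetRepo {
  def name: String
  def list(): Seq[String]                  // dataset names available in the repo
  def fetch(dataset: String): java.net.URI // where to download the dataset from
}

object DatasetRepoRegistry {
  private val repos = scala.collection.mutable.Map.empty[String, DatasetRepo]
  def register(repo: DatasetRepo): Unit = repos(repo.name) = repo
  def get(name: String): Option[DatasetRepo] = repos.get(name)
}
{code}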



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35517) Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar

2021-05-25 Thread Louis DEFLANDRE (Jira)
Louis DEFLANDRE created SPARK-35517:
---

 Summary: Vulnerabilities: jackson-databind 2.4.0 shipped with 
htrace-core4-4.1.0-incubating.jar
 Key: SPARK-35517
 URL: https://issues.apache.org/jira/browse/SPARK-35517
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.2
Reporter: Louis DEFLANDRE


Vulnerabilities scanner is highlighting following CRITICAL vulnerabilities in 
`spark-3.0.2-bin-hadoop3.2` coming from obsolete `jackson-databind` 2.4.0 :

* [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489]
* 
[CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718]

This package is shipped within `jars/htrace-core4-4.1.0-incubating.jar`





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35516) Storage UI tab Storage Level tool tip correction

2021-05-25 Thread jobit mathew (Jira)
jobit mathew created SPARK-35516:


 Summary: Storage UI tab Storage Level tool tip correction
 Key: SPARK-35516
 URL: https://issues.apache.org/jira/browse/SPARK-35516
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.1.1
Reporter: jobit mathew


Storage UI tab Storage Level tool tip correction required.

Please change *andreplication* to *and replication*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35396) Support to manual close/release entries in MemoryStore and InMemoryRelation instead of replying on GC

2021-05-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-35396:
-
Issue Type: Improvement  (was: New Feature)
  Priority: Minor  (was: Major)

> Support to manual close/release entries in MemoryStore and InMemoryRelation 
> instead of replying on GC
> -
>
> Key: SPARK-35396
> URL: https://issues.apache.org/jira/browse/SPARK-35396
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Chendi.Xue
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> This PR is proposing an add-on to support manually closing entries in 
> MemoryStore and InMemoryRelation
> h3. What changes were proposed in this pull request?
> Currently:
> MemoryStore uses a LinkedHashMap[BlockId, MemoryEntry[_]] to store all OnHeap 
> or OffHeap entries.
> And when memoryStore.remove(blockId) is called, the code will simply remove one 
> entry from the LinkedHashMap and leverage Java GC to do the release work.
> This PR:
> We are proposing an add-on to manually close any object stored in MemoryStore 
> and InMemoryRelation if this object extends AutoCloseable.
> Verification:
> In our own use case, we implemented a user-defined off-heap-hashRelation for 
> BHJ, and we verified that by adding this manual close, we can make sure our 
> defined off-heap-hashRelation can be released when evict is called.
> Also, we implemented a user-defined cachedBatch which will be released when 
> InMemoryRelation.clearCache() is called by this PR
> h3. Why are the changes needed?
> This change can help to clean up some off-heap user-defined objects that may be 
> cached in InMemoryRelation or MemoryStore
> h3. Does this PR introduce _any_ user-facing change?
> NO
> h3. How was this patch tested?
> WIP
> Signed-off-by: Chendi Xue [chendi@intel.com|mailto:chendi@intel.com]
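
As an illustration of the close-on-evict idea described above, the eviction path could do roughly the following; this is only a sketch of the described technique with illustrative names, not the actual change from the pull request:

{code:scala}
// Sketch: when an entry is removed from the store, close it explicitly if it
// implements AutoCloseable instead of waiting for GC. Names are illustrative.
def releaseEntry(entry: AnyRef): Unit = entry match {
  case closeable: AutoCloseable =>
    try closeable.close()
    catch { case e: Exception => println(s"ignoring failure while closing: $e") }
  case _ =>
    // plain on-heap entry: nothing to do, GC will reclaim it
}
{code}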



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35396) Support to manual close/release entries in MemoryStore and InMemoryRelation instead of replying on GC

2021-05-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35396:


Assignee: Apache Spark

> Support to manual close/release entries in MemoryStore and InMemoryRelation 
> instead of replying on GC
> -
>
> Key: SPARK-35396
> URL: https://issues.apache.org/jira/browse/SPARK-35396
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Chendi.Xue
>Assignee: Apache Spark
>Priority: Major
>
> This PR is proposing an add-on to support manually closing entries in 
> MemoryStore and InMemoryRelation
> h3. What changes were proposed in this pull request?
> Currently:
> MemoryStore uses a LinkedHashMap[BlockId, MemoryEntry[_]] to store all OnHeap 
> or OffHeap entries.
> And when memoryStore.remove(blockId) is called, the code will simply remove one 
> entry from the LinkedHashMap and leverage Java GC to do the release work.
> This PR:
> We are proposing an add-on to manually close any object stored in MemoryStore 
> and InMemoryRelation if this object extends AutoCloseable.
> Verification:
> In our own use case, we implemented a user-defined off-heap-hashRelation for 
> BHJ, and we verified that by adding this manual close, we can make sure our 
> defined off-heap-hashRelation can be released when evict is called.
> Also, we implemented a user-defined cachedBatch which will be released when 
> InMemoryRelation.clearCache() is called by this PR
> h3. Why are the changes needed?
> This change can help to clean up some off-heap user-defined objects that may be 
> cached in InMemoryRelation or MemoryStore
> h3. Does this PR introduce _any_ user-facing change?
> NO
> h3. How was this patch tested?
> WIP
> Signed-off-by: Chendi Xue [chendi@intel.com|mailto:chendi@intel.com]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35396) Support to manual close/release entries in MemoryStore and InMemoryRelation instead of replying on GC

2021-05-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35396.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32534
[https://github.com/apache/spark/pull/32534]

> Support to manual close/release entries in MemoryStore and InMemoryRelation 
> instead of replying on GC
> -
>
> Key: SPARK-35396
> URL: https://issues.apache.org/jira/browse/SPARK-35396
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Chendi.Xue
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> This PR is proposing an add-on to support manually closing entries in 
> MemoryStore and InMemoryRelation
> h3. What changes were proposed in this pull request?
> Currently:
> MemoryStore uses a LinkedHashMap[BlockId, MemoryEntry[_]] to store all OnHeap 
> or OffHeap entries.
> And when memoryStore.remove(blockId) is called, the code will simply remove one 
> entry from the LinkedHashMap and leverage Java GC to do the release work.
> This PR:
> We are proposing an add-on to manually close any object stored in MemoryStore 
> and InMemoryRelation if this object extends AutoCloseable.
> Verification:
> In our own use case, we implemented a user-defined off-heap-hashRelation for 
> BHJ, and we verified that by adding this manual close, we can make sure our 
> defined off-heap-hashRelation can be released when evict is called.
> Also, we implemented a user-defined cachedBatch which will be released when 
> InMemoryRelation.clearCache() is called by this PR
> h3. Why are the changes needed?
> This change can help to clean up some off-heap user-defined objects that may be 
> cached in InMemoryRelation or MemoryStore
> h3. Does this PR introduce _any_ user-facing change?
> NO
> h3. How was this patch tested?
> WIP
> Signed-off-by: Chendi Xue [chendi@intel.com|mailto:chendi@intel.com]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35447) optimize skew join before coalescing shuffle partitions

2021-05-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35447:
---

Assignee: Wenchen Fan

> optimize skew join before coalescing shuffle partitions
> ---
>
> Key: SPARK-35447
> URL: https://issues.apache.org/jira/browse/SPARK-35447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35447) optimize skew join before coalescing shuffle partitions

2021-05-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35447.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32594
[https://github.com/apache/spark/pull/32594]

> optimize skew join before coalescing shuffle partitions
> ---
>
> Key: SPARK-35447
> URL: https://issues.apache.org/jira/browse/SPARK-35447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29223) Kafka source: offset by timestamp - allow specifying timestamp for "all partitions"

2021-05-25 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-29223.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32609
[https://github.com/apache/spark/pull/32609]

> Kafka source: offset by timestamp - allow specifying timestamp for "all 
> partitions"
> ---
>
> Key: SPARK-29223
> URL: https://issues.apache.org/jira/browse/SPARK-29223
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.2.0
>
>
> This issue is a follow-up of SPARK-26848.
> In SPARK-26848, we decided to open possibility to let end users set 
> individual timestamp per partition. But in many cases, specifying timestamp 
> represents the intention that we would want to go back to specific timestamp 
> and reprocess records, which should be applied to all topics and partitions.
> According to the format of 
> `startingOffsetsByTimestamp`/`endingOffsetsByTimestamp`, while it's not 
> intuitive to provide an option to set a global timestamp across topic, it's 
> still intuitive to provide an option to set a global timestamp across 
> partitions in a topic.
> This issue tracks the efforts to deal with this.
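
For context, the per-partition form that already exists looks like the sketch below (illustrative broker, topic and timestamp values); the change resolved here adds the ability to give a single timestamp that applies to all partitions of the subscribed topics instead of spelling each partition out:

{code:scala}
// Existing per-partition form (sketch, illustrative values): every partition of
// topicA is listed with its own starting timestamp in milliseconds.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topicA")
  .option("startingOffsetsByTimestamp",
    """{"topicA": {"0": 1621944000000, "1": 1621944000000}}""")
  .load()
{code}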



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29223) Kafka source: offset by timestamp - allow specifying timestamp for "all partitions"

2021-05-25 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-29223:


Assignee: Jungtaek Lim

> Kafka source: offset by timestamp - allow specifying timestamp for "all 
> partitions"
> ---
>
> Key: SPARK-29223
> URL: https://issues.apache.org/jira/browse/SPARK-29223
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
>
> This issue is a follow-up of SPARK-26848.
> In SPARK-26848, we decided to open possibility to let end users set 
> individual timestamp per partition. But in many cases, specifying timestamp 
> represents the intention that we would want to go back to specific timestamp 
> and reprocess records, which should be applied to all topics and partitions.
> According to the format of 
> `startingOffsetsByTimestamp`/`endingOffsetsByTimestamp`, while it's not 
> intuitive to provide an option to set a global timestamp across topic, it's 
> still intuitive to provide an option to set a global timestamp across 
> partitions in a topic.
> This issue tracks the efforts to deal with this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35504) count distinct asterisk

2021-05-25 Thread Nikolay Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolay Sokolov resolved SPARK-35504.
-
Resolution: Fixed

I could not fully comprehend what was written in the documentation.

The hints in the comments helped me figure it out.

> count distinct asterisk 
> 
>
> Key: SPARK-35504
> URL: https://issues.apache.org/jira/browse/SPARK-35504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: {code:java}
> uname -a
> Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>  
> {code:java}
> lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 18.04.4 LTS
> Release:  18.04
> Codename: bionic
> {code}
>  
> {code:java}
> /opt/spark/bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0
>   /_/
> Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
> Branch HEAD
> Compiled by user ubuntu on 2020-06-06T13:05:28Z
> Revision 3fdfce3120f307147244e5eaf46d61419a723d50
> Url https://gitbox.apache.org/repos/asf/spark.git
> Type --help for more information.
> {code}
> {code:java}
> lscpu
> Architecture:x86_64
> CPU op-mode(s):  32-bit, 64-bit
> Byte Order:  Little Endian
> CPU(s):  4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):   1
> NUMA node(s):1
> Vendor ID:   GenuineIntel
> CPU family:  6
> Model:   85
> Model name:  Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
> Stepping:7
> CPU MHz: 3602.011
> BogoMIPS:6000.01
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:1024K
> L3 cache:36608K
> NUMA node0 CPU(s):   0-3
> Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm 
> constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf 
> tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe 
> popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 
> 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms 
> invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd 
> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
> {code}
>  
>Reporter: Nikolay Sokolov
>Priority: Minor
> Attachments: SPARK-35504_first_query_plan.log, 
> SPARK-35504_second_query_plan.log
>
>
> Hi everyone,
> I hope you're well!
>  
> Today I came across a very interesting case when the result of the execution 
> of the algorithm for counting unique rows differs depending on the form 
> (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries.
> I still can't figure out on my own if this is a bug or a feature and I would 
> like to share what I found.
>  
> I run Spark SQL queries through the Thrift (and not only) connecting to the 
> Spark cluster. I use the DBeaver app to execute Spark SQL queries.
>  
> So, I have two identical Spark SQL queries from an algorithmic point of view 
> that return different results.
>  
> The first query:
> {code:sql}
> select count(distinct *) unique_amt from storage_datamart.olympiads
> ; -- Rows: 13437678
> {code}
>  
> The second query:
> {code:sql}
> select count(*) from (select distinct * from storage_datamart.olympiads)
> ; -- Rows: 36901430
> {code}
>  
> The result of the two queries is different. (But it must be the same, right!?)
> {code:sql}
> select 'The first query' description, count(distinct *) unique_amt from 
> storage_datamart.olympiads
>  union all
> select 'The second query', count(*) from (select distinct * from 
> storage_datamart.olympiads)
> ;
> {code}
>  
> The result of the above query is the following:
> {code:java}
> The first query13437678
> The second query   36901430
> {code}
>  
>  I can easily calculate the unique number of rows in the table:
> {code:sql}
> select count(*) from (
>   select student_id, olympiad_id, tour, grade
> from storage_datamart.olympiads
>group by student_id, olympiad_id, tour, grade
>   having count(*) = 1
> )
> ; -- Rows: 36901365
> {code}
>  
> The table DDL is the following:
> {code:sql}
> CREATE TABLE `storage_datamart`.`olympiads` (
>   `ptn_date` DATE,
>   `student_id` BIGINT,
>   `olympiad_id` STRING,
>   `grade` BIGINT,
>   `grade_type` STRING,
>   `tour` STRING,
>   `created_at` TIMESTAMP,
>   `created_at_local` TIMESTAMP,
>   `olympiad_num` BIGINT,
>   

[jira] [Comment Edited] (SPARK-35504) count distinct asterisk

2021-05-25 Thread Nikolay Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351018#comment-17351018
 ] 

Nikolay Sokolov edited comment on SPARK-35504 at 5/25/21, 12:35 PM:


I subtracted the number of rows containing NULL values in at least one column 
from the total number of rows in the table and got what I was looking for:

 
{code:sql}
select (select count(1) amt from storage_datamart.olympiads) - 
(
select count(1)
  from storage_datamart.olympiads
 where ptn_date is null
or student_id is null
or olympiad_id is null
or grade is null
or grade_type is null
or tour is null
or created_at is null
or created_at_local is null
or olympiad_num is null
or olympiad_name is null
or subject is null
or started_at is null
or ended_at is null
or region_id is null
or region_name is null
or municipality_name is null
or school_id is null
or school_name is null
or school_status is null
or oly_n_common is null
or num_day is null
or award_type is null
or new_student_legacy is null
or segment is null
or total_start is null
or total_end is null
or year_learn is null
or parent_id is null
or teacher_id is null
or parallel is null
or olympiad_type is null
)
;  -- 13437678
{code}
{code:sql}
select amt - 23463820 from (
select count(1) amt
  from storage_datamart.olympiads
)
;  -- 13437678
{code}
 

This is a feature that is documented.

I apologize.

I'll close this task.

 

Thank you!
  


was (Author: melchizedek13):
Subtracted from the number of all rows of the table the number of rows 
containing NULL values in at least one column and got what I was looking for:

 
{code:sql}
select (select count(1) amt from storage_datamart.olympiads) - 
(
select count(1)
  from storage_datamart.olympiads
 where ptn_date is null
or student_id is null
or olympiad_id is null
or grade is null
or grade_type is null
or tour is null
or created_at is null
or created_at_local is null
or olympiad_num is null
or olympiad_name is null
or subject is null
or started_at is null
or ended_at is null
or region_id is null
or region_name is null
or municipality_name is null
or school_id is null
or school_name is null
or school_status is null
or oly_n_common is null
or num_day is null
or award_type is null
or new_student_legacy is null
or segment is null
or total_start is null
or total_end is null
or year_learn is null
or parent_id is null
or teacher_id is null
or parallel is null
or olympiad_type is null
)
;  -- 13437678
{code}
{code:sql}
select amt - 23463820 from (
select count(1) amt
  from storage_datamart.olympiads
)
;  -- 13437678
{code}
 

Apparently this is a feature that is not documented.

I'll wait a day and close this task.
  

> count distinct asterisk 
> 
>
> Key: SPARK-35504
> URL: https://issues.apache.org/jira/browse/SPARK-35504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: {code:java}
> uname -a
> Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>  
> {code:java}
> lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 18.04.4 LTS
> Release:  18.04
> Codename: bionic
> {code}
>  
> {code:java}
> /opt/spark/bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0
>   /_/
> Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
> Branch HEAD
> Compiled by user ubuntu on 2020-06-06T13:05:28Z
> Revision 3fdfce3120f307147244e5eaf46d61419a723d50
> Url https://gitbox.apache.org/repos/asf/spark.git
> Type --help for more information.
> {code}
> {code:java}
> lscpu
> Architecture:x86_64
> CPU op-mode(s):  32-bit, 64-bit
> Byte Order:  Little Endian
> CPU(s):  4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):   1
> NUMA node(s):1
> Vendor ID:   GenuineIntel
> CPU family:  6
> Model:   85
> Model name:  Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
> Stepping:7
> CPU MHz: 3602.011
> BogoMIPS:6000.01
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:1024K
> L3 cache:36608K
> NUMA node0 CPU(s):   0-3
> Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx 

[jira] [Assigned] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35514:


Assignee: Apache Spark  (was: Gengliang Wang)

> Automatically update version index of DocSearch via release-tag.sh
> --
>
> Key: SPARK-35514
> URL: https://issues.apache.org/jira/browse/SPARK-35514
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Automatically update version index of DocSearch via release-tag.sh for 
> releasing new documentation site, instead of the current manual update.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351021#comment-17351021
 ] 

Apache Spark commented on SPARK-35514:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32662

> Automatically update version index of DocSearch via release-tag.sh
> --
>
> Key: SPARK-35514
> URL: https://issues.apache.org/jira/browse/SPARK-35514
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Automatically update version index of DocSearch via release-tag.sh for 
> releasing new documentation site, instead of the current manual update.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35514:


Assignee: Gengliang Wang  (was: Apache Spark)

> Automatically update version index of DocSearch via release-tag.sh
> --
>
> Key: SPARK-35514
> URL: https://issues.apache.org/jira/browse/SPARK-35514
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Automatically update version index of DocSearch via release-tag.sh for 
> releasing new documentation site, instead of the current manual update.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35504) count distinct asterisk

2021-05-25 Thread Nikolay Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351018#comment-17351018
 ] 

Nikolay Sokolov commented on SPARK-35504:
-

I subtracted the number of rows containing NULL values in at least one column 
from the total number of rows in the table and got what I was looking for:

 
{code:sql}
select (select count(1) amt from storage_datamart.olympiads) - 
(
select count(1)
  from storage_datamart.olympiads
 where ptn_date is null
or student_id is null
or olympiad_id is null
or grade is null
or grade_type is null
or tour is null
or created_at is null
or created_at_local is null
or olympiad_num is null
or olympiad_name is null
or subject is null
or started_at is null
or ended_at is null
or region_id is null
or region_name is null
or municipality_name is null
or school_id is null
or school_name is null
or school_status is null
or oly_n_common is null
or num_day is null
or award_type is null
or new_student_legacy is null
or segment is null
or total_start is null
or total_end is null
or year_learn is null
or parent_id is null
or teacher_id is null
or parallel is null
or olympiad_type is null
)
;  -- 13437678
{code}
{code:sql}
select amt - 23463820 from (
select count(1) amt
  from storage_datamart.olympiads
)
;  -- 13437678
{code}
 

Apparently this is a feature that is not documented.

I'll wait a day and close this task.
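
The behaviour at play can be reproduced on a tiny data set; a spark-shell sketch like the one below shows count(distinct ...) skipping the rows that contain a NULL, while the distinct subquery keeps them:

{code:scala}
// Sketch: count(DISTINCT col1, col2) ignores rows where any listed column is
// NULL, whereas a DISTINCT subquery treats such rows as ordinary distinct rows.
import spark.implicits._

Seq((Some(1), Some("x")), (None, Some("y")), (None, Some("y")))
  .toDF("a", "b").createOrReplaceTempView("t")

spark.sql("select count(distinct a, b) from t").show()                  // 1
spark.sql("select count(*) from (select distinct a, b from t)").show()  // 2
{code}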
  

> count distinct asterisk 
> 
>
> Key: SPARK-35504
> URL: https://issues.apache.org/jira/browse/SPARK-35504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: {code:java}
> uname -a
> Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>  
> {code:java}
> lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 18.04.4 LTS
> Release:  18.04
> Codename: bionic
> {code}
>  
> {code:java}
> /opt/spark/bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0
>   /_/
> Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
> Branch HEAD
> Compiled by user ubuntu on 2020-06-06T13:05:28Z
> Revision 3fdfce3120f307147244e5eaf46d61419a723d50
> Url https://gitbox.apache.org/repos/asf/spark.git
> Type --help for more information.
> {code}
> {code:java}
> lscpu
> Architecture:x86_64
> CPU op-mode(s):  32-bit, 64-bit
> Byte Order:  Little Endian
> CPU(s):  4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):   1
> NUMA node(s):1
> Vendor ID:   GenuineIntel
> CPU family:  6
> Model:   85
> Model name:  Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
> Stepping:7
> CPU MHz: 3602.011
> BogoMIPS:6000.01
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:1024K
> L3 cache:36608K
> NUMA node0 CPU(s):   0-3
> Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm 
> constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf 
> tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe 
> popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 
> 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms 
> invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd 
> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
> {code}
>  
>Reporter: Nikolay Sokolov
>Priority: Minor
> Attachments: SPARK-35504_first_query_plan.log, 
> SPARK-35504_second_query_plan.log
>
>
> Hi everyone,
> I hope you're well!
>  
> Today I came across a very interesting case when the result of the execution 
> of the algorithm for counting unique rows differs depending on the form 
> (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries.
> I still can't figure out on my own if this is a bug or a feature and I would 
> like to share what I found.
>  
> I run Spark SQL queries through the Thrift (and not only) connecting to the 
> Spark cluster. I use the DBeaver app to execute Spark SQL queries.
>  
> So, I have two identical Spark SQL queries from an algorithmic point of view 
> that return different results.
>  
> The first query:
> {code:sql}
> select count(distinct *) unique_amt from storage_datamart.olympiads
> ; -- Rows: 

[jira] [Commented] (SPARK-35515) TimestampType: OverflowError: mktime argument out of range

2021-05-25 Thread Martin Studer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351015#comment-17351015
 ] 

Martin Studer commented on SPARK-35515:
---

I'm happy to provide a PR if this seems like a sensible improvement.

> TimestampType: OverflowError: mktime argument out of range 
> ---
>
> Key: SPARK-35515
> URL: https://issues.apache.org/jira/browse/SPARK-35515
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: Martin Studer
>Priority: Minor
>
> This issue occurs, for example, when trying to create a data frame from 
> Python {{datetime}} objects that are "out of range" where "out of range" is 
> platform-dependent due to the use of 
> [{{time.mktime}}|https://docs.python.org/3/library/time.html#time.mktime] in 
> {{TimestampType.toInternal}}:
> {code}
> import datetime
> spark_session.createDataFrame([(datetime.datetime(, 12, 31, 0, 0),)])
> {code}
> A more direct way to reproduce the issue is by invoking 
> {{TimestampType.toInternal}} directly:
> {code}
> import datetime
> from pyspark.sql.types import TimestampType
> dt = datetime.datetime(, 12, 31, 0, 0)
> TimestampType().toInternal(dt)
> {code}
> The suggested improvement is to avoid using {{time.mktime}} to increase the 
> range of {{datetime}} values. A possible implementation may look as follows:
> {code}
> import datetime
> import pytz
> EPOCH_UTC = datetime.datetime(1970, 1, 1).replace(tzinfo=pytz.utc)
> LOCAL_TZ = datetime.datetime.now().astimezone().tzinfo
> def toInternal(dt):
>     if dt is not None:
>         dt = dt if dt.tzinfo else dt.replace(tzinfo=LOCAL_TZ)
>         dt_utc = dt.astimezone(pytz.utc)
>         td = dt_utc - EPOCH_UTC
>         return (td.days * 86400 + td.seconds) * 10 ** 6 + td.microseconds
> {code}
> This relies on the ability to derive the local timezone. Other mechanisms may 
> be used to what is suggested above.
> Test cases include:
> {code}
> dt1 = datetime.datetime(2021, 5, 25, 12, 23)
> dt2 = dt1.replace(tzinfo=pytz.timezone('Europe/Zurich'))
> dt3 = datetime.datetime(, 12, 31, 0, 0)
> dt4 = dt3.replace(tzinfo=pytz.timezone('Europe/Zurich'))
> toInternal(dt1) == TimestampType().toInternal(dt1)
> toInternal(dt2) == TimestampType().toInternal(dt2)
> toInternal(dt3) # TimestampType().toInternal(dt3) fails
> toInternal(dt4) == TimestampType().toInternal(dt4)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33121) Spark Streaming 3.1.1 hangs on shutdown

2021-05-25 Thread Dmitry Tverdokhleb (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351012#comment-17351012
 ] 

Dmitry Tverdokhleb commented on SPARK-33121:


L. C. Hsieh, have you tested this case by sending a SIGTERM signal while the "for 
each" operation is sleeping?

> Spark Streaming 3.1.1 hangs on shutdown
> ---
>
> Key: SPARK-33121
> URL: https://issues.apache.org/jira/browse/SPARK-33121
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 3.1.1
>Reporter: Dmitry Tverdokhleb
>Priority: Major
>  Labels: Streaming, hang, shutdown
>
> Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem 
> in graceful shutdown.
> Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".
> Here is the code:
> {code:java}
> inputStream.foreachRDD {
>   rdd =>
> rdd.foreachPartition {
> Thread.sleep(5000)
> }
> }
> {code}
> I send a SIGTERM signal to stop the spark streaming and after sleeping an 
> exception arises:
> {noformat}
> streaming-agg-tds-data_1  | java.util.concurrent.RejectedExecutionException: 
> Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from 
> java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach$(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach(IterableLike.scala:74)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> streaming-agg-tds-data_1  | at java.lang.Thread.run(Thread.java:748)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 WARN  JobGenerator - Timed 
> out while stopping the job generator (timeout = 1)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Waited 
> for jobs to be processed and checkpoints to be written
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Stopped 
> JobGenerator{noformat}
> After this exception and "JobGenerator - Stopped JobGenerator" log, streaming 
> freezes, and halts by timeout (Config parameter 
> "hadoop.service.shutdown.timeout").
> Besides, there is no problem with the graceful shutdown in spark 2.4.5.
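
For reference, the config above corresponds to the shutdown hook requesting a graceful stop of the StreamingContext; a sketch of what the explicit call is expected to look like (an assumption about the hook's behaviour, shown only for illustration):

{code:scala}
// Sketch: the explicit equivalent of a graceful stop, as enabled by
// spark.streaming.stopGracefullyOnShutdown=true.
ssc.stop(stopSparkContext = true, stopGracefully = true)
{code}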



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35515) TimestampType: OverflowError: mktime argument out of range

2021-05-25 Thread Martin Studer (Jira)
Martin Studer created SPARK-35515:
-

 Summary: TimestampType: OverflowError: mktime argument out of 
range 
 Key: SPARK-35515
 URL: https://issues.apache.org/jira/browse/SPARK-35515
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.1
Reporter: Martin Studer


This issue occurs, for example, when trying to create a data frame from Python 
{{datetime}} objects that are "out of range" where "out of range" is 
platform-dependent due to the use of 
[{{time.mktime}}|https://docs.python.org/3/library/time.html#time.mktime] in 
{{TimestampType.toInternal}}:

{code}
import datetime
spark_session.createDataFrame([(datetime.datetime(, 12, 31, 0, 0),)])
{code}

A more direct way to reproduce the issue is by invoking 
{{TimestampType.toInternal}} directly:
{code}
import datetime
from pyspark.sql.types import TimestampType
dt = datetime.datetime(, 12, 31, 0, 0)
TimestampType().toInternal(dt)
{code}

The suggested improvement is to avoid using {{time.mktime}} to increase the 
range of {{datetime}} values. A possible implementation may look as follows:

{code}
import datetime
import pytz

EPOCH_UTC = datetime.datetime(1970, 1, 1).replace(tzinfo=pytz.utc)
LOCAL_TZ = datetime.datetime.now().astimezone().tzinfo

def toInternal(dt):
    if dt is not None:
        dt = dt if dt.tzinfo else dt.replace(tzinfo=LOCAL_TZ)
        dt_utc = dt.astimezone(pytz.utc)
        td = dt_utc - EPOCH_UTC
        return (td.days * 86400 + td.seconds) * 10 ** 6 + td.microseconds
{code}

This relies on the ability to derive the local timezone. Mechanisms other than the 
one suggested above may be used.

Test cases include:
{code}
dt1 = datetime.datetime(2021, 5, 25, 12, 23)
dt2 = dt1.replace(tzinfo=pytz.timezone('Europe/Zurich'))
dt3 = datetime.datetime(, 12, 31, 0, 0)
dt4 = dt3.replace(tzinfo=pytz.timezone('Europe/Zurich'))

toInternal(dt1) == TimestampType().toInternal(dt1)
toInternal(dt2) == TimestampType().toInternal(dt2)
toInternal(dt3) # TimestampType().toInternal(dt3) fails
toInternal(dt4) == TimestampType().toInternal(dt4)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35504) count distinct asterisk

2021-05-25 Thread Nikolay Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351010#comment-17351010
 ] 

Nikolay Sokolov commented on SPARK-35504:
-

It's really close to the true value:

{code:sql}
select 36901430 - 23463820
-- 13437610
{code}

[~hyukjin.kwon] thank you!
  

> count distinct asterisk 
> 
>
> Key: SPARK-35504
> URL: https://issues.apache.org/jira/browse/SPARK-35504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: {code:java}
> uname -a
> Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>  
> {code:java}
> lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 18.04.4 LTS
> Release:  18.04
> Codename: bionic
> {code}
>  
> {code:java}
> /opt/spark/bin/spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0
>   /_/
> Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
> Branch HEAD
> Compiled by user ubuntu on 2020-06-06T13:05:28Z
> Revision 3fdfce3120f307147244e5eaf46d61419a723d50
> Url https://gitbox.apache.org/repos/asf/spark.git
> Type --help for more information.
> {code}
> {code:java}
> lscpu
> Architecture:x86_64
> CPU op-mode(s):  32-bit, 64-bit
> Byte Order:  Little Endian
> CPU(s):  4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):   1
> NUMA node(s):1
> Vendor ID:   GenuineIntel
> CPU family:  6
> Model:   85
> Model name:  Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
> Stepping:7
> CPU MHz: 3602.011
> BogoMIPS:6000.01
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:1024K
> L3 cache:36608K
> NUMA node0 CPU(s):   0-3
> Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm 
> constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf 
> tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe 
> popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 
> 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms 
> invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd 
> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
> {code}
>  
>Reporter: Nikolay Sokolov
>Priority: Minor
> Attachments: SPARK-35504_first_query_plan.log, 
> SPARK-35504_second_query_plan.log
>
>
> Hi everyone,
> I hope you're well!
>  
> Today I came across a very interesting case when the result of the execution 
> of the algorithm for counting unique rows differs depending on the form 
> (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries.
> I still can't figure out on my own if this is a bug or a feature and I would 
> like to share what I found.
>  
> I run Spark SQL queries through the Thrift (and not only) connecting to the 
> Spark cluster. I use the DBeaver app to execute Spark SQL queries.
>  
> So, I have two identical Spark SQL queries from an algorithmic point of view 
> that return different results.
>  
> The first query:
> {code:sql}
> select count(distinct *) unique_amt from storage_datamart.olympiads
> ; -- Rows: 13437678
> {code}
>  
> The second query:
> {code:sql}
> select count(*) from (select distinct * from storage_datamart.olympiads)
> ; -- Rows: 36901430
> {code}
>  
> The result of the two queries is different. (But it must be the same, right!?)
> {code:sql}
> select 'The first query' description, count(distinct *) unique_amt from 
> storage_datamart.olympiads
>  union all
> select 'The second query', count(*) from (select distinct * from 
> storage_datamart.olympiads)
> ;
> {code}
>  
> The result of the above query is the following:
> {code:java}
> The first query13437678
> The second query   36901430
> {code}
>  
>  I can easily calculate the unique number of rows in the table:
> {code:sql}
> select count(*) from (
>   select student_id, olympiad_id, tour, grade
> from storage_datamart.olympiads
>group by student_id, olympiad_id, tour, grade
>   having count(*) = 1
> )
> ; -- Rows: 36901365
> {code}
>  
> The table DDL is the following:
> {code:sql}
> CREATE TABLE `storage_datamart`.`olympiads` (
>   `ptn_date` DATE,
>   `student_id` BIGINT,
>   `olympiad_id` STRING,
>   `grade` BIGINT,
>   `grade_type` STRING,
>   `tour` STRING,
>   `created_at` TIMESTAMP,
>   `created_at_local` TIMESTAMP,
>  

[jira] [Comment Edited] (SPARK-35504) count distinct asterisk

2021-05-25 Thread Nikolay Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351008#comment-17351008
 ] 

Nikolay Sokolov edited comment on SPARK-35504 at 5/25/21, 11:53 AM:


[~hyukjin.kwon] thanks for the hint!

 

I've just counted the rows that have a NULL value in any column, using the following script:
{code:sql}
select count(1)
  from storage_datamart.olympiads
 where ptn_date is null
or student_id is null
or olympiad_id is null
or grade is null
or grade_type is null
or tour is null
or created_at is null
or created_at_local is null
or olympiad_num is null
or olympiad_name is null
or subject is null
or started_at is null
or ended_at is null
or region_id is null
or region_name is null
or municipality_name is null
or school_id is null
or school_name is null
or school_status is null
or oly_n_common is null
or num_day is null
or award_type is null
or new_student_legacy is null
or segment is null
or total_start is null
or total_end is null
or year_learn is null
or parent_id is null
or teacher_id is null
or parallel is null
or olympiad_type is null
;
{code}
 

I've got 23463820 rows.


was (Author: melchizedek13):
[~hyukjin.kwon] thanks for the hint!

 

I've just counted any nulls column's value by using the following script:
{code:sql}
select count(1)
  from storage_datamart.olympiads
 where ptn_date is null
or student_id is null
or olympiad_id is null
or grade is null
or grade_type is null
or tour is null
or created_at is null
or created_at_local is null
or olympiad_num is null
or olympiad_name is null
or subject is null
or started_at is null
or ended_at is null
or region_id is null
or region_name is null
or municipality_name is null
or school_id is null
or school_name is null
or school_status is null
or oly_n_common is null
or num_day is null
or award_type is null
or new_student_legacy is null
or segment is null
or total_start is null
or total_end is null
or year_learn is null
or parent_id is null
or teacher_id is null
or parallel is null
or olympiad_type is null
;
{code}
I've got 23463820 rows.

 

This value differs from 13437678 & 36901430.

> count distinct asterisk 
> 
>
> Key: SPARK-35504
> URL: https://issues.apache.org/jira/browse/SPARK-35504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: {code:java}
> uname -a
> Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>  
> {code:java}
> lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 18.04.4 LTS
> Release:  18.04
> Codename: bionic
> {code}
>  
> {code:java}
> /opt/spark/bin/spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
>       /_/
> Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
> Branch HEAD
> Compiled by user ubuntu on 2020-06-06T13:05:28Z
> Revision 3fdfce3120f307147244e5eaf46d61419a723d50
> Url https://gitbox.apache.org/repos/asf/spark.git
> Type --help for more information.
> {code}
> {code:java}
> lscpu
> Architecture:x86_64
> CPU op-mode(s):  32-bit, 64-bit
> Byte Order:  Little Endian
> CPU(s):  4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):   1
> NUMA node(s):1
> Vendor ID:   GenuineIntel
> CPU family:  6
> Model:   85
> Model name:  Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
> Stepping:7
> CPU MHz: 3602.011
> BogoMIPS:6000.01
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:1024K
> L3 cache:36608K
> NUMA node0 CPU(s):   0-3
> Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm 
> constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf 
> tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe 
> popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 
> 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms 
> invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd 
> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
> {code}
>  
>Reporter: Nikolay Sokolov
>Priority: Minor
> Attachments: 

[jira] [Created] (SPARK-35514) Automatically update version index of DocSearch via release-tag.sh

2021-05-25 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-35514:
--

 Summary: Automatically update version index of DocSearch via 
release-tag.sh
 Key: SPARK-35514
 URL: https://issues.apache.org/jira/browse/SPARK-35514
 Project: Spark
  Issue Type: Task
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Automatically update the version index of DocSearch via release-tag.sh when 
releasing a new documentation site, instead of the current manual update.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35504) count distinct asterisk

2021-05-25 Thread Nikolay Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351008#comment-17351008
 ] 

Nikolay Sokolov commented on SPARK-35504:
-

[~hyukjin.kwon] thanks for the hint!

 

I've just counted the rows where any column value is null, using the following script:
{code:sql}
select count(1)
  from storage_datamart.olympiads
 where ptn_date is null
or student_id is null
or olympiad_id is null
or grade is null
or grade_type is null
or tour is null
or created_at is null
or created_at_local is null
or olympiad_num is null
or olympiad_name is null
or subject is null
or started_at is null
or ended_at is null
or region_id is null
or region_name is null
or municipality_name is null
or school_id is null
or school_name is null
or school_status is null
or oly_n_common is null
or num_day is null
or award_type is null
or new_student_legacy is null
or segment is null
or total_start is null
or total_end is null
or year_learn is null
or parent_id is null
or teacher_id is null
or parallel is null
or olympiad_type is null
;
{code}
I've got 23463820 rows.

 

This value differs from 13437678 & 36901430.
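
For context, the gap between the two totals is consistent with the usual null-handling difference between the two query forms: count(DISTINCT col1, col2, ...) skips every row in which at least one of the listed columns is null, while SELECT DISTINCT keeps such rows as distinct values, so the outer count( * ) still sees them. A minimal sketch of that behavior, using a hypothetical inline table t(a, b) rather than the real olympiads table:
{code:sql}
-- count(DISTINCT a, b) ignores the two rows that contain a NULL,
-- so only the fully non-null tuple (1, 1) is counted: result 1.
SELECT count(DISTINCT a, b)
  FROM VALUES (1, 1), (1, NULL), (NULL, NULL) AS t(a, b);

-- SELECT DISTINCT keeps the NULL-bearing rows as distinct values,
-- so the outer count(*) sees three rows: result 3.
SELECT count(*)
  FROM (SELECT DISTINCT a, b
          FROM VALUES (1, 1), (1, NULL), (NULL, NULL) AS t(a, b));
{code}
If that is what is happening here, 13437678 would be the number of distinct rows containing no null column at all, which would line up with the large number of null-containing rows counted above.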

> count distinct asterisk 
> 
>
> Key: SPARK-35504
> URL: https://issues.apache.org/jira/browse/SPARK-35504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: {code:java}
> uname -a
> Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>  
> {code:java}
> lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 18.04.4 LTS
> Release:  18.04
> Codename: bionic
> {code}
>  
> {code:java}
> /opt/spark/bin/spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
>       /_/
> Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
> Branch HEAD
> Compiled by user ubuntu on 2020-06-06T13:05:28Z
> Revision 3fdfce3120f307147244e5eaf46d61419a723d50
> Url https://gitbox.apache.org/repos/asf/spark.git
> Type --help for more information.
> {code}
> {code:java}
> lscpu
> Architecture:x86_64
> CPU op-mode(s):  32-bit, 64-bit
> Byte Order:  Little Endian
> CPU(s):  4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):   1
> NUMA node(s):1
> Vendor ID:   GenuineIntel
> CPU family:  6
> Model:   85
> Model name:  Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
> Stepping:7
> CPU MHz: 3602.011
> BogoMIPS:6000.01
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:1024K
> L3 cache:36608K
> NUMA node0 CPU(s):   0-3
> Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm 
> constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf 
> tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe 
> popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 
> 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms 
> invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd 
> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
> {code}
>  
>Reporter: Nikolay Sokolov
>Priority: Minor
> Attachments: SPARK-35504_first_query_plan.log, 
> SPARK-35504_second_query_plan.log
>
>
> Hi everyone,
> I hope you're well!
>  
> Today I came across a very interesting case where the result of counting 
> unique rows differs depending on the form of the Spark SQL query 
> (count(distinct *) vs count( * ) over a derived table with distinct).
> I still can't figure out on my own whether this is a bug or a feature, so I 
> would like to share what I found.
>  
> I run Spark SQL queries through the Thrift server (among other ways) 
> connected to the Spark cluster, and I use the DBeaver app to execute them.
>  
> So, I have two Spark SQL queries that are identical from an algorithmic 
> point of view yet return different results.
>  
> The first query:
> {code:sql}
> select count(distinct *) unique_amt from storage_datamart.olympiads
> ; -- Rows: 13437678
> {code}
>  
> The second query:
> {code:sql}
> select count(*) from (select distinct * from storage_datamart.olympiads)
> ; -- Rows: 36901430
> {code}
>  
> The result of the two queries is different. (But it must be the same, right!?)
> {code:sql}
> select 

[jira] [Commented] (SPARK-35513) Upgrade joda-time to 2.10.10

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351004#comment-17351004
 ] 

Apache Spark commented on SPARK-35513:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/32661

> Upgrade joda-time to 2.10.10
> 
>
> Key: SPARK-35513
> URL: https://issues.apache.org/jira/browse/SPARK-35513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Vinod KC
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35513) Upgrade joda-time to 2.10.10

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35513:


Assignee: (was: Apache Spark)

> Upgrade joda-time to 2.10.10
> 
>
> Key: SPARK-35513
> URL: https://issues.apache.org/jira/browse/SPARK-35513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Vinod KC
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35513) Upgrade joda-time to 2.10.10

2021-05-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35513:


Assignee: Apache Spark

> Upgrade joda-time to 2.10.10
> 
>
> Key: SPARK-35513
> URL: https://issues.apache.org/jira/browse/SPARK-35513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Vinod KC
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35513) Upgrade joda-time to 2.10.10

2021-05-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351002#comment-17351002
 ] 

Apache Spark commented on SPARK-35513:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/32661

> Upgrade joda-time to 2.10.10
> 
>
> Key: SPARK-35513
> URL: https://issues.apache.org/jira/browse/SPARK-35513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Vinod KC
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


