[jira] [Assigned] (SPARK-33765) Migrate UNCACHE TABLE to new resolution framework

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33765:


Assignee: (was: Apache Spark)

> Migrate UNCACHE TABLE to new resolution framework
> -
>
> Key: SPARK-33765
> URL: https://issues.apache.org/jira/browse/SPARK-33765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Minor
>
> Migrate UNCACHE TABLE to new resolution framework






[jira] [Commented] (SPARK-33765) Migrate UNCACHE TABLE to new resolution framework

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248283#comment-17248283
 ] 

Apache Spark commented on SPARK-33765:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/30743

> Migrate UNCACHE TABLE to new resolution framework
> -
>
> Key: SPARK-33765
> URL: https://issues.apache.org/jira/browse/SPARK-33765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Minor
>
> Migrate UNCACHE TABLE to new resolution framework






[jira] [Assigned] (SPARK-33765) Migrate UNCACHE TABLE to new resolution framework

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33765:


Assignee: Apache Spark

> Migrate UNCACHE TABLE to new resolution framework
> -
>
> Key: SPARK-33765
> URL: https://issues.apache.org/jira/browse/SPARK-33765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Minor
>
> Migrate UNCACHE TABLE to new resolution framework






[jira] [Created] (SPARK-33765) Migrate UNCACHE TABLE to new resolution framework

2020-12-11 Thread Terry Kim (Jira)
Terry Kim created SPARK-33765:
-

 Summary: Migrate UNCACHE TABLE to new resolution framework
 Key: SPARK-33765
 URL: https://issues.apache.org/jira/browse/SPARK-33765
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Terry Kim


Migrate UNCACHE TABLE to new resolution framework






[jira] [Commented] (SPARK-33762) Bump commons-codec to latest version.

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248273#comment-17248273
 ] 

Apache Spark commented on SPARK-33762:
--

User 'n-marion' has created a pull request for this issue:
https://github.com/apache/spark/pull/30740

> Bump commons-codec to latest version. 
> --
>
> Key: SPARK-33762
> URL: https://issues.apache.org/jira/browse/SPARK-33762
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Nicholas Marion
>Priority: Major
>
> Currently Spark pulls in commons-codec version 1.10, which was released 6
> years ago. Some open-source scans have found a possible encoding/decoding
> concern related to versions prior to 1.13:
> [https://github.com/apache/commons-codec/commit/48b615756d1d770091ea3322eefc08011ee8b113]
> Upgrade to the latest version of commons-codec in order to include this fix.






[jira] [Assigned] (SPARK-33762) Bump commons-codec to latest version.

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33762:


Assignee: Apache Spark

> Bump commons-codec to latest version. 
> --
>
> Key: SPARK-33762
> URL: https://issues.apache.org/jira/browse/SPARK-33762
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Nicholas Marion
>Assignee: Apache Spark
>Priority: Major
>
> Currently Spark pulls in commons-codec version 1.10, which was released 6
> years ago. Some open-source scans have found a possible encoding/decoding
> concern related to versions prior to 1.13:
> [https://github.com/apache/commons-codec/commit/48b615756d1d770091ea3322eefc08011ee8b113]
> Upgrade to the latest version of commons-codec in order to include this fix.






[jira] [Assigned] (SPARK-33762) Bump commons-codec to latest version.

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33762:


Assignee: (was: Apache Spark)

> Bump commons-codec to latest version. 
> --
>
> Key: SPARK-33762
> URL: https://issues.apache.org/jira/browse/SPARK-33762
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Nicholas Marion
>Priority: Major
>
> Currently Spark pulls in commons-codec version 1.10, which was released 6
> years ago. Some open-source scans have found a possible encoding/decoding
> concern related to versions prior to 1.13:
> [https://github.com/apache/commons-codec/commit/48b615756d1d770091ea3322eefc08011ee8b113]
> Upgrade to the latest version of commons-codec in order to include this fix.






[jira] [Commented] (SPARK-32526) Let sql/catalyst module tests pass for Scala 2.13

2020-12-11 Thread Darcy Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248259#comment-17248259
 ] 

Darcy Shen commented on SPARK-32526:


ExpressionEncoderSuite still fails to work:

https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.1-test-maven-hadoop-2.7-scala-2.13/lastFailedBuild/consoleFull

However, the following command seems to work fine:

build/mvn -Pscala-2.13 -Dtest=none 
-DwildcardSuites=org.apache.spark.sql.catalyst.encoders.ExpressionEncoderSuite 
test -pl sql/catalyst -am

> Let sql/catalyst module tests pass for Scala 2.13
> -
>
> Key: SPARK-32526
> URL: https://issues.apache.org/jira/browse/SPARK-32526
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: failed-and-aborted-20200806
>
>
> The sql/catalyst module has the following compile errors with the scala-2.13 profile:
> {code:java}
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
>  required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
> {code}
> Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq
> on these to ensure they still work on 2.12.
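For illustration, here is a minimal, self-contained sketch of the 2.13 source-compatibility problem and the .toSeq fix described above (generic tuples stand in for the Attribute pairs built in Analyzer.scala):

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Stand-in for the (Attribute, Attribute) buffers accumulated in Analyzer.scala.
val pairs: ArrayBuffer[(String, String)] = ArrayBuffer(("a", "a#1"), ("b", "b#1"))

// Under Scala 2.13, scala.Seq aliases immutable.Seq, so a mutable ArrayBuffer no
// longer conforms to Seq and the assignment below fails to compile without the
// explicit .toSeq. On 2.12 the call is essentially free, so the same source
// builds under both profiles.
val asSeq: Seq[(String, String)] = pairs.toSeq
{code}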






[jira] [Assigned] (SPARK-33653) DSv2: REFRESH TABLE should recache the table itself

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33653:


Assignee: (was: Apache Spark)

> DSv2: REFRESH TABLE should recache the table itself
> ---
>
> Key: SPARK-33653
> URL: https://issues.apache.org/jira/browse/SPARK-33653
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> As "CACHE TABLE" is supported in DSv2 now, we should also recache the table 
> itself in "REFRESH TABLE" command, to match the behavior in DSv1.






[jira] [Assigned] (SPARK-33653) DSv2: REFRESH TABLE should recache the table itself

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33653:


Assignee: Apache Spark

> DSv2: REFRESH TABLE should recache the table itself
> ---
>
> Key: SPARK-33653
> URL: https://issues.apache.org/jira/browse/SPARK-33653
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> As "CACHE TABLE" is supported in DSv2 now, we should also recache the table 
> itself in "REFRESH TABLE" command, to match the behavior in DSv1.






[jira] [Commented] (SPARK-33653) DSv2: REFRESH TABLE should recache the table itself

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248257#comment-17248257
 ] 

Apache Spark commented on SPARK-33653:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30742

> DSv2: REFRESH TABLE should recache the table itself
> ---
>
> Key: SPARK-33653
> URL: https://issues.apache.org/jira/browse/SPARK-33653
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> As "CACHE TABLE" is supported in DSv2 now, we should also recache the table 
> itself in "REFRESH TABLE" command, to match the behavior in DSv1.






[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-11 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248241#comment-17248241
 ] 

Xinli Shang commented on SPARK-26345:
-

The Presto and Iceberg efforts are not tied to each other; there is just some
common code I can reuse. The Iceberg PR is
https://github.com/apache/iceberg/pull/1566 and the Presto issue is
https://github.com/prestodb/presto/issues/15454 (the PR is under development now).


> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201






[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-11 Thread James R. Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248236#comment-17248236
 ] 

James R. Taylor commented on SPARK-26345:
-

Thanks for the update, [~sha...@uber.com]. I had just read that blog and it 
does indeed look promising.

Is the Presto support you mentioned tied to Iceberg or is it independent of 
that? Any PRs I could follow along on?

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201






[jira] [Resolved] (SPARK-33729) When refreshing cache, Spark should not use cached plan when recaching data

2020-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33729.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30699
[https://github.com/apache/spark/pull/30699]

> When refreshing cache, Spark should not use cached plan when recaching data
> ---
>
> Key: SPARK-33729
> URL: https://issues.apache.org/jira/browse/SPARK-33729
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, when the cache is refreshed, e.g. via the "REFRESH TABLE" command,
> Spark calls the {{refreshTable}} method within {{CatalogImpl}}.
> {code}
>   override def refreshTable(tableName: String): Unit = {
>     val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
>     val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
>     val table = sparkSession.table(tableIdent)
>     if (tableMetadata.tableType == CatalogTableType.VIEW) {
>       // Temp or persistent views: refresh (or invalidate) any metadata/data cached
>       // in the plan recursively.
>       table.queryExecution.analyzed.refresh()
>     } else {
>       // Non-temp tables: refresh the metadata cache.
>       sessionCatalog.refreshTable(tableIdent)
>     }
>     // If this table is cached as an InMemoryRelation, drop the original
>     // cached version and make the new version cached lazily.
>     val cache = sparkSession.sharedState.cacheManager.lookupCachedData(table)
>     // uncache the logical plan.
>     // note this is a no-op for the table itself if it's not cached, but will invalidate all
>     // caches referencing this table.
>     sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true)
>     if (cache.nonEmpty) {
>       // save the cache name and cache level for recreation
>       val cacheName = cache.get.cachedRepresentation.cacheBuilder.tableName
>       val cacheLevel = cache.get.cachedRepresentation.cacheBuilder.storageLevel
>       // recache with the same name and cache level.
>       sparkSession.sharedState.cacheManager.cacheQuery(table, cacheName, cacheLevel)
>     }
>   }
> {code}
> Note that the {{table}} is created before the table relation cache is
> cleared, and used later in {{cacheQuery}}. This is incorrect since it still
> refers to the cached table relation, which could be stale.
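A hedged sketch of one possible fix consistent with the description above (not necessarily the merged change): re-resolve the table after the uncache step so the recache does not reuse the stale plan. Names follow the snippet quoted above.

{code:scala}
// ... after uncacheQuery(table, cascade = true) has invalidated the caches:
if (cache.nonEmpty) {
  val cacheName = cache.get.cachedRepresentation.cacheBuilder.tableName
  val cacheLevel = cache.get.cachedRepresentation.cacheBuilder.storageLevel
  // Re-resolve the table from the refreshed catalog instead of reusing the
  // `table` DataFrame that was captured before the cache was cleared.
  val freshTable = sparkSession.table(tableIdent)
  sparkSession.sharedState.cacheManager.cacheQuery(freshTable, cacheName, cacheLevel)
}
{code}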






[jira] [Assigned] (SPARK-33729) When refreshing cache, Spark should not use cached plan when recaching data

2020-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33729:
-

Assignee: Chao Sun

> When refreshing cache, Spark should not use cached plan when recaching data
> ---
>
> Key: SPARK-33729
> URL: https://issues.apache.org/jira/browse/SPARK-33729
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> Currently, when the cache is refreshed, e.g. via the "REFRESH TABLE" command,
> Spark calls the {{refreshTable}} method within {{CatalogImpl}}.
> {code}
>   override def refreshTable(tableName: String): Unit = {
>     val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
>     val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
>     val table = sparkSession.table(tableIdent)
>     if (tableMetadata.tableType == CatalogTableType.VIEW) {
>       // Temp or persistent views: refresh (or invalidate) any metadata/data cached
>       // in the plan recursively.
>       table.queryExecution.analyzed.refresh()
>     } else {
>       // Non-temp tables: refresh the metadata cache.
>       sessionCatalog.refreshTable(tableIdent)
>     }
>     // If this table is cached as an InMemoryRelation, drop the original
>     // cached version and make the new version cached lazily.
>     val cache = sparkSession.sharedState.cacheManager.lookupCachedData(table)
>     // uncache the logical plan.
>     // note this is a no-op for the table itself if it's not cached, but will invalidate all
>     // caches referencing this table.
>     sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true)
>     if (cache.nonEmpty) {
>       // save the cache name and cache level for recreation
>       val cacheName = cache.get.cachedRepresentation.cacheBuilder.tableName
>       val cacheLevel = cache.get.cachedRepresentation.cacheBuilder.storageLevel
>       // recache with the same name and cache level.
>       sparkSession.sharedState.cacheManager.cacheQuery(table, cacheName, cacheLevel)
>     }
>   }
> {code}
> Note that the {{table}} is created before the table relation cache is
> cleared, and used later in {{cacheQuery}}. This is incorrect since it still
> refers to the cached table relation, which could be stale.






[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-11 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248231#comment-17248231
 ] 

Xinli Shang commented on SPARK-26345:
-

Regarding performance, there is an engineering blog I found online, written by
Zoltán Borók-Nagy and Gábor Szádovszky:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/

Once Spark is on Parquet 1.11.x, we can work on column index support for the
Spark vectorized reader. Currently, I am working on integrating column indexes
into Iceberg and Presto. Local testing on Iceberg also looks promising.

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201






[jira] [Assigned] (SPARK-33764) Make state store maintenance interval as SQL config

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33764:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Make state store maintenance interval as SQL config
> ---
>
> Key: SPARK-33764
> URL: https://issues.apache.org/jira/browse/SPARK-33764
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently the maintenance interval is hard-coded in StateStore. It's better
> to expose it as a SQL config.






[jira] [Assigned] (SPARK-33764) Make state store maintenance interval as SQL config

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33764:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Make state store maintenance interval as SQL config
> ---
>
> Key: SPARK-33764
> URL: https://issues.apache.org/jira/browse/SPARK-33764
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Currently the maintenance interval is hard-coded in StateStore. It's better
> to expose it as a SQL config.






[jira] [Commented] (SPARK-33764) Make state store maintenance interval as SQL config

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248217#comment-17248217
 ] 

Apache Spark commented on SPARK-33764:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30741

> Make state store maintenance interval as SQL config
> ---
>
> Key: SPARK-33764
> URL: https://issues.apache.org/jira/browse/SPARK-33764
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently the maintenance interval is hard-coded in StateStore. It's better
> to expose it as a SQL config.






[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-11 Thread James R. Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248216#comment-17248216
 ] 

James R. Taylor commented on SPARK-26345:
-

Any updates on this issue, [~zi]? Wouldn't column indexes help performance
quite a bit, especially if the filtered column is clustered or sorted?

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201






[jira] [Created] (SPARK-33764) Make state store maintenance interval as SQL config

2020-12-11 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-33764:
---

 Summary: Make state store maintenance interval as SQL config
 Key: SPARK-33764
 URL: https://issues.apache.org/jira/browse/SPARK-33764
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Currently the maintenance interval is hard-coded in StateStore. It's better to
expose it as a SQL config.
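For illustration, a minimal sketch of how the proposed knob might be used once it exists; the config key below reflects the ticket's intent and is an assumption, not a confirmed name:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("state-store-maintenance-interval-demo")
  // Assumed key: a SQL config controlling how often StateStore maintenance runs,
  // replacing the interval currently hard-coded in StateStore.
  .config("spark.sql.streaming.stateStore.maintenanceInterval", "60s")
  .getOrCreate()
{code}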






[jira] [Commented] (SPARK-32617) Upgrade kubernetes client version to support latest minikube version.

2020-12-11 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248168#comment-17248168
 ] 

Attila Zsolt Piros commented on SPARK-32617:


I will have a PR for this soon.

> Upgrade kubernetes client version to support latest minikube version.
> -
>
> Key: SPARK-32617
> URL: https://issues.apache.org/jira/browse/SPARK-32617
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> The following error occurs when the k8s integration tests are run against a
> minikube cluster with version 1.2.1:
> {code:java}
> Run starting. Expected test count is: 18
> KubernetesSuite:
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED ***
>   io.fabric8.kubernetes.client.KubernetesClientException: An error has 
> occurred.
>   at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>   at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
>   at 
> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:196)
>   at 
> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:62)
>   at io.fabric8.kubernetes.client.BaseClient.<init>(BaseClient.java:51)
>   at 
> io.fabric8.kubernetes.client.DefaultKubernetesClient.<init>(DefaultKubernetesClient.java:105)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:81)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:131)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   ...
>   Cause: java.nio.file.NoSuchFileException: /root/.minikube/apiserver.crt
>   at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
>   at java.nio.file.Files.newByteChannel(Files.java:361)
>   at java.nio.file.Files.newByteChannel(Files.java:407)
>   at java.nio.file.Files.readAllBytes(Files.java:3152)
>   at 
> io.fabric8.kubernetes.client.internal.CertUtils.getInputStreamFromDataOrFile(CertUtils.java:72)
>   at 
> io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:242)
>   at 
> io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128)
>   ...
> Run completed in 1 second, 821 milliseconds.
> Total number of tests run: 0
> Suites: completed 1, aborted 1
> Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
> *** 1 SUITE ABORTED ***
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.1.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  4.454 
> s]
> [INFO] Spark Project Tags . SUCCESS [  4.768 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  2.961 
> s]
> [INFO] Spark Project Networking ... SUCCESS [  4.258 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  5.703 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  3.239 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  3.224 
> s]
> [INFO] Spark Project Core . SUCCESS [02:25 
> min]
> [INFO] Spark Project Kubernetes Integration Tests . FAILURE [ 17.244 
> s]
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time:  03:12 min
> [INFO] Finished at: 2020-08-11T06:26:15-05:00
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.scalatest:scalatest-maven-plugin:2.0.0:test (integration-test) on project 
> spark-kubernetes-integration-tests_2.12: There are test failures -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] 

[jira] [Commented] (SPARK-33731) Standardize exception types

2020-12-11 Thread Shril Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248156#comment-17248156
 ] 

Shril Kumar commented on SPARK-33731:
-

Most welcome [~hyukjin.kwon]. :)

> Standardize exception types
> ---
>
> Key: SPARK-33731
> URL: https://issues.apache.org/jira/browse/SPARK-33731
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should:
> - have a better hierarchy for exception types
> - or at least use the default type of exceptions correctly instead of just 
> throwing a plain Exception.






[jira] [Commented] (SPARK-33730) Standardize warning types

2020-12-11 Thread Shril Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248155#comment-17248155
 ] 

Shril Kumar commented on SPARK-33730:
-

[~hyukjin.kwon], thank you for your response. Looking forward to your details 
on [SPARK-33731|https://issues.apache.org/jira/browse/SPARK-33731].

[~zero323], please proceed. All the best. :)

> Standardize warning types
> -
>
> Key: SPARK-33730
> URL: https://issues.apache.org/jira/browse/SPARK-33730
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should use warnings properly per 
> [https://docs.python.org/3/library/warnings.html#warning-categories]
> In particular,
>  - we should use {{FutureWarning}} instead of {{DeprecationWarning}} for the 
> places we should show the warnings to end-users by default.
>  - we should __maybe__ think about customizing stacklevel 
> ([https://docs.python.org/3/library/warnings.html#warnings.warn]) like pandas 
> does.
>  - ...
> Current warnings are a bit messy and somewhat arbitrary.
> To be more explicit, we'll have to fix:
> {code:java}
> pyspark/context.py:warnings.warn(
> pyspark/context.py:warnings.warn(
> pyspark/ml/classification.py:warnings.warn("weightCol is 
> ignored, "
> pyspark/ml/clustering.py:warnings.warn("Deprecated in 3.0.0. It will 
> be removed in future versions. Use "
> pyspark/mllib/classification.py:warnings.warn(
> pyspark/mllib/feature.py:warnings.warn("Both withMean and withStd 
> are false. The model does nothing.")
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/rdd.py:warnings.warn("mapPartitionsWithSplit is deprecated; "
> pyspark/rdd.py:warnings.warn(
> pyspark/shell.py:warnings.warn("Failed to initialize Spark session.")
> pyspark/shuffle.py:warnings.warn("Please install psutil to have 
> better "
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn("to_replace is a dict 
> and value is not None. value will be ignored.")
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use degrees 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use radians 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use 
> approx_count_distinct instead.", DeprecationWarning)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/functions.py:warnings.warn(
> pyspark/sql/pandas/group_ops.py:warnings.warn(
> pyspark/sql/session.py:warnings.warn("Fall back to non-hive 
> support because failing to access HiveConf, "
> {code}
> PySpark also prints warnings using {{print}} in some places. We should
> see whether those should be switched to {{warnings.warn}} as well.






[jira] [Created] (SPARK-33763) Add metrics for better tracking of dynamic allocation

2020-12-11 Thread Holden Karau (Jira)
Holden Karau created SPARK-33763:


 Summary: Add metrics for better tracking of dynamic allocation
 Key: SPARK-33763
 URL: https://issues.apache.org/jira/browse/SPARK-33763
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Holden Karau


We should add metrics to track the following:

1- Graceful decommissions & DA scheduled deletes

2- Jobs resubmitted

3- Fetch failures

4- Unexpected (e.g. non-Spark triggered) executor removals.






[jira] [Commented] (SPARK-33752) Avoid the getSimpleMessage of AnalysisException adds semicolon repeatedly

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248121#comment-17248121
 ] 

Apache Spark commented on SPARK-33752:
--

User 'n-marion' has created a pull request for this issue:
https://github.com/apache/spark/pull/30740

> Avoid the getSimpleMessage of AnalysisException adds semicolon repeatedly
> -
>
> Key: SPARK-33752
> URL: https://issues.apache.org/jira/browse/SPARK-33752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current getSimpleMessage of AnalysisException may add a semicolon
> repeatedly. Here is an example:
> {code:java}
> select decode()
> {code}
> The output will be:
> {code:java}
> org.apache.spark.sql.AnalysisException
> Invalid number of arguments for function decode. Expected: 2; Found: 0;; line 
> 1 pos 7
> {code}
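For illustration, a tiny self-contained sketch of the underlying problem and an obvious guard; this is only a sketch of the idea, the real method lives on AnalysisException:

{code:scala}
// getSimpleMessage-style concatenation: blindly appending "; " duplicates the
// semicolon whenever the base message already ends with one ("Found: 0;" above).
def withPosition(message: String, position: String): String =
  if (message.endsWith(";")) s"$message $position"
  else s"$message; $position"

val msg = withPosition(
  "Invalid number of arguments for function decode. Expected: 2; Found: 0;",
  "line 1 pos 7")
// => "... Expected: 2; Found: 0; line 1 pos 7"  (no repeated semicolon)
{code}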






[jira] [Updated] (SPARK-33762) Bump commons-codec to latest version.

2020-12-11 Thread Nicholas Marion (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Marion updated SPARK-33762:

Description: 
Currently Spark pulls in commons-codec version 1.10, which was released 6 years
ago. Some open-source scans have found a possible encoding/decoding concern
related to versions prior to 1.13:

[https://github.com/apache/commons-codec/commit/48b615756d1d770091ea3322eefc08011ee8b113]

Upgrade to the latest version of commons-codec in order to include this fix.

  was:
Currently Spark pulls in commons-codec version 1.12, which was released 2 years
ago. Some open-source scans have found a possible encoding/decoding concern
related to versions prior to 1.13:

[https://github.com/apache/commons-codec/commit/48b615756d1d770091ea3322eefc08011ee8b113]

Upgrade to the latest version of commons-codec in order to include this fix.


> Bump commons-codec to latest version. 
> --
>
> Key: SPARK-33762
> URL: https://issues.apache.org/jira/browse/SPARK-33762
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Nicholas Marion
>Priority: Major
>
> Currently Spark pulls in commons-codec version 1.10, which was released 6
> years ago. Some open-source scans have found a possible encoding/decoding
> concern related to versions prior to 1.13:
> [https://github.com/apache/commons-codec/commit/48b615756d1d770091ea3322eefc08011ee8b113]
> Upgrade to the latest version of commons-codec in order to include this fix.






[jira] [Created] (SPARK-33762) Bump commons-codec to latest version.

2020-12-11 Thread Nicholas Marion (Jira)
Nicholas Marion created SPARK-33762:
---

 Summary: Bump commons-codec to latest version. 
 Key: SPARK-33762
 URL: https://issues.apache.org/jira/browse/SPARK-33762
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 3.0.1, 2.4.7
Reporter: Nicholas Marion


Currently Spark pulls in commons-codec version 1.12, which was released 2 years
ago. Some open-source scans have found a possible encoding/decoding concern
related to versions prior to 1.13:

[https://github.com/apache/commons-codec/commit/48b615756d1d770091ea3322eefc08011ee8b113]

Upgrade to the latest version of commons-codec in order to include this fix.






[jira] [Commented] (SPARK-22256) Introduce spark.mesos.driver.memoryOverhead

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248112#comment-17248112
 ] 

Apache Spark commented on SPARK-22256:
--

User 'dmcwhorter' has created a pull request for this issue:
https://github.com/apache/spark/pull/30739

> Introduce spark.mesos.driver.memoryOverhead 
> 
>
> Key: SPARK-22256
> URL: https://issues.apache.org/jira/browse/SPARK-22256
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Cosmin Lehene
>Priority: Minor
>  Labels: docker, memory, mesos
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When running the Spark driver in a container, such as when using the Mesos
> dispatcher service, we need to apply the same rules as for executors in order
> to avoid the JVM going over the allotted limit and then being killed.
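A short sketch of the proposed setting, mirroring the existing executor-side overhead option; the driver-side key is what this ticket proposes, so it is shown here as an assumption rather than an existing config:

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.memory", "4g")
  // Existing executor-side knob for JVM overhead on Mesos.
  .set("spark.mesos.executor.memoryOverhead", "512")
  // Proposed driver-side equivalent (the subject of this ticket), in MiB.
  .set("spark.mesos.driver.memoryOverhead", "512")
{code}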






[jira] [Commented] (SPARK-22256) Introduce spark.mesos.driver.memoryOverhead

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248110#comment-17248110
 ] 

Apache Spark commented on SPARK-22256:
--

User 'dmcwhorter' has created a pull request for this issue:
https://github.com/apache/spark/pull/30739

> Introduce spark.mesos.driver.memoryOverhead 
> 
>
> Key: SPARK-22256
> URL: https://issues.apache.org/jira/browse/SPARK-22256
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Cosmin Lehene
>Priority: Minor
>  Labels: docker, memory, mesos
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When running the Spark driver in a container, such as when using the Mesos
> dispatcher service, we need to apply the same rules as for executors in order
> to avoid the JVM going over the allotted limit and then being killed.






[jira] [Created] (SPARK-33761) [k8s] Support fetching driver and executor pod templates from HCFS

2020-12-11 Thread Xuzhou Yin (Jira)
Xuzhou Yin created SPARK-33761:
--

 Summary: [k8s] Support fetching driver and executor pod templates 
from HCFS
 Key: SPARK-33761
 URL: https://issues.apache.org/jira/browse/SPARK-33761
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.1
Reporter: Xuzhou Yin


Currently Spark 3 on Kubernetes supports loading driver and executor pod
templates only from the local file system:
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala#L87]
However, this is not very convenient, as the user needs to either bake the pod
templates into the client pod image or manually mount the file as a ConfigMap. It
would be nice if Spark supported loading pod templates from Hadoop Compatible
File Systems (such as S3A), so that the user can directly update the pod template
files in S3 without changing the underlying Kubernetes job definition (e.g.
updating the Docker image or the ConfigMap).
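For illustration, a hedged sketch of the usage the ticket asks for, pointing the existing pod template properties at an S3A path; today these properties are read from the local file system only, so the s3a:// URIs below describe the desired behavior, not something that currently works:

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.driver.podTemplateFile", "s3a://my-bucket/templates/driver.yaml")
  .set("spark.kubernetes.executor.podTemplateFile", "s3a://my-bucket/templates/executor.yaml")
{code}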






[jira] [Commented] (SPARK-33695) Bump Jackson to 2.10.5 and databind to 2.10.5.1

2020-12-11 Thread Nicholas Marion (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248107#comment-17248107
 ] 

Nicholas Marion commented on SPARK-33695:
-

[~dongjoon] ,

As a security issue, would this qualify for a fix in an earlier version, such 
as Spark 2.4.7? If so, I can create a backport.

> Bump Jackson to 2.10.5 and databind to 2.10.5.1
> ---
>
> Key: SPARK-33695
> URL: https://issues.apache.org/jira/browse/SPARK-33695
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Nicholas Marion
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> Jackson reported a vulnerability under CVE-2020-25649. The version currently
> pulled in by Spark is 2.10.0. Upgrading to 2.10.5.1 will resolve the problem.






[jira] [Resolved] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.

2020-12-11 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-33576.
--
Resolution: Duplicate

Going to resolve as a duplicate, but please reopen if you find it is different

> PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC 
> message: negative bodyLength'.
> -
>
> Key: SPARK-33576
> URL: https://issues.apache.org/jira/browse/SPARK-33576
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
> Environment: Databricks runtime 7.3
> Spark 3.0.1
> Scala 2.12
>Reporter: Darshat
>Priority: Major
>
> Hello,
> We are using Databricks on Azure to process a large amount of e-commerce data.
> The Databricks runtime is 7.3, which includes Apache Spark 3.0.1 and Scala 2.12.
> During processing, there is a groupby operation on the DataFrame that 
> consistently gets an exception of this type:
>  
> {color:#ff}PythonException: An exception was thrown from a UDF: 'OSError: 
> Invalid IPC message: negative bodyLength'. Full traceback below: Traceback 
> (most recent call last): File "/databricks/spark/python/pyspark/worker.py", 
> line 654, in main process() File 
> "/databricks/spark/python/pyspark/worker.py", line 646, in process 
> serializer.dump_stream(out_iter, outfile) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in 
> dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in 
> dump_stream for batch in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in 
> init_stream_yield_batches for series in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in 
> load_stream for batch in batches: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in 
> load_stream for batch in batches: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in 
> load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in 
> __iter__ File "pyarrow/ipc.pxi", line 432, in 
> pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", 
> line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative 
> bodyLength{color}
>  
> Code that causes this:
> {color:#ff}x = df.groupby('providerid').apply(domain_features){color}
> {color:#ff}display(x.info()){color}
> Dataframe size - 22 million rows, 31 columns
>  One of the columns is a string ('providerid') on which we do a groupby 
> followed by an apply  operation. There are 3 distinct provider ids in this 
> set. While trying to enumerate/count the results, we get this exception.
> We've put all possible checks in the code for null values, or corrupt data 
> and we are not able to track this to application level code. I hope we can 
> get some help troubleshooting this as this is a blocker for rolling out at 
> scale.
> The cluster has 8 nodes + driver, all 28GB RAM. I can provide any other 
> settings that could be useful. 
>  Hope to get some insights into the problem. 
> Thanks,
> Darshat Shah






[jira] [Updated] (SPARK-33742) Throw PartitionsAlreadyExistException from HiveExternalCatalog.createPartitions()

2020-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33742:
--
Fix Version/s: (was: 3.2.0)
   3.1.0
   3.0.2
   2.4.8

> Throw PartitionsAlreadyExistException from 
> HiveExternalCatalog.createPartitions()
> -
>
> Key: SPARK-33742
> URL: https://issues.apache.org/jira/browse/SPARK-33742
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> HiveExternalCatalog.createPartitions throws AlreadyExistsException wrapped by 
> AnalysisException. The behavior deviates from V1/V2 in-memory catalogs that 
> throw PartitionsAlreadyExistException.
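A self-contained sketch of the exception-translation idea described above, using stand-in exception classes (the real ones are Hive's metastore AlreadyExistsException and Spark's catalyst PartitionsAlreadyExistException):

{code:scala}
// Stand-in exception types, defined locally so the sketch is self-contained.
class AlreadyExistsException(msg: String) extends Exception(msg)
class PartitionsAlreadyExistException(msg: String) extends Exception(msg)

// Wrap the metastore call so callers see the same exception type that the
// V1/V2 in-memory catalogs throw, instead of an AlreadyExistsException
// wrapped in an AnalysisException.
def createPartitions(doCreate: () => Unit): Unit =
  try doCreate()
  catch {
    case e: AlreadyExistsException =>
      throw new PartitionsAlreadyExistException(e.getMessage)
  }
{code}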






[jira] [Commented] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.

2020-12-11 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248064#comment-17248064
 ] 

Bryan Cutler commented on SPARK-33576:
--

[~darshats] I believe the only current workaround is to further split your 
groups with other keys to get under the 2GB limit. To take advantage of the new 
Arrow improvements for this would most likely require some work on the Spark 
side, but I'd have to look into it more.

> PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC 
> message: negative bodyLength'.
> -
>
> Key: SPARK-33576
> URL: https://issues.apache.org/jira/browse/SPARK-33576
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
> Environment: Databricks runtime 7.3
> Spark 3.0.1
> Scala 2.12
>Reporter: Darshat
>Priority: Major
>
> Hello,
> We are using Databricks on Azure to process a large amount of e-commerce data.
> The Databricks runtime is 7.3, which includes Apache Spark 3.0.1 and Scala 2.12.
> During processing, there is a groupby operation on the DataFrame that 
> consistently gets an exception of this type:
>  
> {color:#ff}PythonException: An exception was thrown from a UDF: 'OSError: 
> Invalid IPC message: negative bodyLength'. Full traceback below: Traceback 
> (most recent call last): File "/databricks/spark/python/pyspark/worker.py", 
> line 654, in main process() File 
> "/databricks/spark/python/pyspark/worker.py", line 646, in process 
> serializer.dump_stream(out_iter, outfile) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in 
> dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in 
> dump_stream for batch in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in 
> init_stream_yield_batches for series in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in 
> load_stream for batch in batches: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in 
> load_stream for batch in batches: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in 
> load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in 
> __iter__ File "pyarrow/ipc.pxi", line 432, in 
> pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", 
> line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative 
> bodyLength{color}
>  
> Code that causes this:
> {color:#ff}x = df.groupby('providerid').apply(domain_features){color}
> {color:#ff}display(x.info()){color}
> Dataframe size - 22 million rows, 31 columns
>  One of the columns is a string ('providerid') on which we do a groupby 
> followed by an apply  operation. There are 3 distinct provider ids in this 
> set. While trying to enumerate/count the results, we get this exception.
> We've put all possible checks in the code for null values, or corrupt data 
> and we are not able to track this to application level code. I hope we can 
> get some help troubleshooting this as this is a blocker for rolling out at 
> scale.
> The cluster has 8 nodes + driver, all 28GB RAM. I can provide any other 
> settings that could be useful. 
>  Hope to get some insights into the problem. 
> Thanks,
> Darshat Shah






[jira] [Comment Edited] (SPARK-33730) Standardize warning types

2020-12-11 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248051#comment-17248051
 ] 

Maciej Szymkiewicz edited comment on SPARK-33730 at 12/11/20, 5:30 PM:
---

{quote}Eh, I think Maciej Szymkiewicz is already working on it
{quote}
Yes, I already started working on that.
{quote}Do you want these print statements to be changed to warnings.warn?
{quote}


That's definitely the first step here, but I think we can do better. 
Specifically, we had an informal discussion about introducing a simple 
{{Warning}} hierarchy, that could be used for more precise warning suppression. 
i.e.

{code:python}
class PySparkWarning(Warning): pass
class PySparkDeprecationWarning(PySparkWarning, DeprecationWarning): pass
{code}


was (Author: zero323):
{quote}Eh, I think Maciej Szymkiewicz is already working on it
{quote}
Yes, I already started working on that.
{quote}Do you want these print statements to be changed to warnings.warn?
{quote}


That's definitely the first step here, but I think we can do better. 
Specifically, we had an informal discussion about introducing a simple 
{{Warning}} hierarchy, that could be used for more precise warning suppression. 
i.e.

{code:pyton}
class PySparkWarning(Warning): pass
class PySparkDeprecationWarning(PySparkWarning, DeprecationWarning): pass
{code}

> Standardize warning types
> -
>
> Key: SPARK-33730
> URL: https://issues.apache.org/jira/browse/SPARK-33730
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should use warnings properly per 
> [https://docs.python.org/3/library/warnings.html#warning-categories]
> In particular,
>  - we should use {{FutureWarning}} instead of {{DeprecationWarning}} for the 
> places we should show the warnings to end-users by default.
>  - we should __maybe__ think about customizing stacklevel 
> ([https://docs.python.org/3/library/warnings.html#warnings.warn]) like pandas 
> does.
>  - ...
> Current warnings are a bit messy and somewhat arbitrary.
> To be more explicit, we'll have to fix:
> {code:java}
> pyspark/context.py:warnings.warn(
> pyspark/context.py:warnings.warn(
> pyspark/ml/classification.py:warnings.warn("weightCol is 
> ignored, "
> pyspark/ml/clustering.py:warnings.warn("Deprecated in 3.0.0. It will 
> be removed in future versions. Use "
> pyspark/mllib/classification.py:warnings.warn(
> pyspark/mllib/feature.py:warnings.warn("Both withMean and withStd 
> are false. The model does nothing.")
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/rdd.py:warnings.warn("mapPartitionsWithSplit is deprecated; "
> pyspark/rdd.py:warnings.warn(
> pyspark/shell.py:warnings.warn("Failed to initialize Spark session.")
> pyspark/shuffle.py:warnings.warn("Please install psutil to have 
> better "
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn("to_replace is a dict 
> and value is not None. value will be ignored.")
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use degrees 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use radians 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use 
> approx_count_distinct instead.", DeprecationWarning)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/functions.py:warnings.warn(
> pyspark/sql/pandas/group_ops.py:warnings.warn(
> pyspark/sql/session.py:warnings.warn("Fall back to non-hive 
> support because failing to access HiveConf, "
> {code}
> PySpark prints warnings via using {{print}} in some places as well. We should 
> also see if we should switch and replace to {{warnings.warn}}.

[jira] [Comment Edited] (SPARK-33730) Standardize warning types

2020-12-11 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248051#comment-17248051
 ] 

Maciej Szymkiewicz edited comment on SPARK-33730 at 12/11/20, 5:29 PM:
---

{quote}Eh, I think Maciej Szymkiewicz is already working on it
{quote}
Yes, I already started working on that.
{quote}Do you want these print statements to be changed to warnings.warn?
{quote}


That's definitely the first step here, but I think we can do better. 
Specifically, we had an informal discussion about introducing a simple 
{{Warning}} hierarchy, that could be used for more precise warning suppression. 
i.e.

{code:pyton}
class PySparkWarning(Warning): pass
class PySparkDeprecationWarning(PySparkWarning, DeprecationWarning): pass
{code}


was (Author: zero323):
{quote}Eh, I think Maciej Szymkiewicz is already working on it
{quote}
Yes, I already started working on that.
{quote}Do you want these print statements to be changed to warnings.warn?
{quote}

That's definitely the first step here, but I think we can do better. 
Specifically, we had an informal discussion about introducing a simple 
{{Warning}} hierarchy, that could be used for more precise warning suppression. 
i.e.

class PySparkWarning(Warning): pass
class PySparkDeprecationWarning(PySparkWarning, DeprecationWarning): pass

> Standardize warning types
> -
>
> Key: SPARK-33730
> URL: https://issues.apache.org/jira/browse/SPARK-33730
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should use warnings properly per 
> [https://docs.python.org/3/library/warnings.html#warning-categories]
> In particular,
>  - we should use {{FutureWarning}} instead of {{DeprecationWarning}} for the 
> places we should show the warnings to end-users by default.
>  - we should __maybe__ think about customizing stacklevel 
> ([https://docs.python.org/3/library/warnings.html#warnings.warn]) like pandas 
> does.
>  - ...
> Current warnings are a bit messy and somewhat arbitrary.
> To be more explicit, we'll have to fix:
> {code:java}
> pyspark/context.py:warnings.warn(
> pyspark/context.py:warnings.warn(
> pyspark/ml/classification.py:warnings.warn("weightCol is 
> ignored, "
> pyspark/ml/clustering.py:warnings.warn("Deprecated in 3.0.0. It will 
> be removed in future versions. Use "
> pyspark/mllib/classification.py:warnings.warn(
> pyspark/mllib/feature.py:warnings.warn("Both withMean and withStd 
> are false. The model does nothing.")
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/rdd.py:warnings.warn("mapPartitionsWithSplit is deprecated; "
> pyspark/rdd.py:warnings.warn(
> pyspark/shell.py:warnings.warn("Failed to initialize Spark session.")
> pyspark/shuffle.py:warnings.warn("Please install psutil to have 
> better "
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn("to_replace is a dict 
> and value is not None. value will be ignored.")
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use degrees 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use radians 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use 
> approx_count_distinct instead.", DeprecationWarning)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/functions.py:warnings.warn(
> pyspark/sql/pandas/group_ops.py:warnings.warn(
> pyspark/sql/session.py:warnings.warn("Fall back to non-hive 
> support because failing to access HiveConf, "
> {code}
> PySpark prints warnings via using {{print}} in some places as well. We should 
> also see if we should switch and replace to {{warnings.warn}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (SPARK-33730) Standardize warning types

2020-12-11 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248051#comment-17248051
 ] 

Maciej Szymkiewicz commented on SPARK-33730:


{quote}Eh, I think Maciej Szymkiewicz is already working on it
{quote}
Yes, I already started working on that.
{quote}Do you want these print statements to be changed to warnings.warn?
{quote}

That's definitely the first step here, but I think we can do better. 
Specifically, we had an informal discussion about introducing a simple 
{{Warning}} hierarchy, that could be used for more precise warning suppression. 
i.e.

class PySparkWarning(Warning): pass
class PySparkDeprecationWarning(PySparkWarning, DeprecationWarning): pass

> Standardize warning types
> -
>
> Key: SPARK-33730
> URL: https://issues.apache.org/jira/browse/SPARK-33730
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should use warnings properly per 
> [https://docs.python.org/3/library/warnings.html#warning-categories]
> In particular,
>  - we should use {{FutureWarning}} instead of {{DeprecationWarning}} for the 
> places we should show the warnings to end-users by default.
>  - we should __maybe__ think about customizing stacklevel 
> ([https://docs.python.org/3/library/warnings.html#warnings.warn]) like pandas 
> does.
>  - ...
> Current warnings are a bit messy and somewhat arbitrary.
> To be more explicit, we'll have to fix:
> {code:java}
> pyspark/context.py:warnings.warn(
> pyspark/context.py:warnings.warn(
> pyspark/ml/classification.py:warnings.warn("weightCol is 
> ignored, "
> pyspark/ml/clustering.py:warnings.warn("Deprecated in 3.0.0. It will 
> be removed in future versions. Use "
> pyspark/mllib/classification.py:warnings.warn(
> pyspark/mllib/feature.py:warnings.warn("Both withMean and withStd 
> are false. The model does nothing.")
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/rdd.py:warnings.warn("mapPartitionsWithSplit is deprecated; "
> pyspark/rdd.py:warnings.warn(
> pyspark/shell.py:warnings.warn("Failed to initialize Spark session.")
> pyspark/shuffle.py:warnings.warn("Please install psutil to have 
> better "
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn("to_replace is a dict 
> and value is not None. value will be ignored.")
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use degrees 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use radians 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use 
> approx_count_distinct instead.", DeprecationWarning)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/functions.py:warnings.warn(
> pyspark/sql/pandas/group_ops.py:warnings.warn(
> pyspark/sql/session.py:warnings.warn("Fall back to non-hive 
> support because failing to access HiveConf, "
> {code}
> PySpark prints warnings via using {{print}} in some places as well. We should 
> also see if we should switch and replace to {{warnings.warn}}.
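For reference, a small sketch of the convention described above, using a made-up function rather than a real pyspark API: {{FutureWarning}} is shown to end users by default (unlike {{DeprecationWarning}}), and {{stacklevel=2}} attributes the warning to the caller's line, as pandas does.

{code:python}
import math
import warnings

def old_degrees(x):
    """Toy stand-in for a deprecated API."""
    warnings.warn(
        "old_degrees is deprecated, use math.degrees instead.",
        FutureWarning,   # visible by default, unlike DeprecationWarning
        stacklevel=2,    # point at the caller, not at this module
    )
    return math.degrees(x)

old_degrees(math.pi)  # the emitted warning references this line
{code}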



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33760) Extend Dynamic Partition Pruning Support to DataSources

2020-12-11 Thread Anoop Johnson (Jira)
Anoop Johnson created SPARK-33760:
-

 Summary: Extend Dynamic Partition Pruning Support to DataSources
 Key: SPARK-33760
 URL: https://issues.apache.org/jira/browse/SPARK-33760
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Anoop Johnson


The implementation of Dynamic Partition Pruning  (DPP) in Spark is 
[specific|https://github.com/apache/spark/blob/fb2e3af4b5d92398d57e61b766466cc7efd9d7cb/sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala#L59-L64]
 to HadoopFSRelation. As a result, DPP is not triggered for queries that use 
data sources. 

The DataSource v2 readers can expose the partition metadata. Can we use this 
metadata and extend DPP to work on data sources as well?

Would appreciate thoughts or corner cases we need to handle.
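For illustration, a PySpark sketch of the query shape where DPP currently kicks in for file-based (HadoopFsRelation) tables; the table and column names are made up.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A partitioned, file-backed "fact" table and a small "dim" table.
spark.range(1000).selectExpr("id", "id % 10 AS part") \
    .write.partitionBy("part").mode("overwrite").saveAsTable("fact")
spark.range(10).selectExpr("id AS part", "id % 2 AS flag") \
    .write.mode("overwrite").saveAsTable("dim")

# The selective filter on dim lets Spark prune fact's partitions at runtime;
# the plan should show a dynamicpruning expression among fact's partition filters.
q = spark.sql("""
    SELECT f.id
    FROM fact f JOIN dim d ON f.part = d.part
    WHERE d.flag = 1
""")
q.explain(True)
{code}

With a DSv2 source in place of fact, that partition filter is not derived today, which is the gap this ticket describes.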



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32398) Upgrade to scalatest 3.2.0 for Scala 2.13.3 compatibility

2020-12-11 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248021#comment-17248021
 ] 

Steve Loughran commented on SPARK-32398:


doesn't work with old versions though. I think I'll end up having to do some 
JAR containing nothing but a FunSuite parent and doing separate releases for 
the different versions. 

bq. I continue to hope that these API-breaking changes won't continue,

yeah, they should have regression tests for this

> Upgrade to scalatest 3.2.0 for Scala 2.13.3 compatibility
> -
>
> Key: SPARK-32398
> URL: https://issues.apache.org/jira/browse/SPARK-32398
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Spark Core, SQL, Structured Streaming, Tests
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
> Fix For: 3.1.0
>
>
> We'll need to update to scalatest 3.2.0 in order to pick up the fix here, 
> which fixes an incompatibility with Scala 2.13.3:
> https://github.com/scalatest/scalatest/commit/7c89416aa9f3e7f2730a343ad6d3bdcff65809de
> That's a big change unfortunately - 3.1 / 3.2 reorganized many classes. 
> Fortunately it's just like import updates in 100 files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33044) Add a Jenkins build and test job for Scala 2.13

2020-12-11 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248013#comment-17248013
 ] 

Shane Knapp commented on SPARK-33044:
-

done

> Add a Jenkins build and test job for Scala 2.13
> ---
>
> Key: SPARK-33044
> URL: https://issues.apache.org/jira/browse/SPARK-33044
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Shane Knapp
>Priority: Major
> Attachments: Screen Shot 2020-12-08 at 1.56.59 PM.png, Screen Shot 
> 2020-12-08 at 1.58.07 PM.png
>
>
> {{Master}} branch seems to be almost ready for Scala 2.13 now, so we need a 
> Jenkins test job to verify the current work results and CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33746) Minikube is failing to start on research-jenkins-worker-05

2020-12-11 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-33746.
-
Resolution: Fixed

> Minikube is failing to start on research-jenkins-worker-05
> --
>
> Key: SPARK-33746
> URL: https://issues.apache.org/jira/browse/SPARK-33746
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Holden Karau
>Assignee: Shane Knapp
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37190/console]
>  
> {code:java}
> + minikube --vm-driver=kvm2 start --memory 6000 --cpus 8
> * minikube v1.7.3 on Ubuntu 20.04
> * Using the kvm2 driver based on user configuration
> ! Unable to update kvm2 driver: unable to acquire lock for 
> {Name:mk900956b073697a4aa6c80a27c6bb0742a99a53 Clock:{} Delay:500ms 
> Timeout:10m0s Cancel:}: unable to open 
> /tmp/juju-mk900956b073697a4aa6c80a27c6bb0742a99a53: permission denied
> * Kubernetes 1.17.3 is now available. If you would like to upgrade, specify: 
> --kubernetes-version=1.17.3
> * Reconfiguring existing host ...
> * Using the running kvm2 "minikube" VM ...
> * 
> X Unable to start VM. Please investigate and run 'minikube delete' if possible
> * Error: [SSH_AUTH_FAILURE] post-start: command runner: ssh client: Error 
> dialing tcp via ssh client: ssh: handshake failed: ssh: unable to 
> authenticate, attempted methods [none publickey], no supported methods 
> remain{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33759) docker entrypoint should using `spark-class` for spark executor

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33759:


Assignee: (was: Apache Spark)

> docker entrypoint should using `spark-class` for spark executor
> ---
>
> Key: SPARK-33759
> URL: https://issues.apache.org/jira/browse/SPARK-33759
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Dung Dang Minh
>Priority: Trivial
>
> In docker {{entrypoint.sh}}, the spark driver uses the {{spark-submit}} command but 
> the spark executor uses a plain {{java}} command, which does not load {{spark-env.sh}} 
> from the {{$SPARK_HOME/conf}} directory.
> This can lead to a configuration mismatch between the driver and executors when 
> {{spark-env.sh}} contains something like custom env vars or pre-start hooks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33759) docker entrypoint should using `spark-class` for spark executor

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248004#comment-17248004
 ] 

Apache Spark commented on SPARK-33759:
--

User 'dungdm93' has created a pull request for this issue:
https://github.com/apache/spark/pull/30738

> docker entrypoint should using `spark-class` for spark executor
> ---
>
> Key: SPARK-33759
> URL: https://issues.apache.org/jira/browse/SPARK-33759
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Dung Dang Minh
>Priority: Trivial
>
> In docker {{entrypoint.sh}}, the spark driver uses the {{spark-submit}} command but 
> the spark executor uses a plain {{java}} command, which does not load {{spark-env.sh}} 
> from the {{$SPARK_HOME/conf}} directory.
> This can lead to a configuration mismatch between the driver and executors when 
> {{spark-env.sh}} contains something like custom env vars or pre-start hooks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33759) docker entrypoint should using `spark-class` for spark executor

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248006#comment-17248006
 ] 

Apache Spark commented on SPARK-33759:
--

User 'dungdm93' has created a pull request for this issue:
https://github.com/apache/spark/pull/30738

> docker entrypoint should using `spark-class` for spark executor
> ---
>
> Key: SPARK-33759
> URL: https://issues.apache.org/jira/browse/SPARK-33759
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Dung Dang Minh
>Priority: Trivial
>
> In docker {{entrypoint.sh}}, the spark driver uses the {{spark-submit}} command but 
> the spark executor uses a plain {{java}} command, which does not load {{spark-env.sh}} 
> from the {{$SPARK_HOME/conf}} directory.
> This can lead to a configuration mismatch between the driver and executors when 
> {{spark-env.sh}} contains something like custom env vars or pre-start hooks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33759) docker entrypoint should using `spark-class` for spark executor

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33759:


Assignee: Apache Spark

> docker entrypoint should using `spark-class` for spark executor
> ---
>
> Key: SPARK-33759
> URL: https://issues.apache.org/jira/browse/SPARK-33759
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Dung Dang Minh
>Assignee: Apache Spark
>Priority: Trivial
>
> In docker {{entrypoint.sh}}, the spark driver uses the {{spark-submit}} command but 
> the spark executor uses a plain {{java}} command, which does not load {{spark-env.sh}} 
> from the {{$SPARK_HOME/conf}} directory.
> This can lead to a configuration mismatch between the driver and executors when 
> {{spark-env.sh}} contains something like custom env vars or pre-start hooks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33044) Add a Jenkins build and test job for Scala 2.13

2020-12-11 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248003#comment-17248003
 ] 

Shane Knapp commented on SPARK-33044:
-

ah, ok...  i thought i just needed to run `./dev/change-scala-version.sh 2.13`. 
 i'll update the build scripts now.

> Add a Jenkins build and test job for Scala 2.13
> ---
>
> Key: SPARK-33044
> URL: https://issues.apache.org/jira/browse/SPARK-33044
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Shane Knapp
>Priority: Major
> Attachments: Screen Shot 2020-12-08 at 1.56.59 PM.png, Screen Shot 
> 2020-12-08 at 1.58.07 PM.png
>
>
> {{Master}} branch seems to be almost ready for Scala 2.13 now, so we need a 
> Jenkins test job to verify the current work results and CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33759) docker entrypoint should using `spark-class` for spark executor

2020-12-11 Thread Dung Dang Minh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dung Dang Minh updated SPARK-33759:
---
Summary: docker entrypoint should using `spark-class` for spark executor  
(was: spark executor should using `spark-class` in docker entrypoint)

> docker entrypoint should using `spark-class` for spark executor
> ---
>
> Key: SPARK-33759
> URL: https://issues.apache.org/jira/browse/SPARK-33759
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Dung Dang Minh
>Priority: Trivial
>
> In docker {{entrypoint.sh}}, the spark driver uses the {{spark-submit}} command but 
> the spark executor uses a plain {{java}} command, which does not load {{spark-env.sh}} 
> from the {{$SPARK_HOME/conf}} directory.
> This can lead to a configuration mismatch between the driver and executors when 
> {{spark-env.sh}} contains something like custom env vars or pre-start hooks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33759) spark executor should using `spark-class` in docker entrypoint

2020-12-11 Thread Dung Dang Minh (Jira)
Dung Dang Minh created SPARK-33759:
--

 Summary: spark executor should using `spark-class` in docker 
entrypoint
 Key: SPARK-33759
 URL: https://issues.apache.org/jira/browse/SPARK-33759
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.1
Reporter: Dung Dang Minh


In docker {{entrypoint.sh}}, the spark driver uses the {{spark-submit}} command but 
the spark executor uses a plain {{java}} command, which does not load {{spark-env.sh}} 
from the {{$SPARK_HOME/conf}} directory.
This can lead to a configuration mismatch between the driver and executors when 
{{spark-env.sh}} contains something like custom env vars or pre-start hooks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33746) Minikube is failing to start on research-jenkins-worker-05

2020-12-11 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247998#comment-17247998
 ] 

Shane Knapp commented on SPARK-33746:
-

k8s master build passed:

[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/709/]

re-enabling this worker now.

> Minikube is failing to start on research-jenkins-worker-05
> --
>
> Key: SPARK-33746
> URL: https://issues.apache.org/jira/browse/SPARK-33746
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Holden Karau
>Assignee: Shane Knapp
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37190/console]
>  
> {code:java}
> + minikube --vm-driver=kvm2 start --memory 6000 --cpus 8
> * minikube v1.7.3 on Ubuntu 20.04
> * Using the kvm2 driver based on user configuration
> ! Unable to update kvm2 driver: unable to acquire lock for 
> {Name:mk900956b073697a4aa6c80a27c6bb0742a99a53 Clock:{} Delay:500ms 
> Timeout:10m0s Cancel:}: unable to open 
> /tmp/juju-mk900956b073697a4aa6c80a27c6bb0742a99a53: permission denied
> * Kubernetes 1.17.3 is now available. If you would like to upgrade, specify: 
> --kubernetes-version=1.17.3
> * Reconfiguring existing host ...
> * Using the running kvm2 "minikube" VM ...
> * 
> X Unable to start VM. Please investigate and run 'minikube delete' if possible
> * Error: [SSH_AUTH_FAILURE] post-start: command runner: ssh client: Error 
> dialing tcp via ssh client: ssh: handshake failed: ssh: unable to 
> authenticate, attempted methods [none publickey], no supported methods 
> remain{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33757) Fix the R dependencies build error on GitHub Actions and AppVeyor

2020-12-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33757.
--
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30737
[https://github.com/apache/spark/pull/30737]

> Fix the R dependencies build error on GitHub Actions and AppVeyor
> -
>
> Key: SPARK-33757
> URL: https://issues.apache.org/jira/browse/SPARK-33757
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> R dependencies build error happens now.
> The reason seems that usethis package is updated 2020/12/10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33526) Add config to control if cancel invoke interrupt task on thriftserver

2020-12-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33526:


Assignee: ulysses you

> Add config to control if cancel invoke interrupt task on thriftserver
> -
>
> Key: SPARK-33526
> URL: https://issues.apache.org/jira/browse/SPARK-33526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> After [#29933|https://github.com/apache/spark/pull/29933], we support cancelling a 
> query on timeout, but the default behavior of `SparkContext.cancelJobGroup` 
> won't interrupt tasks and just lets them finish by themselves. In some cases this is 
> dangerous, e.g., with data skew or a heavy shuffle: a task can linger for a 
> long time after the cancel and its resources will not be released.
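For background, a minimal PySpark sketch of the existing SparkContext mechanism that such a config would toggle; the group id and description are placeholders.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# interruptOnCancel=True asks Spark to send a thread interrupt to running tasks
# when the group is cancelled, instead of letting them run to completion.
sc.setJobGroup("thriftserver-query-42", "long running query",
               interruptOnCancel=True)
# ... run the query's jobs from this thread ...
sc.cancelJobGroup("thriftserver-query-42")
{code}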



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33526) Add config to control if cancel invoke interrupt task on thriftserver

2020-12-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33526.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30481
[https://github.com/apache/spark/pull/30481]

> Add config to control if cancel invoke interrupt task on thriftserver
> -
>
> Key: SPARK-33526
> URL: https://issues.apache.org/jira/browse/SPARK-33526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> After [#29933|https://github.com/apache/spark/pull/29933], we support cancelling a 
> query on timeout, but the default behavior of `SparkContext.cancelJobGroup` 
> won't interrupt tasks and just lets them finish by themselves. In some cases this is 
> dangerous, e.g., with data skew or a heavy shuffle: a task can linger for a 
> long time after the cancel and its resources will not be released.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33758) Prune unnecessary output partitioning when the attribute is not part of output.

2020-12-11 Thread Prakhar Jain (Jira)
Prakhar Jain created SPARK-33758:


 Summary: Prune unnecessary output partitioning when the attribute 
is not part of output.
 Key: SPARK-33758
 URL: https://issues.apache.org/jira/browse/SPARK-33758
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1, 3.1.0
Reporter: Prakhar Jain


Consider the query:

select t1.id from t1 JOIN t2 on t1.id = t2.id

This query will have a top-level Project node which projects only t1.id, but the 
outputPartitioning of this Project node will be:

PartitioningCollection(HashPartitioning(t1.id), HashPartitioning(t2.id))

We should drop HashPartitioning(t2.id) from the outputPartitioning of the Project node.

cc - [~maropu] [~cloud_fan]
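A small PySpark sketch that reproduces the query shape (toy temp views, with broadcast joins disabled so both sides are hash-partitioned on the join key):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # force a shuffle join

spark.range(100).createOrReplaceTempView("t1")
spark.range(100).createOrReplaceTempView("t2")

q = spark.sql("select t1.id from t1 JOIN t2 on t1.id = t2.id")
# The top-level Project keeps only t1.id; its outputPartitioning (an internal
# SparkPlan property, not printed by explain()) is what this ticket proposes to
# trim down to HashPartitioning(t1.id).
q.explain()
{code}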



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33757) Fix the R dependencies build error on GitHub Actions and AppVeyor

2020-12-11 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-33757:
---
Description: 
R dependencies build error happens now.
The reason seems that usethis package is updated 2020/12/10.

  was:
R dependencies build error happens.
The reason seems that usethis package is updated 2020/12/10.


> Fix the R dependencies build error on GitHub Actions and AppVeyor
> -
>
> Key: SPARK-33757
> URL: https://issues.apache.org/jira/browse/SPARK-33757
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> R dependencies build error happens now.
> The reason seems that usethis package is updated 2020/12/10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33757) Fix the R dependencies build error on GitHub Actions and AppVeyor

2020-12-11 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-33757:
---
Summary: Fix the R dependencies build error on GitHub Actions and AppVeyor  
(was: Fix the R dependencies build error on GitHub Actions)

> Fix the R dependencies build error on GitHub Actions and AppVeyor
> -
>
> Key: SPARK-33757
> URL: https://issues.apache.org/jira/browse/SPARK-33757
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> R dependencies build error happens.
> The reason seems that usethis package is updated 2020/12/10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33757) Fix the R dependencies build error on GitHub Actions

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247930#comment-17247930
 ] 

Apache Spark commented on SPARK-33757:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/30737

> Fix the R dependencies build error on GitHub Actions
> 
>
> Key: SPARK-33757
> URL: https://issues.apache.org/jira/browse/SPARK-33757
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> R dependencies build error happens.
> The reason seems that usethis package is updated 2020/12/10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33757) Fix the R dependencies build error on GitHub Actions

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33757:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Fix the R dependencies build error on GitHub Actions
> 
>
> Key: SPARK-33757
> URL: https://issues.apache.org/jira/browse/SPARK-33757
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> R dependencies build error happens.
> The reason seems that usethis package is updated 2020/12/10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33757) Fix the R dependencies build error on GitHub Actions

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33757:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Fix the R dependencies build error on GitHub Actions
> 
>
> Key: SPARK-33757
> URL: https://issues.apache.org/jira/browse/SPARK-33757
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> R dependencies build error happens.
> The reason seems that usethis package is updated 2020/12/10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33757) Fix the R dependencies build error on GitHub Actions

2020-12-11 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-33757:
--

 Summary: Fix the R dependencies build error on GitHub Actions
 Key: SPARK-33757
 URL: https://issues.apache.org/jira/browse/SPARK-33757
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


R dependencies build error happens.
The reason seems that usethis package is updated 2020/12/10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33617) Avoid generating small files for INSERT INTO TABLE from VALUES

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247926#comment-17247926
 ] 

Apache Spark commented on SPARK-33617:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/30736

> Avoid generating small files for INSERT INTO TABLE from  VALUES
> ---
>
> Key: SPARK-33617
> URL: https://issues.apache.org/jira/browse/SPARK-33617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1(id int) stored as textfile;
> insert into table t1 values (1), (2), (3), (4), (5), (6), (7), (8);
> {code}
> It will generate these files:
> {noformat}
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-0-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-1-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-2-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-3-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-4-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-5-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-6-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-7-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33617) Avoid generating small files for INSERT INTO TABLE from VALUES

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247925#comment-17247925
 ] 

Apache Spark commented on SPARK-33617:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/30736

> Avoid generating small files for INSERT INTO TABLE from  VALUES
> ---
>
> Key: SPARK-33617
> URL: https://issues.apache.org/jira/browse/SPARK-33617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1(id int) stored as textfile;
> insert into table t1 values (1), (2), (3), (4), (5), (6), (7), (8);
> {code}
> It will generate these files:
> {noformat}
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-0-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-1-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-2-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-3-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-4-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-5-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-6-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> -rwxr-xr-x 1 root root 2 Nov 30 23:07 
> part-7-76a5ddf9-10df-41f8-ac19-8186449d958d-c000
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33731) Standardize exception types

2020-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247923#comment-17247923
 ] 

Hyukjin Kwon commented on SPARK-33731:
--

Sure, thanks [~shril]. I have some thoughts in mind. I will clarify them later 
in the JIRA description.

> Standardize exception types
> ---
>
> Key: SPARK-33731
> URL: https://issues.apache.org/jira/browse/SPARK-33731
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should:
> - have a better hierarchy for exception types
> - or at least use the default type of exceptions correctly instead of just 
> throwing a plain Exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33730) Standardize warning types

2020-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247922#comment-17247922
 ] 

Hyukjin Kwon edited comment on SPARK-33730 at 12/11/20, 1:17 PM:
-

Eh, I think [~zero323] is already working on it\(?\). I will share more details 
in SPARK-33731 soon. What about taking a look at SPARK-33731 :-)?


was (Author: hyukjin.kwon):
Eh, I think [~zero323] is already working on it(?). I will share more details 
in SPARK-33731 soon. What about taking a look at SPARK-33731 :-)?

> Standardize warning types
> -
>
> Key: SPARK-33730
> URL: https://issues.apache.org/jira/browse/SPARK-33730
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should use warnings properly per 
> [https://docs.python.org/3/library/warnings.html#warning-categories]
> In particular,
>  - we should use {{FutureWarning}} instead of {{DeprecationWarning}} for the 
> places we should show the warnings to end-users by default.
>  - we should __maybe__ think about customizing stacklevel 
> ([https://docs.python.org/3/library/warnings.html#warnings.warn]) like pandas 
> does.
>  - ...
> Current warnings are a bit messy and somewhat arbitrary.
> To be more explicit, we'll have to fix:
> {code:java}
> pyspark/context.py:warnings.warn(
> pyspark/context.py:warnings.warn(
> pyspark/ml/classification.py:warnings.warn("weightCol is 
> ignored, "
> pyspark/ml/clustering.py:warnings.warn("Deprecated in 3.0.0. It will 
> be removed in future versions. Use "
> pyspark/mllib/classification.py:warnings.warn(
> pyspark/mllib/feature.py:warnings.warn("Both withMean and withStd 
> are false. The model does nothing.")
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/rdd.py:warnings.warn("mapPartitionsWithSplit is deprecated; "
> pyspark/rdd.py:warnings.warn(
> pyspark/shell.py:warnings.warn("Failed to initialize Spark session.")
> pyspark/shuffle.py:warnings.warn("Please install psutil to have 
> better "
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn("to_replace is a dict 
> and value is not None. value will be ignored.")
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use degrees 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use radians 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use 
> approx_count_distinct instead.", DeprecationWarning)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/functions.py:warnings.warn(
> pyspark/sql/pandas/group_ops.py:warnings.warn(
> pyspark/sql/session.py:warnings.warn("Fall back to non-hive 
> support because failing to access HiveConf, "
> {code}
> PySpark prints warnings via using {{print}} in some places as well. We should 
> also see if we should switch and replace to {{warnings.warn}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33730) Standardize warning types

2020-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247922#comment-17247922
 ] 

Hyukjin Kwon commented on SPARK-33730:
--

Eh, I think [~zero323] is already working on it(?). I will share more details 
in SPARK-33731 soon. What about taking a look at SPARK-33731 :-)?

> Standardize warning types
> -
>
> Key: SPARK-33730
> URL: https://issues.apache.org/jira/browse/SPARK-33730
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should use warnings properly per 
> [https://docs.python.org/3/library/warnings.html#warning-categories]
> In particular,
>  - we should use {{FutureWarning}} instead of {{DeprecationWarning}} for the 
> places we should show the warnings to end-users by default.
>  - we should __maybe__ think about customizing stacklevel 
> ([https://docs.python.org/3/library/warnings.html#warnings.warn]) like pandas 
> does.
>  - ...
> Current warnings are a bit messy and somewhat arbitrary.
> To be more explicit, we'll have to fix:
> {code:java}
> pyspark/context.py:warnings.warn(
> pyspark/context.py:warnings.warn(
> pyspark/ml/classification.py:warnings.warn("weightCol is 
> ignored, "
> pyspark/ml/clustering.py:warnings.warn("Deprecated in 3.0.0. It will 
> be removed in future versions. Use "
> pyspark/mllib/classification.py:warnings.warn(
> pyspark/mllib/feature.py:warnings.warn("Both withMean and withStd 
> are false. The model does nothing.")
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/rdd.py:warnings.warn("mapPartitionsWithSplit is deprecated; "
> pyspark/rdd.py:warnings.warn(
> pyspark/shell.py:warnings.warn("Failed to initialize Spark session.")
> pyspark/shuffle.py:warnings.warn("Please install psutil to have 
> better "
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn("to_replace is a dict 
> and value is not None. value will be ignored.")
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use degrees 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use radians 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use 
> approx_count_distinct instead.", DeprecationWarning)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/functions.py:warnings.warn(
> pyspark/sql/pandas/group_ops.py:warnings.warn(
> pyspark/sql/session.py:warnings.warn("Fall back to non-hive 
> support because failing to access HiveConf, "
> {code}
> PySpark prints warnings via using {{print}} in some places as well. We should 
> also see if we should switch and replace to {{warnings.warn}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33748) Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33748:


Assignee: (was: Apache Spark)

> Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables
> --
>
> Key: SPARK-33748
> URL: https://issues.apache.org/jira/browse/SPARK-33748
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See [https://github.com/apache/spark/pull/21092#discussion_r540240095.]
> We should respect PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON like we do in all 
> other places.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33748) Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247914#comment-17247914
 ] 

Apache Spark commented on SPARK-33748:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30735

> Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables
> --
>
> Key: SPARK-33748
> URL: https://issues.apache.org/jira/browse/SPARK-33748
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See [https://github.com/apache/spark/pull/21092#discussion_r540240095.]
> We should respect PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON like we do in all 
> other places.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33748) Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33748:


Assignee: Apache Spark

> Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables
> --
>
> Key: SPARK-33748
> URL: https://issues.apache.org/jira/browse/SPARK-33748
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> See [https://github.com/apache/spark/pull/21092#discussion_r540240095.]
> We should respect PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON like we do in all 
> other places.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33748) Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables

2020-12-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33748:
-
Summary: Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment 
variables  (was: Respect PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON)

> Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables
> --
>
> Key: SPARK-33748
> URL: https://issues.apache.org/jira/browse/SPARK-33748
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See [https://github.com/apache/spark/pull/21092#discussion_r540240095.]
> We should respect PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON like we do in all 
> other places.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33706) Require fully specified partition identifier in partitionExists()

2020-12-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33706:
---

Assignee: Maxim Gekk

> Require fully specified partition identifier in partitionExists()
> -
>
> Key: SPARK-33706
> URL: https://issues.apache.org/jira/browse/SPARK-33706
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently, partitionExists() from SupportsPartitionManagement accepts any 
> partition identifier, even one that is not fully specified. This ticket aims to 
> add a check on the lengths of the partition schema and the partition identifier, 
> and require an exact match. So, we should prohibit IDs that are not fully specified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33706) Require fully specified partition identifier in partitionExists()

2020-12-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33706.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30667
[https://github.com/apache/spark/pull/30667]

> Require fully specified partition identifier in partitionExists()
> -
>
> Key: SPARK-33706
> URL: https://issues.apache.org/jira/browse/SPARK-33706
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, partitionExists() from SupportsPartitionManagement accepts any 
> partition identifier, even one that is not fully specified. This ticket aims to 
> add a check on the lengths of the partition schema and the partition identifier, 
> and require an exact match. So, we should prohibit IDs that are not fully specified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33755) Allow creating orc table when row format separator is defined

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33755:


Assignee: (was: Apache Spark)

> Allow creating orc table when row format separator is defined
> -
>
> Key: SPARK-33755
> URL: https://issues.apache.org/jira/browse/SPARK-33755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: xiepengjie
>Priority: Major
>
> When creating table like this:
> {code:java}
> create table test_orc(c1 string) row format delimited fields terminated by 
> '002' stored as orcfile;
> {code}
> Spark throws an exception like:
> {code:java}
> Operation
>   not allowed: ROW FORMAT DELIMITED is only compatible with 'textfile', not
>   'orcfile'(line 2, pos 0)
> {code}
> I don't think we need such a strict rule; we can support it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33755) Allow creating orc table when row format separator is defined

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247894#comment-17247894
 ] 

Apache Spark commented on SPARK-33755:
--

User 'StefanXiepj' has created a pull request for this issue:
https://github.com/apache/spark/pull/30734

> Allow creating orc table when row format separator is defined
> -
>
> Key: SPARK-33755
> URL: https://issues.apache.org/jira/browse/SPARK-33755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: xiepengjie
>Priority: Major
>
> When creating a table like this:
> {code:java}
> create table test_orc(c1 string) row format delimited fields terminated by 
> '002' stored as orcfile;
> {code}
> Spark throws an exception like:
> {code:java}
> Operation
>   not allowed: ROW FORMAT DELIMITED is only compatible with 'textfile', not
>   'orcfile'(line 2, pos 0)
> {code}
> I don't think we need such a strict rule; we can support it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33755) Allow creating orc table when row format separator is defined

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33755:


Assignee: (was: Apache Spark)

> Allow creating orc table when row format separator is defined
> -
>
> Key: SPARK-33755
> URL: https://issues.apache.org/jira/browse/SPARK-33755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: xiepengjie
>Priority: Major
>
> When creating a table like this:
> {code:java}
> create table test_orc(c1 string) row format delimited fields terminated by 
> '002' stored as orcfile;
> {code}
> Spark throws an exception like:
> {code:java}
> Operation
>   not allowed: ROW FORMAT DELIMITED is only compatible with 'textfile', not
>   'orcfile'(line 2, pos 0)
> {code}
> I don't think we need such a strict rule; we can support it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33755) Allow creating orc table when row format separator is defined

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33755:


Assignee: Apache Spark

> Allow creating orc table when row format separator is defined
> -
>
> Key: SPARK-33755
> URL: https://issues.apache.org/jira/browse/SPARK-33755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: xiepengjie
>Assignee: Apache Spark
>Priority: Major
>
> When creating a table like this:
> {code:java}
> create table test_orc(c1 string) row format delimited fields terminated by 
> '002' stored as orcfile;
> {code}
> Spark throws an exception like:
> {code:java}
> Operation
>   not allowed: ROW FORMAT DELIMITED is only compatible with 'textfile', not
>   'orcfile'(line 2, pos 0)
> {code}
> I don't think we need such a strict rule; we can support it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33654) Migrate CACHE TABLE to new resolution framework

2020-12-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33654:
---

Assignee: Terry Kim

> Migrate CACHE TABLE to new resolution framework
> ---
>
> Key: SPARK-33654
> URL: https://issues.apache.org/jira/browse/SPARK-33654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>
> Migrate CACHE TABLE to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33654) Migrate CACHE TABLE to new resolution framework

2020-12-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33654.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30598
[https://github.com/apache/spark/pull/30598]

> Migrate CACHE TABLE to new resolution framework
> ---
>
> Key: SPARK-33654
> URL: https://issues.apache.org/jira/browse/SPARK-33654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.2.0
>
>
> Migrate CACHE TABLE to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33635) Performance regression in Kafka read

2020-12-11 Thread David Wyles (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247879#comment-17247879
 ] 

David Wyles commented on SPARK-33635:
-

[~gsomogyi] 
"try to turn off Kafka consumer caching. Apart from that there were no super 
significant changes which could cause this."

What's the option for that?

 

> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in standalone mode - 5 nodes, 10 workers (2 workers per node, each 
> locked to a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> Centos 7.7.1908
> 1 topic, 10 partitions, 1 hour queue life
> (this is just one of the clusters we have; I have tested on all of them and 
> they all exhibit the same performance degradation)
>Reporter: David Wyles
>Priority: Major
>
> I have observed a slowdown in reading data from kafka on all of our systems 
> when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1).
> I have created a sample project to isolate the problem as much as possible; 
> it just reads all data from a kafka topic (see 
> [https://github.com/codegorillauk/spark-kafka-read] ).
> With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 
> (1.12 million) rows per second.
> With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 
> 632,000 (0.632 million) rows per second.
> That represents a *44% loss in performance*, which is a lot.
> I have been working through the spark-sql-kafka-0-10 code base, but changes 
> for Spark 3 have been ongoing for over a year and it's difficult to pinpoint 
> an exact change or reason for the degradation.
> I am happy to help fix this problem, but will need some assistance as I am 
> unfamiliar with the spark-sql-kafka-0-10 project.
>  
> A sample of the data my test reads (note: it's not parsing csv - this is just 
> test data)
>  
> 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine
>  Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) 
> [C],2017/12/15 00:00:00,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33635) Performance regression in Kafka read

2020-12-11 Thread David Wyles (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247878#comment-17247878
 ] 

David Wyles commented on SPARK-33635:
-

"Since you're measuring speed I've ported the Kafka source from DSv1 to DSv2. 
DSv1 is the default but the DSv2 can be tried out by setting 
"spark.sql.sources.useV1SourceList" properly. If you can try it out I would 
appreciate it."

Is that available already in the 3.0.1 build, or do I need to pull and build it 
myself?
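
For reference, a minimal sketch of how that config can be supplied when building the session (assumption: the value shown is simply the usual default V1 list with "kafka" removed; whether the DSv2 path behaves differently in a given 3.0.x build is exactly what the test would show). Broker and topic names are placeholders:
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: dropping "kafka" from spark.sql.sources.useV1SourceList should
// make the reader pick the DSv2 Kafka source.
val spark = SparkSession.builder()
  .appName("kafka-read-benchmark")
  .config("spark.sql.sources.useV1SourceList", "avro,csv,json,orc,parquet,text")
  .getOrCreate()

// Batch read of the whole topic, mirroring the read-everything benchmark.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my-topic")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()

println(df.count())
{code}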

> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in standalone mode - 5 nodes, 10 workers (2 workers per node, each 
> locked to a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> Centos 7.7.1908
> 1 topic, 10 partitions, 1 hour queue life
> (this is just one of the clusters we have; I have tested on all of them and 
> they all exhibit the same performance degradation)
>Reporter: David Wyles
>Priority: Major
>
> I have observed a slowdown in reading data from kafka on all of our systems 
> when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1).
> I have created a sample project to isolate the problem as much as possible; 
> it just reads all data from a kafka topic (see 
> [https://github.com/codegorillauk/spark-kafka-read] ).
> With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 
> (1.12 million) rows per second.
> With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 
> 632,000 (0.632 million) rows per second.
> That represents a *44% loss in performance*, which is a lot.
> I have been working through the spark-sql-kafka-0-10 code base, but changes 
> for Spark 3 have been ongoing for over a year and it's difficult to pinpoint 
> an exact change or reason for the degradation.
> I am happy to help fix this problem, but will need some assistance as I am 
> unfamiliar with the spark-sql-kafka-0-10 project.
>  
> A sample of the data my test reads (note: it's not parsing csv - this is just 
> test data)
>  
> 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine
>  Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) 
> [C],2017/12/15 00:00:00,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read

2020-12-11 Thread David Wyles (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247875#comment-17247875
 ] 

David Wyles edited comment on SPARK-33635 at 12/11/20, 12:12 PM:
-

I'll give all those a go and get back to you.

 

The collect in this test case is only 13 items of data after the group by - so 
I know that's not going to impact it.

But I can modify it to just read and write to kafka.


was (Author: david.wyles):
I'll give all those a go and get back to you.

 

The collect in this test case is only 13 items of data after the group by - so 
I know that's not going to impact it.

But I can modify it to just read and write to kafka.

> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in standalone mode - 5 nodes, 10 workers (2 workers per node, each 
> locked to a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> Centos 7.7.1908
> 1 topic, 10 partitions, 1 hour queue life
> (this is just one of the clusters we have; I have tested on all of them and 
> they all exhibit the same performance degradation)
>Reporter: David Wyles
>Priority: Major
>
> I have observed a slowdown in reading data from kafka on all of our systems 
> when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1).
> I have created a sample project to isolate the problem as much as possible; 
> it just reads all data from a kafka topic (see 
> [https://github.com/codegorillauk/spark-kafka-read] ).
> With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 
> (1.12 million) rows per second.
> With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 
> 632,000 (0.632 million) rows per second.
> That represents a *44% loss in performance*, which is a lot.
> I have been working through the spark-sql-kafka-0-10 code base, but changes 
> for Spark 3 have been ongoing for over a year and it's difficult to pinpoint 
> an exact change or reason for the degradation.
> I am happy to help fix this problem, but will need some assistance as I am 
> unfamiliar with the spark-sql-kafka-0-10 project.
>  
> A sample of the data my test reads (note: it's not parsing csv - this is just 
> test data)
>  
> 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine
>  Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) 
> [C],2017/12/15 00:00:00,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read

2020-12-11 Thread David Wyles (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247875#comment-17247875
 ] 

David Wyles edited comment on SPARK-33635 at 12/11/20, 12:11 PM:
-

I'll give all those a go and get back to you.

 

The collect in this test case is only 13 items of data - so I know that's not 
going to impact it.


was (Author: david.wyles):
I'll give all those a go and get back to you.

> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in standalone mode - 5 nodes, 10 workers (2 workers per node, each 
> locked to a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> Centos 7.7.1908
> 1 topic, 10 partitions, 1 hour queue life
> (this is just one of the clusters we have; I have tested on all of them and 
> they all exhibit the same performance degradation)
>Reporter: David Wyles
>Priority: Major
>
> I have observed a slowdown in reading data from kafka on all of our systems 
> when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1).
> I have created a sample project to isolate the problem as much as possible; 
> it just reads all data from a kafka topic (see 
> [https://github.com/codegorillauk/spark-kafka-read] ).
> With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 
> (1.12 million) rows per second.
> With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 
> 632,000 (0.632 million) rows per second.
> That represents a *44% loss in performance*, which is a lot.
> I have been working through the spark-sql-kafka-0-10 code base, but changes 
> for Spark 3 have been ongoing for over a year and it's difficult to pinpoint 
> an exact change or reason for the degradation.
> I am happy to help fix this problem, but will need some assistance as I am 
> unfamiliar with the spark-sql-kafka-0-10 project.
>  
> A sample of the data my test reads (note: it's not parsing csv - this is just 
> test data)
>  
> 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine
>  Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) 
> [C],2017/12/15 00:00:00,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read

2020-12-11 Thread David Wyles (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247875#comment-17247875
 ] 

David Wyles edited comment on SPARK-33635 at 12/11/20, 12:11 PM:
-

I'll give all those a go and get back to you.

 

The collect in this test case is only 13 items of data after the group by - so 
I know that's not going to impact it.

But I can modify it to just read and write to kafka.


was (Author: david.wyles):
I'll give all those a go and get back to you.

 

The collect in this test case is only 13 items of data - so I know that's not 
going to impact it.

> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in standalone mode - 5 nodes, 10 workers (2 workers per node, each 
> locked to a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> Centos 7.7.1908
> 1 topic, 10 partitions, 1 hour queue life
> (this is just one of the clusters we have; I have tested on all of them and 
> they all exhibit the same performance degradation)
>Reporter: David Wyles
>Priority: Major
>
> I have observed a slowdown in reading data from kafka on all of our systems 
> when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1).
> I have created a sample project to isolate the problem as much as possible; 
> it just reads all data from a kafka topic (see 
> [https://github.com/codegorillauk/spark-kafka-read] ).
> With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 
> (1.12 million) rows per second.
> With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 
> 632,000 (0.632 million) rows per second.
> That represents a *44% loss in performance*, which is a lot.
> I have been working through the spark-sql-kafka-0-10 code base, but changes 
> for Spark 3 have been ongoing for over a year and it's difficult to pinpoint 
> an exact change or reason for the degradation.
> I am happy to help fix this problem, but will need some assistance as I am 
> unfamiliar with the spark-sql-kafka-0-10 project.
>  
> A sample of the data my test reads (note: it's not parsing csv - this is just 
> test data)
>  
> 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine
>  Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) 
> [C],2017/12/15 00:00:00,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33635) Performance regression in Kafka read

2020-12-11 Thread David Wyles (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247875#comment-17247875
 ] 

David Wyles commented on SPARK-33635:
-

I'll give all those a go and get back to you.

> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in standalone mode - 5 nodes, 10 workers (2 workers per node, each 
> locked to a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> Centos 7.7.1908
> 1 topic, 10 partitions, 1 hour queue life
> (this is just one of the clusters we have; I have tested on all of them and 
> they all exhibit the same performance degradation)
>Reporter: David Wyles
>Priority: Major
>
> I have observed a slowdown in reading data from kafka on all of our systems 
> when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1).
> I have created a sample project to isolate the problem as much as possible; 
> it just reads all data from a kafka topic (see 
> [https://github.com/codegorillauk/spark-kafka-read] ).
> With 2.4.5, across multiple runs, I get a stable read rate of 1,120,000 
> (1.12 million) rows per second.
> With 3.0.0 or 3.0.1, across multiple runs, I get a stable read rate of 
> 632,000 (0.632 million) rows per second.
> That represents a *44% loss in performance*, which is a lot.
> I have been working through the spark-sql-kafka-0-10 code base, but changes 
> for Spark 3 have been ongoing for over a year and it's difficult to pinpoint 
> an exact change or reason for the degradation.
> I am happy to help fix this problem, but will need some assistance as I am 
> unfamiliar with the spark-sql-kafka-0-10 project.
>  
> A sample of the data my test reads (note: it's not parsing csv - this is just 
> test data)
>  
> 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine
>  Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) 
> [C],2017/12/15 00:00:00,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33691) Support partition filters in ALTER TABLE DROP PARTITION

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247870#comment-17247870
 ] 

Apache Spark commented on SPARK-33691:
--

User 'StefanXiepj' has created a pull request for this issue:
https://github.com/apache/spark/pull/30733

> Support partition filters in ALTER TABLE DROP PARTITION
> ---
>
> Key: SPARK-33691
> URL: https://issues.apache.org/jira/browse/SPARK-33691
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: xiepengjie
>Priority: Major
>
> User 'mgaido91' has created a pull request for this issue:
> [https://github.com/apache/spark/pull/20999]
> but I found some problems when using it, so I tried to rewrite it.
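
A hedged illustration of the kind of statement this improvement targets; the comparison operator in the partition spec below reflects the intended syntax, not something released Spark versions are assumed to accept:
{code:scala}
// Illustrative only: drop every partition older than a given date in one
// statement once comparison operators are supported in the partition spec.
spark.sql("ALTER TABLE logs DROP IF EXISTS PARTITION (dt < '2020-01-01')")
{code}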



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33691) Support partition filters in ALTER TABLE DROP PARTITION

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247871#comment-17247871
 ] 

Apache Spark commented on SPARK-33691:
--

User 'StefanXiepj' has created a pull request for this issue:
https://github.com/apache/spark/pull/30733

> Support partition filters in ALTER TABLE DROP PARTITION
> ---
>
> Key: SPARK-33691
> URL: https://issues.apache.org/jira/browse/SPARK-33691
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: xiepengjie
>Priority: Major
>
> User 'mgaido91' has created a pull request for this issue:
> [https://github.com/apache/spark/pull/20999]
> but I found some problems when using it, so I tried to rewrite it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33742) Throw PartitionsAlreadyExistException from HiveExternalCatalog.createPartitions()

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247866#comment-17247866
 ] 

Apache Spark commented on SPARK-33742:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30732

> Throw PartitionsAlreadyExistException from 
> HiveExternalCatalog.createPartitions()
> -
>
> Key: SPARK-33742
> URL: https://issues.apache.org/jira/browse/SPARK-33742
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> HiveExternalCatalog.createPartitions throws AlreadyExistsException wrapped by 
> AnalysisException. The behavior deviates from V1/V2 in-memory catalogs that 
> throw PartitionsAlreadyExistException.
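
A ScalaTest-style sketch of the behaviour this change aligns on (assumptions: `externalCatalog` is a HiveExternalCatalog under test, and `partitionWithSpec` is a hypothetical helper that builds a CatalogTablePartition for a spec that already exists in the table):
{code:scala}
import org.apache.spark.sql.catalyst.analysis.PartitionsAlreadyExistException

// Expected behaviour after the change: re-creating an existing partition should
// surface PartitionsAlreadyExistException, matching the V1/V2 in-memory catalogs.
val duplicateSpec = Map("dt" -> "2020-12-11")
intercept[PartitionsAlreadyExistException] {
  externalCatalog.createPartitions(
    "default", "tbl", Seq(partitionWithSpec(duplicateSpec)), ignoreIfExists = false)
}
{code}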



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33742) Throw PartitionsAlreadyExistException from HiveExternalCatalog.createPartitions()

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247850#comment-17247850
 ] 

Apache Spark commented on SPARK-33742:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30730

> Throw PartitionsAlreadyExistException from 
> HiveExternalCatalog.createPartitions()
> -
>
> Key: SPARK-33742
> URL: https://issues.apache.org/jira/browse/SPARK-33742
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> HiveExternalCatalog.createPartitions throws AlreadyExistsException wrapped by 
> AnalysisException. The behavior deviates from V1/V2 in-memory catalogs that 
> throw PartitionsAlreadyExistException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33730) Standardize warning types

2020-12-11 Thread Shril Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247827#comment-17247827
 ] 

Shril Kumar edited comment on SPARK-33730 at 12/11/20, 10:40 AM:
-

[~hyukjin.kwon], [~zero323]

I was inspecting the use of the print statements in PySpark. Please confirm if 
my assumption is correct.
{code:python}
# pyspark/worker.py

try:
    (soft_limit, hard_limit) = resource.getrlimit(total_memory)
    msg = "Current mem limits: {0} of max {1}\n".format(soft_limit, hard_limit)
    print(msg, file=sys.stderr)

    # convert to bytes
    new_limit = memory_limit_mb * 1024 * 1024

    if soft_limit == resource.RLIM_INFINITY or new_limit < soft_limit:
        msg = "Setting mem limits to {0} of max {1}\n".format(new_limit, new_limit)
        print(msg, file=sys.stderr)
        resource.setrlimit(total_memory, (new_limit, new_limit))
except (resource.error, OSError, ValueError) as e:
    # not all systems support resource limits, so warn instead of failing
    print("WARN: Failed to set memory limit: {0}\n".format(e), file=sys.stderr)
{code}
Do you want these print statements to be changed to warnings.warn?

 

 


was (Author: shril):
I was inspecting the use of the print statements in PySpark. Please confirm if 
my assumption is correct.

 
{code:python}
# pyspark/worker.py

try:
    (soft_limit, hard_limit) = resource.getrlimit(total_memory)
    msg = "Current mem limits: {0} of max {1}\n".format(soft_limit, hard_limit)
    print(msg, file=sys.stderr)

    # convert to bytes
    new_limit = memory_limit_mb * 1024 * 1024

    if soft_limit == resource.RLIM_INFINITY or new_limit < soft_limit:
        msg = "Setting mem limits to {0} of max {1}\n".format(new_limit, new_limit)
        print(msg, file=sys.stderr)
        resource.setrlimit(total_memory, (new_limit, new_limit))
except (resource.error, OSError, ValueError) as e:
    # not all systems support resource limits, so warn instead of failing
    print("WARN: Failed to set memory limit: {0}\n".format(e), file=sys.stderr)
{code}
Do you want these print statements to be changed to warnings.warn?

 

 

> Standardize warning types
> -
>
> Key: SPARK-33730
> URL: https://issues.apache.org/jira/browse/SPARK-33730
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should use warnings properly per 
> [https://docs.python.org/3/library/warnings.html#warning-categories]
> In particular,
>  - we should use {{FutureWarning}} instead of {{DeprecationWarning}} for the 
> places we should show the warnings to end-users by default.
>  - we should __maybe__ think about customizing stacklevel 
> ([https://docs.python.org/3/library/warnings.html#warnings.warn]) like pandas 
> does.
>  - ...
> Current warnings are a bit messy and somewhat arbitrary.
> To be more explicit, we'll have to fix:
> {code:java}
> pyspark/context.py:warnings.warn(
> pyspark/context.py:warnings.warn(
> pyspark/ml/classification.py:warnings.warn("weightCol is 
> ignored, "
> pyspark/ml/clustering.py:warnings.warn("Deprecated in 3.0.0. It will 
> be removed in future versions. Use "
> pyspark/mllib/classification.py:warnings.warn(
> pyspark/mllib/feature.py:warnings.warn("Both withMean and withStd 
> are false. The model does nothing.")
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/rdd.py:warnings.warn("mapPartitionsWithSplit is deprecated; "
> pyspark/rdd.py:warnings.warn(
> pyspark/shell.py:warnings.warn("Failed to initialize Spark session.")
> pyspark/shuffle.py:warnings.warn("Please install psutil to have 
> better "
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn("to_replace is a dict 
> and value is not None. value will be ignored.")
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use degrees 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use radians 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use 
> approx_count_distinct instead.", 

[jira] [Commented] (SPARK-33730) Standardize warning types

2020-12-11 Thread Shril Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247827#comment-17247827
 ] 

Shril Kumar commented on SPARK-33730:
-

I was inspecting the use of the print statements in PySpark. Please confirm if 
my assumption is correct.

 
{code:java}
# pyspark/worker.py

try:
(soft_limit, hard_limit) = resource.getrlimit(total_memory)
msg = "Current mem limits: {0} of max {1}\n".format(soft_limit, hard_limit)
print(msg, file=sys.stderr)# convert to bytes
new_limit = memory_limit_mb * 1024 * 1024if soft_limit == 
resource.RLIM_INFINITY or new_limit < soft_limit:
msg = "Setting mem limits to {0} of max {1}\n".format(new_limit, 
new_limit)
print(msg, file=sys.stderr)
resource.setrlimit(total_memory, (new_limit, new_limit))except 
(resource.error, OSError, ValueError) as e:
# not all systems support resource limits, so warn instead of failing
print("WARN: Failed to set memory limit: {0}\n".format(e), file=sys.stderr)
{code}
Do you want these print statements to be changed to warnings.warn?

 

 

> Standardize warning types
> -
>
> Key: SPARK-33730
> URL: https://issues.apache.org/jira/browse/SPARK-33730
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should use warnings properly per 
> [https://docs.python.org/3/library/warnings.html#warning-categories]
> In particular,
>  - we should use {{FutureWarning}} instead of {{DeprecationWarning}} for the 
> places we should show the warnings to end-users by default.
>  - we should __maybe__ think about customizing stacklevel 
> ([https://docs.python.org/3/library/warnings.html#warnings.warn]) like pandas 
> does.
>  - ...
> Current warnings are a bit messy and somewhat arbitrary.
> To be more explicit, we'll have to fix:
> {code:java}
> pyspark/context.py:warnings.warn(
> pyspark/context.py:warnings.warn(
> pyspark/ml/classification.py:warnings.warn("weightCol is 
> ignored, "
> pyspark/ml/clustering.py:warnings.warn("Deprecated in 3.0.0. It will 
> be removed in future versions. Use "
> pyspark/mllib/classification.py:warnings.warn(
> pyspark/mllib/feature.py:warnings.warn("Both withMean and withStd 
> are false. The model does nothing.")
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/mllib/regression.py:warnings.warn(
> pyspark/rdd.py:warnings.warn("mapPartitionsWithSplit is deprecated; "
> pyspark/rdd.py:warnings.warn(
> pyspark/shell.py:warnings.warn("Failed to initialize Spark session.")
> pyspark/shuffle.py:warnings.warn("Please install psutil to have 
> better "
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/catalog.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/column.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/context.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn(
> pyspark/sql/dataframe.py:warnings.warn("to_replace is a dict 
> and value is not None. value will be ignored.")
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use degrees 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use radians 
> instead.", DeprecationWarning)
> pyspark/sql/functions.py:warnings.warn("Deprecated in 2.1, use 
> approx_count_distinct instead.", DeprecationWarning)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/conversion.py:warnings.warn(msg)
> pyspark/sql/pandas/functions.py:warnings.warn(
> pyspark/sql/pandas/group_ops.py:warnings.warn(
> pyspark/sql/session.py:warnings.warn("Fall back to non-hive 
> support because failing to access HiveConf, "
> {code}
> PySpark prints warnings via using {{print}} in some places as well. We should 
> also see if we should switch and replace to {{warnings.warn}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33742) Throw PartitionsAlreadyExistException from HiveExternalCatalog.createPartitions()

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247822#comment-17247822
 ] 

Apache Spark commented on SPARK-33742:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30729

> Throw PartitionsAlreadyExistException from 
> HiveExternalCatalog.createPartitions()
> -
>
> Key: SPARK-33742
> URL: https://issues.apache.org/jira/browse/SPARK-33742
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> HiveExternalCatalog.createPartitions throws AlreadyExistsException wrapped by 
> AnalysisException. The behavior deviates from V1/V2 in-memory catalogs that 
> throw PartitionsAlreadyExistException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33542) Group exception messages in catalyst/catalog

2020-12-11 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247818#comment-17247818
 ] 

jiaan.geng commented on SPARK-33542:


I'm working on it.

> Group exception messages in catalyst/catalog
> 
>
> Key: SPARK-33542
> URL: https://issues.apache.org/jira/browse/SPARK-33542
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Group all exception messages in 
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog.
> ||Filename||Count||
> |ExternalCatalog.scala|4|
> |GlobalTempViewManager.scala|1|
> |InMemoryCatalog.scala|18|
> |SessionCatalog.scala|17|
> |functionResources.scala|1|
> |interface.scala|4|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-12-11 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-28367:
--
Fix Version/s: 3.1.0

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Critical
> Fix For: 3.1.0
>
>
> Spark uses an old and deprecated API named poll(long) which never returns and 
> stays in live lock if metadata is not updated (for instance when broker 
> disappears at consumer creation).
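
A minimal consumer-side sketch of the difference being described, using the plain Kafka client API rather than the Spark connector internals; broker and topic names are placeholders:
{code:scala}
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("group.id", "metadata-timeout-demo")
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("my-topic"))

// Deprecated: poll(long) can block forever while waiting for metadata,
// e.g. when the broker disappears at consumer creation.
// val records = consumer.poll(10000L)

// poll(Duration) bounds the whole call, metadata updates included, so the
// caller gets control back instead of live-locking.
val records = consumer.poll(Duration.ofSeconds(10))
{code}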



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-12-11 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-28367.
-

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> Spark uses an old and deprecated API named poll(long) which never returns and 
> stays in live lock if metadata is not updated (for instance when broker 
> disappears at consumer creation).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-12-11 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-28367.
---
Resolution: Fixed

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> Spark uses an old and deprecated API named poll(long) which never returns and 
> stays in live lock if metadata is not updated (for instance when broker 
> disappears at consumer creation).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-12-11 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247813#comment-17247813
 ] 

Gabor Somogyi commented on SPARK-28367:
---

The issue is solved in the subtasks, so closing this.

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> Spark uses an old and deprecated API named poll(long) which never returns and 
> stays in live lock if metadata is not updated (for instance when broker 
> disappears at consumer creation).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33756) BytesToBytesMap's iterator hasNext method should be idempotent.

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33756:


Assignee: Apache Spark

> BytesToBytesMap's iterator hasNext method should be idempotent.
> ---
>
> Key: SPARK-33756
> URL: https://issues.apache.org/jira/browse/SPARK-33756
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Assignee: Apache Spark
>Priority: Minor
>
> BytesToBytesMap's MapIterator's hasNext method is not idempotent:
> {code:java}
> public boolean hasNext() {
>   if (numRecords == 0) {
>     if (reader != null) {
>       // if called multiple times, it will throw NoSuchElementException
>       handleFailedDelete();
>     }
>   }
>   return numRecords > 0;
> }
> {code}
> Multiple calls to this `hasNext` method will call `handleFailedDelete()` 
> multiple times, which will throw NoSuchElementException because the list of 
> spill writers is already empty.
>  
> We observed this issue in one of our production jobs after upgrading to 
> Spark 3.0.
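
An illustrative sketch of the idempotency idea in isolation (assumptions: plain Scala, not the actual BytesToBytesMap patch; `cleanupSpills` stands in for `handleFailedDelete()`):
{code:scala}
// Illustrative only: guard the cleanup with a flag so that calling hasNext
// repeatedly after exhaustion runs it at most once instead of blowing up.
class SpillAwareIterator[T](underlying: Iterator[T], cleanupSpills: () => Unit)
    extends Iterator[T] {

  private var cleaned = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !cleaned) {
      cleanupSpills()   // executed once, so repeated hasNext calls stay safe
      cleaned = true
    }
    more
  }

  override def next(): T = underlying.next()
}
{code}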



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33756) BytesToBytesMap's iterator hasNext method should be idempotent.

2020-12-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247802#comment-17247802
 ] 

Apache Spark commented on SPARK-33756:
--

User 'advancedxy' has created a pull request for this issue:
https://github.com/apache/spark/pull/30728

> BytesToBytesMap's iterator hasNext method should be idempotent.
> ---
>
> Key: SPARK-33756
> URL: https://issues.apache.org/jira/browse/SPARK-33756
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Priority: Minor
>
> BytesToBytesMap's MapIterator's hasNext method is not idempotent:
> {code:java}
> public boolean hasNext() {
>   if (numRecords == 0) {
>     if (reader != null) {
>       // if called multiple times, it will throw NoSuchElementException
>       handleFailedDelete();
>     }
>   }
>   return numRecords > 0;
> }
> {code}
> Multiple calls to this `hasNext` method will call `handleFailedDelete()` 
> multiple times, which will throw NoSuchElementException because the list of 
> spill writers is already empty.
>  
> We observed this issue in one of our production jobs after upgrading to 
> Spark 3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33756) BytesToBytesMap's iterator hasNext method should be idempotent.

2020-12-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33756:


Assignee: (was: Apache Spark)

> BytesToBytesMap's iterator hasNext method should be idempotent.
> ---
>
> Key: SPARK-33756
> URL: https://issues.apache.org/jira/browse/SPARK-33756
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Priority: Minor
>
> BytesToBytesMap's MapIterator's hasNext method is not idempotent:
> {code:java}
> public boolean hasNext() {
>   if (numRecords == 0) {
>     if (reader != null) {
>       // if called multiple times, it will throw NoSuchElementException
>       handleFailedDelete();
>     }
>   }
>   return numRecords > 0;
> }
> {code}
> Multiple calls to this `hasNext` method will call `handleFailedDelete()` 
> multiple times, which will throw NoSuchElementException because the list of 
> spill writers is already empty.
>  
> We observed this issue in one of our production jobs after upgrading to 
> Spark 3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33754) Update kubernetes/integration-tests/README.md to follow the default Hadoop profile updated

2020-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33754.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30726
[https://github.com/apache/spark/pull/30726]

> Update kubernetes/integration-tests/README.md to follow the default Hadoop 
> profile updated
> --
>
> Key: SPARK-33754
> URL: https://issues.apache.org/jira/browse/SPARK-33754
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> kubernetes/integration-tests/README.md describes how to run the integration 
> tests for Kubernetes as follows.
> {code}
> To run tests with Hadoop 3.2 instead of Hadoop 2.7, use `--hadoop-profile`.
> ./dev/dev-run-integration-tests.sh --hadoop-profile hadoop-2.7
> {code}
> In the current master, the default Hadoop profile is hadoop-3.2 so it's 
> better to update the document.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33755) Allow creating orc table when row format separator is defined

2020-12-11 Thread xiepengjie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiepengjie updated SPARK-33755:
---
Description: 
When creating a table like this:
{code:java}
create table test_orc(c1 string) row format delimited fields terminated by 
'002' stored as orcfile;
{code}
Spark throws an exception like:
{code:java}
Operation
  not allowed: ROW FORMAT DELIMITED is only compatible with 'textfile', not
  'orcfile'(line 2, pos 0)
{code}
I don't think we need such a strict rule; we can support it.

> Allow creating orc table when row format separator is defined
> -
>
> Key: SPARK-33755
> URL: https://issues.apache.org/jira/browse/SPARK-33755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: xiepengjie
>Priority: Major
>
> When creating a table like this:
> {code:java}
> create table test_orc(c1 string) row format delimited fields terminated by 
> '002' stored as orcfile;
> {code}
> Spark throws an exception like:
> {code:java}
> Operation
>   not allowed: ROW FORMAT DELIMITED is only compatible with 'textfile', not
>   'orcfile'(line 2, pos 0)
> {code}
> I don't think we need such a strict rule; we can support it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


