[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials

2020-04-23 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Description: 
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
  def createSparkUser(): UserGroupInformation = {
    val user = Utils.getCurrentUserName()
    logDebug("creating UGI for user: " + user)
    val ugi = UserGroupInformation.createRemoteUser(user)
    transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
    ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: UserGroupInformation): Unit = {
    dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
    Option(System.getenv("SPARK_USER"))
      .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as (see the sketch after this list):
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.
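A minimal sketch of how the Subject-held credentials could also be carried over (the transferAllCredentials / subjectOf helpers are made-up names for illustration, not an existing Spark or Hadoop API; doAs plus Subject.getSubject is one assumed way to reach a UGI's JAAS Subject):
{code:scala}
import java.security.{AccessController, PrivilegedAction}
import javax.security.auth.Subject
import org.apache.hadoop.security.UserGroupInformation

// Hypothetical helper: run inside the UGI's doAs so Subject.getSubject can see its JAAS Subject.
private def subjectOf(ugi: UserGroupInformation): Subject =
  ugi.doAs(new PrivilegedAction[Subject] {
    override def run(): Subject = Subject.getSubject(AccessController.getContext())
  })

// Hypothetical replacement for transferCredentials: keeps today's behaviour and
// additionally copies the credentials that live directly on the JAAS Subject.
def transferAllCredentials(source: UserGroupInformation, dest: UserGroupInformation): Unit = {
  // Hadoop credentials (secret keys and delegation tokens), as today.
  dest.addCredentials(source.getCredentials())
  // Non-Hadoop / 3rd-party credentials stored on the Subject, which are currently dropped.
  val src = subjectOf(source)
  val dst = subjectOf(dest)
  dst.getPrivateCredentials().addAll(src.getPrivateCredentials())
  dst.getPublicCredentials().addAll(src.getPublicCredentials())
}
{code}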

Another issue is that *SPARK_USER* only carries UserGroupInformation.getCurrentUser().getShortUserName(), which may lose the user's fully qualified user name. It would be better to use *getUserName* to get the fully qualified user name on the client side (as sketched below), which is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.
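A sketch of the suggested user-name change (keeping the SPARK_USER override as-is; an illustration, not the merged change):
{code:scala}
def getCurrentUserName(): String = {
  // SPARK_USER still wins; the fallback uses the fully qualified name
  // (getUserName) instead of the short name (getShortUserName).
  Option(System.getenv("SPARK_USER"))
    .getOrElse(UserGroupInformation.getCurrentUser().getUserName())
}
{code}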

Related to https://issues.apache.org/jira/browse/SPARK-1051

  was:
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *SPARK_USER* only carries the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side, which is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.

Related to https://issues.apache.org/jira/browse/SPARK-1051


> createSparkUser lost user's non-Hadoop credentials
> --
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/

[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name

2020-04-23 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Description: 
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *SPARK_USER* only carries the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side, which is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.

Related to https://issues.apache.org/jira/browse/SPARK-1051

  was:
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side, which is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.

Related to https://issues.apache.org/jira/browse/SPARK-1051


> createSparkUser lost user's non-Hadoop credentials and fully qualified user 
> name
> 
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> *[createSparkUser|htt

[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials

2020-04-23 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Summary: createSparkUser lost user's non-Hadoop credentials  (was: 
createSparkUser lost user's non-Hadoop credentials and fully qualified user 
name)

> createSparkUser lost user's non-Hadoop credentials
> --
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
> {code:java}
>    def createSparkUser(): UserGroupInformation = {
> val user = Utils.getCurrentUserName()
> logDebug("creating UGI for user: " + user)
> val ugi = UserGroupInformation.createRemoteUser(user)
> transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
> ugi
>   }
>   def transferCredentials(source: UserGroupInformation, dest: 
> UserGroupInformation): Unit = {
> dest.addCredentials(source.getCredentials())
>   }
>   def getCurrentUserName(): String = {
> Option(System.getenv("SPARK_USER"))
>   .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
>   }
> {code}
> The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
>  However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
>  # Non-Hadoop creds:
>  For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
>  # Newly supported or 3rd-party Hadoop creds:
>  For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.
> Another issue is that *SPARK_USER* only carries the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side, which is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.
> Related to https://issues.apache.org/jira/browse/SPARK-1051



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-23 Thread Shashanka Balakuntala Srinivasa (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091232#comment-17091232
 ] 

Shashanka Balakuntala Srinivasa commented on SPARK-31463:
-

Hi [~hyukjin.kwon], I will start looking into this. Thanks.

> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Steven Moy
>Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how 
> to improve json reading speed. We use Spark to process terabytes of JSON, so 
> we try to find ways to improve JSON parsing speed. 
>  
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>  
> [https://github.com/simdjson/simdjson/issues/93]
>  
> Anyone in the open-source community interested in leading this effort to 
> integrate simdjson into the Spark JSON data source API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091230#comment-17091230
 ] 

Hyukjin Kwon commented on SPARK-31463:
--

A separate source might be ideal. We can start it as a separate project and 
gradually move it into Apache Spark once it's proven very useful.

> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Steven Moy
>Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how 
> to improve json reading speed. We use Spark to process terabytes of JSON, so 
> we try to find ways to improve JSON parsing speed. 
>  
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>  
> [https://github.com/simdjson/simdjson/issues/93]
>  
> Anyone in the open-source community interested in leading this effort to 
> integrate simdjson into the Spark JSON data source API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31438) Support JobCleaned Status in SparkListener

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091225#comment-17091225
 ] 

Hyukjin Kwon commented on SPARK-31438:
--

PR https://github.com/apache/spark/pull/28280

> Support JobCleaned Status in SparkListener
> --
>
> Key: SPARK-31438
> URL: https://issues.apache.org/jira/browse/SPARK-31438
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Jackey Lee
>Priority: Major
>
> In Spark, we need a hook after a job is cleaned, for example to clean up Hive 
> external temporary paths. This has already been discussed in SPARK-31346 and 
> [GitHub Pull Request #28129|https://github.com/apache/spark/pull/28129].
>  The JobEnd status is not suitable for this: JobEnd marks job completion, so 
> once all results have been generated the job is finished. After that, the 
> scheduler leaves still-running tasks as zombie tasks and deletes abnormal 
> tasks asynchronously.
>  Thus, we add a JobCleaned status to let users hook in after all tasks of a 
> job have been cleaned. The JobCleaned status can be derived from the 
> TaskSetManagers, each of which corresponds to a stage; once all stages of the 
> job have been cleaned, the job is cleaned (see the sketch below).
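For illustration, the proposed hook might be consumed roughly as follows (SparkListenerJobCleaned and its fields are hypothetical names taken from the description above, not an existing Spark API; only SparkListener.onOtherEvent is real):
{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Hypothetical event: posted only after all tasks of the job's stages are cleaned,
// unlike SparkListenerJobEnd, which fires as soon as the job result is ready.
case class SparkListenerJobCleaned(jobId: Int, time: Long) extends SparkListenerEvent

class HiveTempPathCleaner extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case SparkListenerJobCleaned(jobId, _) =>
      // e.g. delete the Hive external temporary paths created for this job
      println(s"Job $jobId fully cleaned; removing temporary paths")
    case _ => // ignore other events
  }
}
{code}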



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31453) Error while converting JavaRDD to Dataframe

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31453.
--
Resolution: Duplicate

It duplicates SPARK-23862. See SPARK-21255 for the workaround.


> Error while converting JavaRDD to Dataframe
> ---
>
> Key: SPARK-31453
> URL: https://issues.apache.org/jira/browse/SPARK-31453
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Sachit Sharma
>Priority: Trivial
>
> Please refer to this: 
> [https://stackoverflow.com/questions/61172007/error-while-converting-javardd-to-dataframe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc

2020-04-23 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091220#comment-17091220
 ] 

Kent Yao commented on SPARK-31550:
--

GitHub webhook is temporarily down; PR in progress: 
[https://github.com/apache/spark/pull/28322]

> nondeterministic configurations with general meanings in sql configuration doc
> --
>
> Key: SPARK-31550
> URL: https://issues.apache.org/jira/browse/SPARK-31550
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> spark.sql.session.timeZone
> spark.sql.warehouse.dir
>  
> these 2 configs are nondeterministic and vary with environments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name

2020-04-23 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Description: 
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side, which is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.

Related to https://issues.apache.org/jira/browse/SPARK-1051

  was:
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side, which is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.


> createSparkUser lost user's non-Hadoop credentials and fully qualified user 
> name
> 
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10

[jira] [Commented] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc

2020-04-23 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091218#comment-17091218
 ] 

JinxinTang commented on SPARK-31550:


Try specifying the conf in spark-defaults.conf:

spark.sql.warehouse.dir /tmp
spark.sql.session.timeZone America/New_York

It does not seem to be a bug.
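For reference, the same two configs can also be set in code (a minimal sketch; the values are examples only). Note that spark.sql.warehouse.dir is a static conf that must be set before the session starts, while spark.sql.session.timeZone can be changed at runtime.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("conf-example")
  .config("spark.sql.warehouse.dir", "/tmp")   // static conf: set before getOrCreate()
  .getOrCreate()

// runtime conf: can be changed on an existing session
spark.conf.set("spark.sql.session.timeZone", "America/New_York")
{code}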

> nondeterministic configurations with general meanings in sql configuration doc
> --
>
> Key: SPARK-31550
> URL: https://issues.apache.org/jira/browse/SPARK-31550
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> spark.sql.session.timeZone
> spark.sql.warehouse.dir
>  
> these 2 configs are nondeterministic and vary with environments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name

2020-04-23 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Description: 
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side, which is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.

  was:
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side, which is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*.


> createSparkUser lost user's non-Hadoop credentials and fully qualified user 
> name
> 
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/

[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name

2020-04-23 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Description: 
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side. This is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*.

  was:
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side. This is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*.


> createSparkUser lost user's non-Hadoop credentials and fully qualified user 
> name
> 
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHad

[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name

2020-04-23 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Description: 
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side, which is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*.

  was:
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side. This is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*.


> createSparkUser lost user's non-Hadoop credentials and fully qualified user 
> name
> 
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkH

[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name

2020-04-23 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Description: 
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side. This is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*.

  was:
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:

 
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side. This is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*.


> createSparkUser lost user's non-Hadoop credentials and fully qualified user 
> name
> 
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkH

[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name

2020-04-23 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Description: 
See current 
*[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:

 
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, [Kafka creds|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side. This is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*.

  was:
See current *createSparkUser*:

[https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, Kafka creds.
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side. This is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*.


> createSparkUser lost user's non-Hadoop credentials and fully qualified user 
> name
> 
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
>  
> {code:java}
>    def createSparkUser(): UserGroupInformation = {
> val user = Utils.getCurrentUserName()
> logDebug("creating UGI for user:

[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name

2020-04-23 Thread Yuqi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated SPARK-31551:
--
Description: 
See current *createSparkUser*:

[https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]
{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}
The *transferCredentials* function can only transfer Hadoop creds such as delegation tokens.
 However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
 # Non-Hadoop creds:
 For example, Kafka creds.
 # Newly supported or 3rd-party Hadoop creds:
 For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials, which currently holds only Hadoop secret keys and delegation tokens.

Another issue is that *getCurrentUserName* only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). It would be better to use *getUserName* to get the fully qualified user name on the client side. This is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*.

  was:
Current createRemoteUser:

[https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]

{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}

The transferCredentials func can only transfer Hadoop creds such as delegation tokens.
However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
1. Non-Hadoop creds:
For example, Kafka creds:
https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395

2. Customized Hadoop creds:
For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials (currently only for Hadoop secret keys and delegation tokens).

Another issue is that getCurrentUserName only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). We should use getUserName to get the fully qualified user name on the client side.




> createSparkUser lost user's non-Hadoop credentials and fully qualified user 
> name
> 
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current *createSparkUser*:
> [https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]
> {code:java}
>    def createSparkUser(): UserGroupInformation = {
> val user = Utils.getCurrentUserName()
> logDebug("creating UGI for user: " + user)
> val ugi = UserGroupInformation.createRemoteUser(user)
> transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
> ugi
>   }
>   def transferCredentials(source: UserGroupInformation, dest: 
> UserGroupInformation): Unit = {
> dest.addCredentials(source.

[jira] [Created] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name

2020-04-23 Thread Yuqi Wang (Jira)
Yuqi Wang created SPARK-31551:
-

 Summary: createSparkUser lost user's non-Hadoop credentials and 
fully qualified user name
 Key: SPARK-31551
 URL: https://issues.apache.org/jira/browse/SPARK-31551
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.5, 2.4.4
Reporter: Yuqi Wang


Current createRemoteUser:

[https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]

{code:java}
   def createSparkUser(): UserGroupInformation = {
val user = Utils.getCurrentUserName()
logDebug("creating UGI for user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi
  }

  def transferCredentials(source: UserGroupInformation, dest: 
UserGroupInformation): Unit = {
dest.addCredentials(source.getCredentials())
  }

  def getCurrentUserName(): String = {
Option(System.getenv("SPARK_USER"))
  .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
{code}

The transferCredentials func can only transfer Hadoop creds such as delegation tokens.
However, other creds stored in UGI.subject.getPrivateCredentials will be lost here, such as:
1. Non-Hadoop creds:
For example, Kafka creds:
https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395

2. Customized Hadoop creds:
For example, to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials (currently only for Hadoop secret keys and delegation tokens).

Another issue is that getCurrentUserName only returns the user's getShortUserName, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka). We should use getUserName to get the fully qualified user name on the client side.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc

2020-04-23 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-31550:
-
Component/s: Documentation

> nondeterministic configurations with general meanings in sql configuration doc
> --
>
> Key: SPARK-31550
> URL: https://issues.apache.org/jira/browse/SPARK-31550
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> spark.sql.session.timeZone
> spark.sql.warehouse.dir
>  
> these 2 configs are nondeterministic and vary with environments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc

2020-04-23 Thread Kent Yao (Jira)
Kent Yao created SPARK-31550:


 Summary: nondeterministic configurations with general meanings in 
sql configuration doc
 Key: SPARK-31550
 URL: https://issues.apache.org/jira/browse/SPARK-31550
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


spark.sql.session.timeZone

spark.sql.warehouse.dir

 

these 2 configs are nondeterministic and vary with environments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-23 Thread Steven Moy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091205#comment-17091205
 ] 

Steven Moy commented on SPARK-31463:


Hi [~hyukjin.kwon], 

What's Spark's recommended path for introducing C code? I was following SQLite
and DuckDB; their approach is to inline the dependency (bring the code in, in
the case of a compatible license).

Or would it be better to support simdjson as a completely separate DataSourceV2
implementation?

simdjson's license is the Apache License as well:
[https://github.com/simdjson/simdjson/blob/master/LICENSE]

> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Steven Moy
>Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how 
> to improve json reading speed. We use Spark to process terabytes of JSON, so 
> we try to find ways to improve JSON parsing speed. 
>  
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>  
> [https://github.com/simdjson/simdjson/issues/93]
>  
> Anyone in the open source community interested in leading this effort to 
> integrate simdjson in the Spark JSON data source API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31461) RLIKE and LIKE expression compiles every time when it used

2020-04-23 Thread Linpx (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091201#comment-17091201
 ] 

Linpx commented on SPARK-31461:
---

I did search, but didn't find any.

> RLIKE and LIKE expression compiles every time when it used
> --
>
> Key: SPARK-31461
> URL: https://issues.apache.org/jira/browse/SPARK-31461
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Linpx
>Priority: Minor
>
> org.apache.spark.sql.catalyst.expressions
> regexpExpressions.scala
> line: 41
> {code:scala}
>    // try cache the pattern for Literal
>   private lazy val cache: Pattern = right match {
> case x @ Literal(value: String, StringType) => compile(value)
> case _ => null
>   }
> {code}
> StringType Literal value is UTF8String by default
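
For reference, the claim in the quoted description is easy to check from a Spark
shell: Literal wraps a Scala String in a UTF8String, so the {{value: String}}
pattern above never fires. A small sketch that only illustrates the mismatch
(not the eventual fix):
{code:scala}
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

val lit = Literal("abc")   // value is UTF8String("abc"), dataType is StringType

val asString = lit match {
  case Literal(v: String, StringType) => Some(v)              // never matches
  case _ => None
}
val asUtf8 = lit match {
  case Literal(v: UTF8String, StringType) => Some(v.toString) // matches
  case _ => None
}

println(asString)  // None
println(asUtf8)    // Some(abc)
{code}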



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31461) RLIKE and LIKE expression compiles every time when it used

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31461.
--
Resolution: Duplicate

> RLIKE and LIKE expression compiles every time when it used
> --
>
> Key: SPARK-31461
> URL: https://issues.apache.org/jira/browse/SPARK-31461
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Linpx
>Priority: Minor
>
> org.apache.spark.sql.catalyst.expressions
> regexpExpressions.scala
> line: 41
> {code:scala}
>    // try cache the pattern for Literal
>   private lazy val cache: Pattern = right match {
> case x @ Literal(value: String, StringType) => compile(value)
> case _ => null
>   }
> {code}
> StringType Literal value is UTF8String by default



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31461) RLIKE and LIKE expression compiles every time when it used

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091200#comment-17091200
 ] 

Hyukjin Kwon commented on SPARK-31461:
--

Please search existing jiras first before filing it.

> RLIKE and LIKE expression compiles every time when it used
> --
>
> Key: SPARK-31461
> URL: https://issues.apache.org/jira/browse/SPARK-31461
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Linpx
>Priority: Minor
>
> org.apache.spark.sql.catalyst.expressions
> regexpExpressions.scala
> line: 41
> {code:scala}
>    // try cache the pattern for Literal
>   private lazy val cache: Pattern = right match {
> case x @ Literal(value: String, StringType) => compile(value)
> case _ => null
>   }
> {code}
> StringType Literal value is UTF8String by default



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091199#comment-17091199
 ] 

Hyukjin Kwon commented on SPARK-31463:
--

So it's about vectorization, right? I think [~maxgekk] talked about 
vectorization somewhere.
My biggest concern is whether it's right to bring the C library into Spark as a 
dependency or not.

> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Steven Moy
>Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how 
> to improve json reading speed. We use Spark to process terabytes of JSON, so 
> we try to find ways to improve JSON parsing speed. 
>  
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>  
> [https://github.com/simdjson/simdjson/issues/93]
>  
> Anyone in the open source community interested in leading this effort to 
> integrate simdjson in the Spark JSON data source API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31463:
-
Component/s: (was: Spark Core)
 SQL

> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Steven Moy
>Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how 
> to improve json reading speed. We use Spark to process terabytes of JSON, so 
> we try to find ways to improve JSON parsing speed. 
>  
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>  
> [https://github.com/simdjson/simdjson/issues/93]
>  
> Anyone in the open source community interested in leading this effort to 
> integrate simdjson in the Spark JSON data source API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31466) String/Int to VarcharType cast not supported in Spark

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091196#comment-17091196
 ] 

Hyukjin Kwon commented on SPARK-31466:
--

Can you read the doc for the class?

{quote}
 * Hive varchar type. Similar to other HiveStringType's, these datatypes should 
only used for
 * parsing, and should NOT be used anywhere else. Any instance of these data 
types should be
 * replaced by a [[StringType]] before analysis.
{quote}

It's not supposed to be used as an API.
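
For completeness, a small sketch of the usual workaround, assuming the goal is
simply to keep the column usable in Spark; varchar(n) length enforcement, if
needed, belongs in the downstream system's DDL rather than in a Dataset cast:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("varchar-workaround")
  .getOrCreate()
import spark.implicits._

var df = Seq(("abc", 1), ("de", 2)).toDF("name", "id")

// Keep the column as StringType on the Spark side; cast(VarcharType(n)) is
// rejected by the analyzer because VarcharType only exists for DDL parsing.
val colList = Seq("name")
for (c <- colList) {
  df = df.withColumn(c, col(c).cast(StringType))
}
df.printSchema()
{code}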

> String/Int  to VarcharType  cast not supported in Spark
> ---
>
> Key: SPARK-31466
> URL: https://issues.apache.org/jira/browse/SPARK-31466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Gourav Choubey
>Priority: Major
>
> While casting a string column to varchar it does not do the casting at all 
> and column remains string.
>  
>  I tried to achieve it through VarcharType as below but it errors :
> for(i<-colList)
> { Df=Df.withColumn(i,Df(i).cast(VarcharType(1000))) }
> *org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`ColName AS 
> VARCHAR(1000))' due to data type mismatch: cannot cast string to 
> varchar(1000);;*
>  
> *Also, I tried through selectExpr cast option but no success.*
>  
> While trying to create an empty dataframe  with VarcharType also throws an 
> error
> *scala> var empty_df = spark.createDataFrame(sc.emptyRDD[Row], schema_rdd)*
> *scala.MatchError: VarcharType(1000) (of class 
> org.apache.spark.sql.types.VarcharType)*
>  *at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.org$apache$spark$sql$catalyst$enco*
>  
> Please suggest a way to cast a string to varchar in Spark, as reading the 
> column as a string in the SAS application has performance implications.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31466) String/Int to VarcharType cast not supported in Spark

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31466.
--
Resolution: Invalid

> String/Int  to VarcharType  cast not supported in Spark
> ---
>
> Key: SPARK-31466
> URL: https://issues.apache.org/jira/browse/SPARK-31466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Gourav Choubey
>Priority: Major
>
> While casting a string column to varchar it does not do the casting at all 
> and column remains string.
>  
>  I tried to achieve it through VarcharType as below but it errors :
> for(i<-colList)
> { Df=Df.withColumn(i,Df(i).cast(VarcharType(1000))) }
> *org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`ColName AS 
> VARCHAR(1000))' due to data type mismatch: cannot cast string to 
> varchar(1000);;*
>  
> *Also, I tried through selectExpr cast option but no success.*
>  
> While trying to create an empty dataframe  with VarcharType also throws an 
> error
> *scala> var empty_df = spark.createDataFrame(sc.emptyRDD[Row], schema_rdd)*
> *scala.MatchError: VarcharType(1000) (of class 
> org.apache.spark.sql.types.VarcharType)*
>  *at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.org$apache$spark$sql$catalyst$enco*
>  
> Please suggest a way to cast a string to varchar in Spark, as reading the 
> column as a string in the SAS application has performance implications.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31479) Numbers with thousands separator or locale specific decimal separator not parsed correctly

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31479.
--
Resolution: Duplicate

> Numbers with thousands separator or locale specific decimal separator not 
> parsed correctly
> --
>
> Key: SPARK-31479
> URL: https://issues.apache.org/jira/browse/SPARK-31479
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Ranjit Iyer
>Priority: Major
>
> CSV files that contain numbers with thousands separator (or locale specific 
> decimal separators) are not parsed correctly and are reported as {{null.}}
> [https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html]
> A user in France might expect "10,100" to be parsed as a float while a user 
> in the US might want Spark to interpret it as an Integer value (10100). 
> UnivocityParser is not locale aware and must use NumberFormatter to parse 
> string values to Numbers. 
> *US Locale*
> {{scala>Source.fromFile("/Users/ranjit.iyer/work/data/us.csv").getLines.mkString("\n")}}
>  {{res28: String =}}
>  {{"Value"}}
>  {{"10,000"}}
>  {{"20,000"}}
> {{scala> Locale.setDefault(Locale.US)}}
> {{scala> val _schema = StructType(StructField("Value", IntegerType, true) :: 
> Nil)}}
> {{scala> val df = spark.read.format("csv").option("header", 
> "true").schema(_schema).load("/Users/ranjit.iyer/work/data/us.csv")}}
>  {{df: org.apache.spark.sql.DataFrame = [Value: int]}}
> {{scala> df.show}}
>  {{+-----+}}
>  {{|Value|}}
>  {{+-----+}}
>  {{| null|}}
>  {{| null|}}
>  {{+-----+}}
> *French Locale* 
> {{scala> 
> Source.fromFile("/Users/ranjit.iyer/work/data/fr.csv").getLines.mkString("\n")}}
>  {{res43: String =}}
>  {{"Value"}}
>  {{"10,123"}}
>  {{"20,456"}}
> {{scala> Locale.setDefault(Locale.FRANCE)}}
> {{scala> val _schema = StructType(StructField("Value", FloatType, true) :: 
> Nil)}}
> {{scala> val df = spark.read.format("csv").option("header", 
> "true").schema(_schema).load("/Users/ranjit.iyer/work/data/fr.csv")}}
>  {{df: org.apache.spark.sql.DataFrame = [Value: float]}}
> {{scala> df.show}}
>  {{+-----+}}
>  {{|Value|}}
>  {{+-----+}}
>  {{| null|}}
>  {{| null|}}
>  {{+-----+}}
> The fix is to use a NumberFormatter and I have it working locally and will 
> raise a PR for review.
> {{NumberFormat.getInstance.parse(_).intValue()}} 
> Thousands separators are quite commonly found on the internet. My workflow has 
> been to copy to Excel, export to csv and analyze in Spark.
> [https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31479) Numbers with thousands separator or locale specific decimal separator not parsed correctly

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091193#comment-17091193
 ] 

Hyukjin Kwon commented on SPARK-31479:
--

Use the locale option; see SPARK-25945.
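
For reference, a minimal sketch of that suggestion. The file paths are
placeholders, and my understanding is that the locale-aware number parsing is
applied to decimal (and date/time) columns, so the schemas below use DECIMAL
rather than INT/FLOAT:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("csv-locale")
  .getOrCreate()

// The locale option is the one added by SPARK-25945.
val dfUs = spark.read
  .option("header", "true")
  .option("locale", "en-US")
  .schema("Value DECIMAL(12,3)")
  .csv("/path/to/us.csv")

val dfFr = spark.read
  .option("header", "true")
  .option("locale", "fr-FR")
  .schema("Value DECIMAL(12,3)")
  .csv("/path/to/fr.csv")

dfUs.show()
dfFr.show()
{code}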

> Numbers with thousands separator or locale specific decimal separator not 
> parsed correctly
> --
>
> Key: SPARK-31479
> URL: https://issues.apache.org/jira/browse/SPARK-31479
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Ranjit Iyer
>Priority: Major
>
> CSV files that contain numbers with thousands separator (or locale specific 
> decimal separators) are not parsed correctly and are reported as {{null.}}
> [https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html]
> A user in France might expect "10,100" to be parsed as a float while a user 
> in the US might want Spark to interpret it as an Integer value (10100). 
> UnivocityParser is not locale aware and must use NumberFormatter to parse 
> string values to Numbers. 
> *US Locale*
> {{scala>Source.fromFile("/Users/ranjit.iyer/work/data/us.csv").getLines.mkString("\n")}}
>  {{res28: String =}}
>  {{"Value"}}
>  {{"10,000"}}
>  {{"20,000"}}
> {{scala> Locale.setDefault(Locale.US)}}
> {{scala> val _schema = StructType(StructField("Value", IntegerType, true) :: 
> Nil)}}
> {{scala> val df = spark.read.format("csv").option("header", 
> "true").schema(_schema).load("/Users/ranjit.iyer/work/data/us.csv")}}
>  {{df: org.apache.spark.sql.DataFrame = [Value: int]}}
> {{scala> df.show}}
>  {{+-----+}}
>  {{|Value|}}
>  {{+-----+}}
>  {{| null|}}
>  {{| null|}}
>  {{+-----+}}
> *French Locale* 
> {{scala> 
> Source.fromFile("/Users/ranjit.iyer/work/data/fr.csv").getLines.mkString("\n")}}
>  {{res43: String =}}
>  {{"Value"}}
>  {{"10,123"}}
>  {{"20,456"}}
> {{scala> Locale.setDefault(Locale.FRANCE)}}
> {{scala> val _schema = StructType(StructField("Value", FloatType, true) :: 
> Nil)}}
> {{scala> val df = spark.read.format("csv").option("header", 
> "true").schema(_schema).load("/Users/ranjit.iyer/work/data/fr.csv")}}
>  {{df: org.apache.spark.sql.DataFrame = [Value: float]}}
> {{scala> df.show}}
>  {{+-----+}}
>  {{|Value|}}
>  {{+-----+}}
>  {{| null|}}
>  {{| null|}}
>  {{+-----+}}
> The fix is to use a NumberFormatter and I have it working locally and will 
> raise a PR for review.
> {{NumberFormat.getInstance.parse(_).intValue()}} 
> Thousands separators are quite commonly found on the internet. My workflow has 
> been to copy to Excel, export to csv and analyze in Spark.
> [https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31493) Optimize InSet to In according partition size at InSubqueryExec

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091190#comment-17091190
 ] 

Hyukjin Kwon commented on SPARK-31493:
--

PR: https://github.com/apache/spark/pull/28269

> Optimize InSet to In according partition size at InSubqueryExec
> ---
>
> Key: SPARK-31493
> URL: https://issues.apache.org/jira/browse/SPARK-31493
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31482) spark.kubernetes.driver.podTemplateFile Configuration not used by the job

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31482:
-
Priority: Major  (was: Blocker)

> spark.kubernetes.driver.podTemplateFile Configuration not used by the job
> -
>
> Key: SPARK-31482
> URL: https://issues.apache.org/jira/browse/SPARK-31482
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Pradeep Misra
>Priority: Major
>
> Spark 3.0 - Running Spark Submit as below and point to a MinKube cluster
> {code:java}
> bin/spark-submit \
>  --master k8s://https://192.168.99.102:8443 \
>  --deploy-mode cluster \
>  --name spark-pi \
>  --class org.apache.spark.examples.SparkPi \
>  --conf spark.kubernetes.driver.podTemplateFile=../driver_1E.template \
>  --conf spark.kubernetes.executor.podTemplateFile=../executor.template \
>  --conf spark.kubernetes.container.image=spark:spark3 \
>  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar 1
> {code}
>  
> Spark Binaries - spark-3.0.0-preview2-bin-hadoop2.7.tgz
> Driver Template - 
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   labels:
>     spark-app-id: my-custom-id
>   annotations:
>     spark-driver-cpu: 1
>     spark-driver-mem: 1
>     spark-executor-cpu: 1
>     spark-executor-mem: 1
>     spark-executor-count: 1
> spec:
>   schedulerName: spark-scheduler{code}
>  Executor Template
>  
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   labels:
>     spark-app-id: my-custom-id
> spec:
>   schedulerName: spark-scheduler{code}
> Kubernetes Pods Launched - Two executor pods were launched, which is the default
> {code:java}
> spark-pi-e608e7718f11cc69-driver   1/1     Running     0          10s
> spark-pi-e608e7718f11cc69-exec-1   1/1     Running     0          5s
> spark-pi-e608e7718f11cc69-exec-2   1/1     Running     0          5s{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31496) Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31496.
--
Resolution: Invalid

> Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError
> 
>
> Key: SPARK-31496
> URL: https://issues.apache.org/jira/browse/SPARK-31496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Windows 10 (1909)
> JDK 11.0.6
> spark-3.0.0-preview2-bin-hadoop3.2
> local[1]
>  
>  
>Reporter: Tomas Shestakov
>Priority: Major
>  Labels: out-of-memory
>
> Local Spark with one core (local[1]) causes an OOM while trying to save a 
> Dataset to a local parquet file.
> {code:java}
> SparkSession sparkSession = SparkSession.builder()
> .appName("Loader impl test")
> .master("local[1]")
> .config("spark.ui.enabled", false)
> .config("spark.sql.datetime.java8API.enabled", true)
> .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
> .config("spark.kryoserializer.buffer.max", "1g")
> .config("spark.executor.memory", "4g")
> .config("spark.driver.memory", "8g")
> .getOrCreate();
> {code}
> {noformat}
> [20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
> o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
> committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> [20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
> o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
> committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> [20-Apr-2020 11:42:27.967]  INFO [boundedElastic-2 
> o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output 
> Committer Algorithm version is 1
> [20-Apr-2020 11:42:27.969]  INFO [boundedElastic-2 
> o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> [20-Apr-2020 11:42:27.970]  INFO [boundedElastic-2 
> o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output 
> Committer Algorithm version is 1
> [20-Apr-2020 11:42:27.973]  INFO [boundedElastic-2 
> o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using output committer 
> class org.apache.parquet.hadoop.ParquetOutputCommitter
> [20-Apr-2020 11:42:34.371]  INFO [boundedElastic-2 
> org.apache.spark.SparkContext:57] q: - Starting job: save at 
> LoaderImpl.java:305
> [20-Apr-2020 11:42:34.389]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Got job 0 (save at 
> LoaderImpl.java:305) with 1 output partitions
> [20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Final stage: ResultStage 0 
> (save at LoaderImpl.java:305)
> [20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Parents of final stage: 
> List()
> [20-Apr-2020 11:42:34.392]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Missing parents: 
> List()[20-Apr-2020 11:42:34.398]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting ResultStage 0 
> (MapPartitionsRDD[6] at save at LoaderImpl.java:305), which has no missing 
> parents
> [20-Apr-2020 11:42:34.634]  INFO [dag-scheduler-event-loop 
> org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0 stored 
> as values in memory (estimated size 166.1 KiB, free 18.4 GiB)
> [20-Apr-2020 11:42:34.945]  INFO [dag-scheduler-event-loop 
> org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0_piece0 
> stored as bytes in memory (estimated size 58.0 KiB, free 18.4 GiB)
> [20-Apr-2020 11:42:34.949]  INFO [dispatcher-BlockManagerMaster 
> org.apache.spark.storage.BlockManagerInfo:57] q: - Added broadcast_0_piece0 
> in memory on DESKTOP-A1:58276 (size: 58.0 KiB, free: 18.4 GiB)
> [20-Apr-2020 11:42:34.953]  INFO [dag-scheduler-event-loop 
> org.apache.spark.SparkContext:57] q: - Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1206
> [20-Apr-2020 11:42:34.980]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting 1 missing tasks 
> from ResultStage 0 (MapPartitionsRDD[6] at save at LoaderImpl.java:305) 
> (first 15 tasks are for partitions Vector(0))
> [20-Apr-2020 11:42:34.981]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.TaskSchedulerImpl:57] q: - Adding task set 0.0 
> with 1 tasks
> Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError at 
> java.base/java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:125)
>  at 
> java.base/java.

[jira] [Commented] (SPARK-31496) Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091189#comment-17091189
 ] 

Hyukjin Kwon commented on SPARK-31496:
--

Is this a regression? It sounds more like a question, which would be best asked 
on the mailing list. You could get a better answer there.

> Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError
> 
>
> Key: SPARK-31496
> URL: https://issues.apache.org/jira/browse/SPARK-31496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Windows 10 (1909)
> JDK 11.0.6
> spark-3.0.0-preview2-bin-hadoop3.2
> local[1]
>  
>  
>Reporter: Tomas Shestakov
>Priority: Major
>  Labels: out-of-memory
>
> Local Spark with one core (local[1]) causes an OOM while trying to save a 
> Dataset to a local parquet file.
> {code:java}
> SparkSession sparkSession = SparkSession.builder()
> .appName("Loader impl test")
> .master("local[1]")
> .config("spark.ui.enabled", false)
> .config("spark.sql.datetime.java8API.enabled", true)
> .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
> .config("spark.kryoserializer.buffer.max", "1g")
> .config("spark.executor.memory", "4g")
> .config("spark.driver.memory", "8g")
> .getOrCreate();
> {code}
> {noformat}
> [20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
> o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
> committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> [20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
> o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
> committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> [20-Apr-2020 11:42:27.967]  INFO [boundedElastic-2 
> o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output 
> Committer Algorithm version is 1
> [20-Apr-2020 11:42:27.969]  INFO [boundedElastic-2 
> o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> [20-Apr-2020 11:42:27.970]  INFO [boundedElastic-2 
> o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output 
> Committer Algorithm version is 1
> [20-Apr-2020 11:42:27.973]  INFO [boundedElastic-2 
> o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using output committer 
> class org.apache.parquet.hadoop.ParquetOutputCommitter
> [20-Apr-2020 11:42:34.371]  INFO [boundedElastic-2 
> org.apache.spark.SparkContext:57] q: - Starting job: save at 
> LoaderImpl.java:305
> [20-Apr-2020 11:42:34.389]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Got job 0 (save at 
> LoaderImpl.java:305) with 1 output partitions
> [20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Final stage: ResultStage 0 
> (save at LoaderImpl.java:305)
> [20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Parents of final stage: 
> List()
> [20-Apr-2020 11:42:34.392]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Missing parents: 
> List()[20-Apr-2020 11:42:34.398]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting ResultStage 0 
> (MapPartitionsRDD[6] at save at LoaderImpl.java:305), which has no missing 
> parents
> [20-Apr-2020 11:42:34.634]  INFO [dag-scheduler-event-loop 
> org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0 stored 
> as values in memory (estimated size 166.1 KiB, free 18.4 GiB)
> [20-Apr-2020 11:42:34.945]  INFO [dag-scheduler-event-loop 
> org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0_piece0 
> stored as bytes in memory (estimated size 58.0 KiB, free 18.4 GiB)
> [20-Apr-2020 11:42:34.949]  INFO [dispatcher-BlockManagerMaster 
> org.apache.spark.storage.BlockManagerInfo:57] q: - Added broadcast_0_piece0 
> in memory on DESKTOP-A1:58276 (size: 58.0 KiB, free: 18.4 GiB)
> [20-Apr-2020 11:42:34.953]  INFO [dag-scheduler-event-loop 
> org.apache.spark.SparkContext:57] q: - Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1206
> [20-Apr-2020 11:42:34.980]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting 1 missing tasks 
> from ResultStage 0 (MapPartitionsRDD[6] at save at LoaderImpl.java:305) 
> (first 15 tasks are for partitions Vector(0))
> [20-Apr-2020 11:42:34.981]  INFO [dag-scheduler-event-loop 
> org.apache.spark.scheduler.TaskSchedulerImpl:57] q: - Adding task set 0.0 
> with 1 tasks
> Exception in thread "di

[jira] [Commented] (SPARK-31502) document identifier in SQL Reference

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091188#comment-17091188
 ] 

Hyukjin Kwon commented on SPARK-31502:
--

PR: https://github.com/apache/spark/pull/28277

> document identifier in SQL Reference
> 
>
> Key: SPARK-31502
> URL: https://issues.apache.org/jira/browse/SPARK-31502
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> document identifier in SQL Reference



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31510) Set setwd in R documentation build

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31510:
-
Fix Version/s: 3.0.0

> Set setwd in R documentation build
> --
>
> Key: SPARK-31510
> URL: https://issues.apache.org/jira/browse/SPARK-31510
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 3.0.0
>
>
> {code}
> > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd"))
> Loading required package: usethis
> Error: Could not find package root, is your working directory inside a 
> package?
> {code}
> Seems like in some environments it fails as above. 
> https://stackoverflow.com/questions/52670051/how-to-troubleshoot-error-could-not-find-package-root
> https://groups.google.com/forum/#!topic/rdevtools/79jjjdc_wjg



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31510) Set setwd in R documentation build

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31510.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28285

> Set setwd in R documentation build
> --
>
> Key: SPARK-31510
> URL: https://issues.apache.org/jira/browse/SPARK-31510
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
>
> {code}
> > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd"))
> Loading required package: usethis
> Error: Could not find package root, is your working directory inside a 
> package?
> {code}
> Seems like in some environments it fails as above. 
> https://stackoverflow.com/questions/52670051/how-to-troubleshoot-error-could-not-find-package-root
> https://groups.google.com/forum/#!topic/rdevtools/79jjjdc_wjg



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-31530:
--

> Spark submit fails if we provide extraJavaOption which contains Xmx  as 
> substring
> -
>
> Key: SPARK-31530
> URL: https://issues.apache.org/jira/browse/SPARK-31530
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mayank
>Priority: Major
>  Labels: 2.4.0, Spark, Submit
>
> Spark submit doesn't allow Xmx anywhere in the spark.driver.extraJavaOptions
>  For eg:
> {code:java}
> bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] 
> --conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" 
> examples\jars\spark-examples_2.11-2.4.4.jar
> Error: Not allowed to specify max heap(Xmx) memory settings through java 
> options (was -DmyKey=MyValueContainsXmx). Use the corresponding 
> --driver-memory or spark.driver.memory configuration instead.{code}
> https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102
> [https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302]
> Can we update the above condition to check more specific for eg -Xmx
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring

2020-04-23 Thread Mayank (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank updated SPARK-31530:
---
Description: 
Spark submit doesn't allow Xmx anywhere in the spark.driver.extraJavaOptions
 For eg:
{code:java}
bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] 
--conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" 
examples\jars\spark-examples_2.11-2.4.4.jar
Error: Not allowed to specify max heap(Xmx) memory settings through java 
options (was -DmyKey=MyValueContainsXmx). Use the corresponding --driver-memory 
or spark.driver.memory configuration instead.{code}
https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102

[https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302]

Can we update the above condition to be more specific, e.g. check only for a real -Xmx flag?
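
A rough sketch of what that stricter check could look like (written in Scala for
brevity; the actual check lives in the Java launcher code):
{code:scala}
// Only reject a real -Xmx flag (at the start of the string or preceded by
// whitespace), not option values that merely contain the substring "Xmx".
val xmxFlag = """(^|\s)-Xmx\S*""".r

def setsMaxHeap(javaOptions: String): Boolean =
  xmxFlag.findFirstIn(javaOptions).isDefined

assert(!setsMaxHeap("-DmyKey=MyValueContainsXmx"))  // currently rejected, would now pass
assert(setsMaxHeap("-Xmx4g -DsomethingElse=1"))     // a genuine max-heap setting
{code}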

 

  was:
Spark submit doesn't allow Xmx anywhere in the spark.driver.extraJavaOptions
 For eg:
{code:java}
bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] 
--conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" 
examples\jars\spark-examples_2.11-2.4.4.jar
Error: Not allowed to specify max heap(Xmx) memory settings through java 
options (was -DmyKey=MyValueContainsXmx). Use the corresponding --driver-memory 
or spark.driver.memory configuration instead.{code}
[https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102

https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302|http://example.com/]

Can we update the above condition to check more specific for eg -Xmx

 


> Spark submit fails if we provide extraJavaOption which contains Xmx  as 
> substring
> -
>
> Key: SPARK-31530
> URL: https://issues.apache.org/jira/browse/SPARK-31530
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mayank
>Priority: Major
>  Labels: 2.4.0, Spark, Submit
>
> Spark submit doesn't allow Xmx anywhere in the spark.driver.extraJavaOptions
>  For eg:
> {code:java}
> bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] 
> --conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" 
> examples\jars\spark-examples_2.11-2.4.4.jar
> Error: Not allowed to specify max heap(Xmx) memory settings through java 
> options (was -DmyKey=MyValueContainsXmx). Use the corresponding 
> --driver-memory or spark.driver.memory configuration instead.{code}
> https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102
> [https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302]
> Can we update the above condition to check more specific for eg -Xmx
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091184#comment-17091184
 ] 

Hyukjin Kwon commented on SPARK-31530:
--

Oh, gotcha.

> Spark submit fails if we provide extraJavaOption which contains Xmx  as 
> substring
> -
>
> Key: SPARK-31530
> URL: https://issues.apache.org/jira/browse/SPARK-31530
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mayank
>Priority: Major
>  Labels: 2.4.0, Spark, Submit
>
> Spark submit doesn't allow Xmx anywhere in the spark.driver.extraJavaOptions
>  For eg:
> {code:java}
> bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] 
> --conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" 
> examples\jars\spark-examples_2.11-2.4.4.jar
> Error: Not allowed to specify max heap(Xmx) memory settings through java 
> options (was -DmyKey=MyValueContainsXmx). Use the corresponding 
> --driver-memory or spark.driver.memory configuration instead.{code}
> https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102
> [https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302]
> Can we update the above condition to check more specific for eg -Xmx
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring

2020-04-23 Thread Mayank (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091182#comment-17091182
 ] 

Mayank commented on SPARK-31530:


[~hyukjin.kwon] 
If you check the description, I want to specify properties in extraJavaOptions 
that contain Xmx as a substring, e.g.
spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx
 

> Spark submit fails if we provide extraJavaOption which contains Xmx  as 
> substring
> -
>
> Key: SPARK-31530
> URL: https://issues.apache.org/jira/browse/SPARK-31530
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mayank
>Priority: Major
>  Labels: 2.4.0, Spark, Submit
>
> Spark submit doesn't allow Xmx anywhere in the spark.driver.extraJavaOptions
>  For eg:
> {code:java}
> bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] 
> --conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" 
> examples\jars\spark-examples_2.11-2.4.4.jar
> Error: Not allowed to specify max heap(Xmx) memory settings through java 
> options (was -DmyKey=MyValueContainsXmx). Use the corresponding 
> --driver-memory or spark.driver.memory configuration instead.{code}
> [https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102
> https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302|http://example.com/]
> Can we update the above condition to check more specific for eg -Xmx
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31514) Kerberos: Spark UGI credentials are not getting passed down to Hive

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31514.
--
Resolution: Invalid

> Kerberos: Spark UGI credentials are not getting passed down to Hive
> ---
>
> Key: SPARK-31514
> URL: https://issues.apache.org/jira/browse/SPARK-31514
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Sanchay Javeria
>Priority: Major
>
> I'm using Spark-2.4, I have a Kerberos enabled cluster where I'm trying to 
> run a query via the {{spark-sql}} shell.
> The simplified setup basically looks like this: spark-sql shell running on 
> one host in a Yarn cluster -> external hive-metastore running on one host -> S3 
> to store table data.
> When I launch the {{spark-sql}} shell with DEBUG logging enabled, this is 
> what I see in the logs:
> {code:java}
> > bin/spark-sql --proxy-user proxy_user 
> ...
> DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for 
> proxy_user against hive/_h...@realm.com at thrift://hive-metastore:9083 
> DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_h...@realm.com 
> (auth:KERBEROS) 
> from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130){code}
> This means that Spark made a call to fetch the delegation token from the Hive 
> metastore and then added it to the list of credentials for the UGI. [This is 
> the piece of 
> code|https://github.com/apache/spark/blob/branch-2.4/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L129]
>  that does that. I also verified in the metastore logs that the 
> {{get_delegation_token()}} call was being made.
> Now when I run a simple query like {{create table test_table (id int) 
> location "s3://some/prefix";}} I get hit with an AWS credentials error. I 
> modified the hive metastore code and added this right before the file system 
> in Hadoop is initialized 
> ([org/apache/hadoop/hive/metastore/Warehouse.java|#L116]):
> {code:java}
>  public static FileSystem getFs(Path f, Configuration conf) throws 
> MetaException {
> try {
>   UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
>   LOG.info("UGI information: " + ugi);
>   Collection<Token<? extends TokenIdentifier>> tokens = 
> ugi.getCredentials().getAllTokens();
>   for(Token token : tokens) {
> LOG.info(token);
>   }
> } catch (IOException e) {
>   e.printStackTrace();
> }
> ...
> {code}
> In the metastore logs, this does print the correct UGI information:
> {code:java}
> UGI information: proxy_user (auth:PROXY) via hive/hive-metast...@realm.com 
> (auth:KERBEROS){code}
> but there are no tokens present in the UGI. Looks like [Spark 
> code|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala#L101]
>  adds it with the alias {{hive.server2.delegation.token}} but I don't see it 
> in the UGI. This makes me suspect that somehow the UGI scope is isolated and 
> not being shared between spark-sql and hive metastore. How do I go about 
> solving this? Any help will be really appreciated!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31514) Kerberos: Spark UGI credentials are not getting passed down to Hive

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091181#comment-17091181
 ] 

Hyukjin Kwon commented on SPARK-31514:
--

It would be best to ask questions on the mailing list before filing an issue. 
I guess you could get a better answer there.

> Kerberos: Spark UGI credentials are not getting passed down to Hive
> ---
>
> Key: SPARK-31514
> URL: https://issues.apache.org/jira/browse/SPARK-31514
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Sanchay Javeria
>Priority: Major
>
> I'm using Spark-2.4, I have a Kerberos enabled cluster where I'm trying to 
> run a query via the {{spark-sql}} shell.
> The simplified setup basically looks like this: spark-sql shell running on 
> one host in a Yarn cluster -> external hive-metastore running on one host -> S3 
> to store table data.
> When I launch the {{spark-sql}} shell with DEBUG logging enabled, this is 
> what I see in the logs:
> {code:java}
> > bin/spark-sql --proxy-user proxy_user 
> ...
> DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for 
> proxy_user against hive/_h...@realm.com at thrift://hive-metastore:9083 
> DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_h...@realm.com 
> (auth:KERBEROS) 
> from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130){code}
> This means that Spark made a call to fetch the delegation token from the Hive 
> metastore and then added it to the list of credentials for the UGI. [This is 
> the piece of 
> code|https://github.com/apache/spark/blob/branch-2.4/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L129]
>  that does that. I also verified in the metastore logs that the 
> {{get_delegation_token()}} call was being made.
> Now when I run a simple query like {{create table test_table (id int) 
> location "s3://some/prefix";}} I get hit with an AWS credentials error. I 
> modified the hive metastore code and added this right before the file system 
> in Hadoop is initialized 
> ([org/apache/hadoop/hive/metastore/Warehouse.java|#L116]):
> {code:java}
>  public static FileSystem getFs(Path f, Configuration conf) throws 
> MetaException {
> try {
>   UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
>   LOG.info("UGI information: " + ugi);
>   Collection<Token<? extends TokenIdentifier>> tokens = 
> ugi.getCredentials().getAllTokens();
>   for(Token token : tokens) {
> LOG.info(token);
>   }
> } catch (IOException e) {
>   e.printStackTrace();
> }
> ...
> {code}
> In the metastore logs, this does print the correct UGI information:
> {code:java}
> UGI information: proxy_user (auth:PROXY) via hive/hive-metast...@realm.com 
> (auth:KERBEROS){code}
> but there are no tokens present in the UGI. Looks like [Spark 
> code|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala#L101]
>  adds it with the alias {{hive.server2.delegation.token}} but I don't see it 
> in the UGI. This makes me suspect that somehow the UGI scope is isolated and 
> not being shared between spark-sql and hive metastore. How do I go about 
> solving this? Any help will be really appreciated!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091179#comment-17091179
 ] 

Hyukjin Kwon commented on SPARK-31519:
--

PR: https://github.com/apache/spark/pull/28294

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+--+
> |  b|  fake|
> +---+--+
> |  2|2020-01-01|
> +---+--+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---++
> |  b|fake|
> +---++
> +---++
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts the next round and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31521) The fetch size is not correct when merging blocks into a merged block

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091178#comment-17091178
 ] 

Hyukjin Kwon commented on SPARK-31521:
--

PR: https://github.com/apache/spark/pull/28301

> The fetch size is not correct when merging blocks into a merged block
> -
>
> Key: SPARK-31521
> URL: https://issues.apache.org/jira/browse/SPARK-31521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> When merging blocks into a merged block, we should count the size of that 
> merged block as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31524) Add metric to the split number for skew partition when enable AQE

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091175#comment-17091175
 ] 

Hyukjin Kwon commented on SPARK-31524:
--

PR: https://github.com/apache/spark/pull/28109

> Add metric to the split  number for skew partition when enable AQE
> --
>
> Key: SPARK-31524
> URL: https://issues.apache.org/jira/browse/SPARK-31524
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> Add detailed metrics for the split number in skewed partitions when AQE and 
> skew join optimization are enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31523) LogicalPlan doCanonicalize should throw exception if not resolved

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091176#comment-17091176
 ] 

Hyukjin Kwon commented on SPARK-31523:
--

PR: https://github.com/apache/spark/pull/28304

> LogicalPlan doCanonicalize should throw exception if not resolved
> -
>
> Key: SPARK-31523
> URL: https://issues.apache.org/jira/browse/SPARK-31523
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31527) date add/subtract interval only allow those day precision in ansi mode

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091174#comment-17091174
 ] 

Hyukjin Kwon commented on SPARK-31527:
--

PR https://github.com/apache/spark/pull/28310

> date add/subtract interval only allow those day precision in ansi mode
> --
>
> Key: SPARK-31527
> URL: https://issues.apache.org/jira/browse/SPARK-31527
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Under ANSI mode, we should not allow date add/subtract with intervals that 
> carry hours, minutes, ... microseconds.
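> For illustration, a minimal sketch of the kind of expressions this affects; 
> whether the second statement raises an error depends on the ANSI-mode 
> behaviour proposed here:
> {code:java}
> // Day-precision interval: should remain allowed.
> spark.sql("SELECT DATE'2020-04-23' + INTERVAL 2 DAYS").show()
> // Sub-day fields: the kind of addition this ticket proposes to reject.
> spark.sql("SET spark.sql.ansi.enabled=true")
> spark.sql("SELECT DATE'2020-04-23' + INTERVAL 3 HOURS").show()
> {code}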



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31529) Remove extra whitespaces in the formatted explain

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091172#comment-17091172
 ] 

Hyukjin Kwon commented on SPARK-31529:
--

PR: https://github.com/apache/spark/pull/28315

> Remove extra whitespaces in the formatted explain
> -
>
> Key: SPARK-31529
> URL: https://issues.apache.org/jira/browse/SPARK-31529
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> The formatted explain output includes extra whitespace, and even the number of 
> spaces differs between master and branch-3.0, which leads to failing explain 
> tests if we backport to branch-3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31528) Remove millennium, century, decade from trunc/date_trunc functions

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091173#comment-17091173
 ] 

Hyukjin Kwon commented on SPARK-31528:
--

PR: https://github.com/apache/spark/pull/28313

> Remove millennium, century, decade from trunc/date_trunc functions
> ---
>
> Key: SPARK-31528
> URL: https://issues.apache.org/jira/browse/SPARK-31528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> As with SPARK-31507, millennium, century, and decade are not commonly 
> supported on most modern platforms.
> for example
> Negative:
> https://docs.snowflake.com/en/sql-reference/functions-date-time.html#supported-date-and-time-parts
> https://prestodb.io/docs/current/functions/datetime.html#date_trunc
> https://teradata.github.io/presto/docs/148t/functions/datetime.html#date_trunc
> https://www.oracletutorial.com/oracle-date-functions/oracle-trunc/
> Positive:
> https://docs.aws.amazon.com/redshift/latest/dg/r_Dateparts_for_datetime_functions.html
> https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC
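> For reference, a small sketch of the units proposed for removal, assuming the 
> pre-change behaviour where they are still accepted:
> {code:java}
> // Assuming pre-change behaviour; after this ticket these units would be rejected.
> spark.sql("SELECT date_trunc('MILLENNIUM', TIMESTAMP'2001-04-23 00:00:00')").show()
> spark.sql("SELECT date_trunc('CENTURY', TIMESTAMP'2001-04-23 00:00:00')").show()
> spark.sql("SELECT trunc(DATE'2001-04-23', 'DECADE')").show()
> {code}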



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31530.
--
Resolution: Won't Fix

> Spark submit fails if we provide extraJavaOption which contains Xmx  as 
> substring
> -
>
> Key: SPARK-31530
> URL: https://issues.apache.org/jira/browse/SPARK-31530
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mayank
>Priority: Major
>  Labels: 2.4.0, Spark, Submit
>
> Spark submit doesn't allow Xmx anywhere in spark.driver.extraJavaOptions.
>  For example:
> {code:java}
> bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] 
> --conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" 
> examples\jars\spark-examples_2.11-2.4.4.jar
> Error: Not allowed to specify max heap(Xmx) memory settings through java 
> options (was -DmyKey=MyValueContainsXmx). Use the corresponding 
> --driver-memory or spark.driver.memory configuration instead.{code}
> [https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102
> https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302|http://example.com/]
> Can we update the above condition to check more specifically, e.g. for -Xmx?
>  
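> A minimal sketch (not the launcher's actual code) of the stricter check being 
> suggested: reject only options that really set max heap, i.e. tokens starting 
> with -Xmx, instead of any value containing Xmx as a substring:
> {code:java}
> // Sketch only: token-level check instead of a raw substring match.
> def setsMaxHeap(javaOptions: String): Boolean =
>   javaOptions.split("\\s+").exists(_.startsWith("-Xmx"))
>
> setsMaxHeap("-DmyKey=MyValueContainsXmx")  // false: should be allowed
> setsMaxHeap("-Xmx4g -DmyKey=MyValue")      // true: should still be rejected
> {code}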



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091171#comment-17091171
 ] 

Hyukjin Kwon commented on SPARK-31530:
--

Why don't you follow the guide and use {{spark.driver.memory}}?

> Spark submit fails if we provide extraJavaOption which contains Xmx  as 
> substring
> -
>
> Key: SPARK-31530
> URL: https://issues.apache.org/jira/browse/SPARK-31530
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mayank
>Priority: Major
>  Labels: 2.4.0, Spark, Submit
>
> Spark submit doesn't allow Xmx anywhere in spark.driver.extraJavaOptions.
>  For example:
> {code:java}
> bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] 
> --conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" 
> examples\jars\spark-examples_2.11-2.4.4.jar
> Error: Not allowed to specify max heap(Xmx) memory settings through java 
> options (was -DmyKey=MyValueContainsXmx). Use the corresponding 
> --driver-memory or spark.driver.memory configuration instead.{code}
> [https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102
> https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302|http://example.com/]
> Can we update the above condition to check more specifically, e.g. for -Xmx?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31531) sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner() method not found during spark-submit

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31531.
--
Resolution: Duplicate

> sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner() method not found during 
> spark-submit
> ---
>
> Key: SPARK-31531
> URL: https://issues.apache.org/jira/browse/SPARK-31531
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.5
>Reporter: shayoni Halder
>Priority: Major
> Attachments: error.PNG
>
>
> I am trying to run the following Spark submit from a VM using Yarn cluster 
> mode.
>  ./spark-submit --master yarn --deploy-mode client test_spark_yarn.py
> The VM has Java 11 and Spark 2.4.5, while the YARN cluster has Java 8 and 
> Spark 2.4.0. I am getting the error below:
> !error.PNG!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31531) sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner() method not found during spark-submit

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091170#comment-17091170
 ] 

Hyukjin Kwon commented on SPARK-31531:
--

Spark does not support Java 11 in Spark 2.x. It will be supported in Spark 3.

> sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner() method not found during 
> spark-submit
> ---
>
> Key: SPARK-31531
> URL: https://issues.apache.org/jira/browse/SPARK-31531
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.5
>Reporter: shayoni Halder
>Priority: Major
> Attachments: error.PNG
>
>
> I am trying to run the following Spark submit from a VM using Yarn cluster 
> mode.
>  ./spark-submit --master yarn --deploy-mode client test_spark_yarn.py
> The VM has Java 11 and Spark 2.4.5, while the YARN cluster has Java 8 and 
> Spark 2.4.0. I am getting the error below:
> !error.PNG!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31532) SparkSessionBuilder should not propagate static sql configurations to the existing active/default SparkSession

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091169#comment-17091169
 ] 

Hyukjin Kwon commented on SPARK-31532:
--

PR is in progress at https://github.com/apache/spark/pull/28316

> SparkSessionBuilder should not propagate static sql configurations to the 
> existing active/default SparkSession
> -
>
> Key: SPARK-31532
> URL: https://issues.apache.org/jira/browse/SPARK-31532
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Clearly, this is a bug.
> {code:java}
> scala> spark.sql("set spark.sql.warehouse.dir").show
> +++
> | key|   value|
> +++
> |spark.sql.warehou...|file:/Users/kenty...|
> +++
> scala> spark.sql("set spark.sql.warehouse.dir=2");
> org.apache.spark.sql.AnalysisException: Cannot modify the value of a static 
> config: spark.sql.warehouse.dir;
>   at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154)
>   at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
>   ... 47 elided
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get
> getClass   getOrCreate
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", 
> "xyz").getOrCreate
> 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; 
> some configuration may not take effect.
> res7: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@6403d574
> scala> spark.sql("set spark.sql.warehouse.dir").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.warehou...|  xyz|
> ++-+
> scala>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31544) Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint

2020-04-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091163#comment-17091163
 ] 

Dongjoon Hyun commented on SPARK-31544:
---

This is resolved via https://github.com/apache/spark/pull/28320 . 

> Backport SPARK-30199   Recover `spark.(ui|blockManager).port` from 
> checkpoint
> -
>
> Key: SPARK-31544
> URL: https://issues.apache.org/jira/browse/SPARK-31544
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.6
>
>
> Backport SPARK-30199       Recover `spark.(ui|blockManager).port` from 
> checkpoint
> cc [~dongjoon] in case you think this is a good candidate



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30199) Recover spark.ui.port and spark.blockManager.port from checkpoint

2020-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30199:
--
Issue Type: Bug  (was: Improvement)

> Recover spark.ui.port and spark.blockManager.port from checkpoint
> -
>
> Key: SPARK-30199
> URL: https://issues.apache.org/jira/browse/SPARK-30199
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Aaruna Godthi
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30199) Recover spark.ui.port and spark.blockManager.port from checkpoint

2020-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30199:
--
Fix Version/s: 2.4.6

> Recover spark.ui.port and spark.blockManager.port from checkpoint
> -
>
> Key: SPARK-30199
> URL: https://issues.apache.org/jira/browse/SPARK-30199
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Aaruna Godthi
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31544) Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint

2020-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31544.
---
Fix Version/s: 2.4.6
   Resolution: Fixed

> Backport SPARK-30199   Recover `spark.(ui|blockManager).port` from 
> checkpoint
> -
>
> Key: SPARK-31544
> URL: https://issues.apache.org/jira/browse/SPARK-31544
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.6
>
>
> Backport SPARK-30199       Recover `spark.(ui|blockManager).port` from 
> checkpoint
> cc [~dongjoon] in case you think this is a good candidate



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31532) SparkSessionBuilder should not propagate static sql configurations to the existing active/default SparkSession

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091158#comment-17091158
 ] 

Hyukjin Kwon commented on SPARK-31532:
--

The problem is that the static configuration was changed during runtime.

> SparkSessionBuilder should not propagate static sql configurations to the 
> existing active/default SparkSession
> -
>
> Key: SPARK-31532
> URL: https://issues.apache.org/jira/browse/SPARK-31532
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Clearly, this is a bug.
> {code:java}
> scala> spark.sql("set spark.sql.warehouse.dir").show
> +++
> | key|   value|
> +++
> |spark.sql.warehou...|file:/Users/kenty...|
> +++
> scala> spark.sql("set spark.sql.warehouse.dir=2");
> org.apache.spark.sql.AnalysisException: Cannot modify the value of a static 
> config: spark.sql.warehouse.dir;
>   at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154)
>   at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
>   ... 47 elided
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get
> getClass   getOrCreate
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", 
> "xyz").getOrCreate
> 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; 
> some configuration may not take effect.
> res7: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@6403d574
> scala> spark.sql("set spark.sql.warehouse.dir").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.warehou...|  xyz|
> ++-+
> scala>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31547) Upgrade Genjavadoc to 0.16

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31547:


Assignee: Dongjoon Hyun

> Upgrade Genjavadoc to 0.16
> --
>
> Key: SPARK-31547
> URL: https://issues.apache.org/jira/browse/SPARK-31547
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31547) Upgrade Genjavadoc to 0.16

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31547.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28321

> Upgrade Genjavadoc to 0.16
> --
>
> Key: SPARK-31547
> URL: https://issues.apache.org/jira/browse/SPARK-31547
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31545) Backport SPARK-27676 InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091152#comment-17091152
 ] 

Hyukjin Kwon commented on SPARK-31545:
--

I think no - it causes a behaviour change which can be pretty critical in SS 
cases (see the updated migration guide).

> Backport SPARK-27676   InMemoryFileIndex should respect 
> spark.sql.files.ignoreMissingFiles
> --
>
> Key: SPARK-31545
> URL: https://issues.apache.org/jira/browse/SPARK-31545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-27676       InMemoryFileIndex should respect 
> spark.sql.files.ignoreMissingFiles
> cc [~joshrosen] I think backporting this has been requested in the original 
> ticket - do you have any objections?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091150#comment-17091150
 ] 

Hyukjin Kwon commented on SPARK-31538:
--

Hm, I wonder why we should backport this. It was just test-only cleanup.
BTW, do we need to file a JIRA for each backport? I think you can just use 
the existing JIRA, backport it, and fix the Fix Version.

> Backport SPARK-25338   Ensure to call super.beforeAll() and 
> super.afterAll() in test cases
> --
>
> Key: SPARK-31538
> URL: https://issues.apache.org/jira/browse/SPARK-31538
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25338       Ensure to call super.beforeAll() and 
> super.afterAll() in test cases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31537) Backport SPARK-25559 Remove the unsupported predicates in Parquet when possible

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091149#comment-17091149
 ] 

Hyukjin Kwon commented on SPARK-31537:
--

I wouldn't port this back per the guidelines in our versioning policy 
(https://spark.apache.org/versioning-policy.html). Improvements are usually 
not backported.

> Backport SPARK-25559  Remove the unsupported predicates in Parquet when 
> possible
> 
>
> Key: SPARK-31537
> URL: https://issues.apache.org/jira/browse/SPARK-31537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.6
>
>
> Consider backporting SPARK-25559       Remove the unsupported predicates in 
> Parquet when possible to 2.4.6



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31537) Backport SPARK-25559 Remove the unsupported predicates in Parquet when possible

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31537:
-
Target Version/s: 2.4.6

> Backport SPARK-25559  Remove the unsupported predicates in Parquet when 
> possible
> 
>
> Key: SPARK-31537
> URL: https://issues.apache.org/jira/browse/SPARK-31537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Assignee: DB Tsai
>Priority: Major
>
> Consider backporting SPARK-25559       Remove the unsupported predicates in 
> Parquet when possible to 2.4.6



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31537) Backport SPARK-25559 Remove the unsupported predicates in Parquet when possible

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31537:
-
Fix Version/s: (was: 2.4.6)

> Backport SPARK-25559  Remove the unsupported predicates in Parquet when 
> possible
> 
>
> Key: SPARK-31537
> URL: https://issues.apache.org/jira/browse/SPARK-31537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Assignee: DB Tsai
>Priority: Major
>
> Consider backporting SPARK-25559       Remove the unsupported predicates in 
> Parquet when possible to 2.4.6



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31536) Backport SPARK-25407 Allow nested access for non-existent field for Parquet file when nested pruning is enabled

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091148#comment-17091148
 ] 

Hyukjin Kwon commented on SPARK-31536:
--

To backport this, we should port SPARK-31116 together. I tend to think we 
shouldn't backport this either, given that this feature is disabled by default 
in Spark 2.4 - the change also affects the code path when the option is disabled.

> Backport SPARK-25407   Allow nested access for non-existent field for 
> Parquet file when nested pruning is enabled
> -
>
> Key: SPARK-31536
> URL: https://issues.apache.org/jira/browse/SPARK-31536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Consider backporting SPARK-25407       Allow nested access for non-existent 
> field for Parquet file when nested pruning is enabled to 2.4.6



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31546) Backport SPARK-25595 Ignore corrupt Avro file if flag IGNORE_CORRUPT_FILES enabled

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091146#comment-17091146
 ] 

Hyukjin Kwon commented on SPARK-31546:
--

I think it's fine to port back.

> Backport SPARK-25595   Ignore corrupt Avro file if flag 
> IGNORE_CORRUPT_FILES enabled
> 
>
> Key: SPARK-31546
> URL: https://issues.apache.org/jira/browse/SPARK-31546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25595       Ignore corrupt Avro file if flag 
> IGNORE_CORRUPT_FILES enabled
> cc [~Gengliang.Wang] & [~hyukjin.kwon] for comments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30804) Measure and log elapsed time for "compact" operation in CompactibleFileStreamLog

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30804:


Assignee: Jungtaek Lim

> Measure and log elapsed time for "compact" operation in 
> CompactibleFileStreamLog
> 
>
> Key: SPARK-30804
> URL: https://issues.apache.org/jira/browse/SPARK-30804
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> "compact" operation in FileStreamSourceLog and FileStreamSinkLog is 
> introduced to solve "small files" problem, but introduced non-trivial latency 
> which is another headache in long run query.
> There're bunch of reports from community for the same issue (see SPARK-24295, 
> SPARK-29995, SPARK-30462) - before trying to solve the problem, it would be 
> better to measure the latency (elapsed time) and log to help indicating the 
> issue when the additional latency becomes concerns.
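> A generic sketch of the kind of measurement proposed (not the actual 
> CompactibleFileStreamLog code): time the compact step and log the elapsed time.
> {code:java}
> // Sketch: wrap the compaction work and report its latency.
> def timed[T](label: String)(body: => T): T = {
>   val start = System.nanoTime()
>   val result = body
>   val elapsedMs = (System.nanoTime() - start) / 1000000
>   println(s"$label took $elapsedMs ms")  // Spark itself would use logInfo here
>   result
> }
>
> val compacted = timed("compact") {
>   (1 to 1000000).toArray.sorted.take(10)  // stand-in for the real compaction work
> }
> {code}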



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30804) Measure and log elapsed time for "compact" operation in CompactibleFileStreamLog

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30804.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27557
[https://github.com/apache/spark/pull/27557]

> Measure and log elapsed time for "compact" operation in 
> CompactibleFileStreamLog
> 
>
> Key: SPARK-30804
> URL: https://issues.apache.org/jira/browse/SPARK-30804
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> "compact" operation in FileStreamSourceLog and FileStreamSinkLog is 
> introduced to solve "small files" problem, but introduced non-trivial latency 
> which is another headache in long run query.
> There're bunch of reports from community for the same issue (see SPARK-24295, 
> SPARK-29995, SPARK-30462) - before trying to solve the problem, it would be 
> better to measure the latency (elapsed time) and log to help indicating the 
> issue when the additional latency becomes concerns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly

2020-04-23 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-31549:
---
Issue Type: Bug  (was: Improvement)

> Pyspark SparkContext.cancelJobGroup do not work correctly
> -
>
> Key: SPARK-31549
> URL: https://issues.apache.org/jira/browse/SPARK-31549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Weichen Xu
>Priority: Critical
>
> PySpark SparkContext.cancelJobGroup does not work correctly. This issue has 
> existed for a long time. It happens because the PySpark thread is not pinned 
> to a JVM thread when invoking Java-side methods, so every PySpark API that 
> relies on Java thread-local variables does not work correctly (including 
> `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` and so on).
> This is a serious issue. There is an experimental PySpark 'PIN_THREAD' mode 
> added in Spark 3.0 which addresses it, but the 'PIN_THREAD' mode has two 
> issues:
> * It is disabled by default; we need to set an additional environment 
> variable to enable it.
> * There is a memory leak issue which hasn't been addressed.
> A series of projects like hyperopt-spark and spark-joblib rely on the 
> `sc.cancelJobGroup` API (they use it to stop running jobs in their code), so 
> it is critical to address this issue, and we hope it works under the default 
> PySpark mode. An optional approach is implementing methods like 
> `rdd.setGroupAndCollect`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly

2020-04-23 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-31549:
--

 Summary: Pyspark SparkContext.cancelJobGroup do not work correctly
 Key: SPARK-31549
 URL: https://issues.apache.org/jira/browse/SPARK-31549
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.5, 3.0.0
Reporter: Weichen Xu


PySpark SparkContext.cancelJobGroup does not work correctly. This issue has 
existed for a long time. It happens because the PySpark thread is not pinned 
to a JVM thread when invoking Java-side methods, so every PySpark API that 
relies on Java thread-local variables does not work correctly (including 
`sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` and so on).

This is a serious issue. There is an experimental PySpark 'PIN_THREAD' mode 
added in Spark 3.0 which addresses it, but the 'PIN_THREAD' mode has two issues:
* It is disabled by default; we need to set an additional environment variable 
to enable it.
* There is a memory leak issue which hasn't been addressed.

A series of projects like hyperopt-spark and spark-joblib rely on the 
`sc.cancelJobGroup` API (they use it to stop running jobs in their code), so it 
is critical to address this issue, and we hope it works under the default 
PySpark mode. An optional approach is implementing methods like 
`rdd.setGroupAndCollect`.
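
For context, the JVM-side API these libraries rely on is thread-local: the 
cancel call targets whatever group was set on the thread that submitted the 
jobs. A minimal Scala sketch of that contract (illustration only; the PySpark 
problem is precisely that the Python thread is not pinned to one JVM thread, so 
this association is lost):
{code:java}
import scala.concurrent.{ExecutionContext, Future}
implicit val ec: ExecutionContext = ExecutionContext.global

// The job group is a thread-local property of the thread that runs the action.
val running = Future {
  sc.setJobGroup("demo-group", "cancellable work", interruptOnCancel = true)
  sc.parallelize(1 to 100).map { i => Thread.sleep(1000); i }.count()
}

Thread.sleep(2000)               // let the job start
sc.cancelJobGroup("demo-group")  // cancels all jobs tagged with the group
{code}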





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31548) Refactor pyspark code for common methods in JavaParams and Pipeline/OneVsRest

2020-04-23 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-31548:
--

 Summary: Refactor pyspark code for common methods in JavaParams 
and Pipeline/OneVsRest
 Key: SPARK-31548
 URL: https://issues.apache.org/jira/browse/SPARK-31548
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Weichen Xu


Background: See discussion here
https://github.com/apache/spark/pull/28273#discussion_r411462216
and
https://github.com/apache/spark/pull/28279#discussion_r412699397




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31488) Support `java.time.LocalDate` in Parquet filter pushdown

2020-04-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31488.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28259
[https://github.com/apache/spark/pull/28259]

> Support `java.time.LocalDate` in Parquet filter pushdown
> 
>
> Key: SPARK-31488
> URL: https://issues.apache.org/jira/browse/SPARK-31488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, ParquetFilters supports only java.sql.Date values of DateType, and 
> explicitly casts Any to java.sql.Date, see
> https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176
> So, any filters that refer to date values are not pushed down to Parquet when 
> spark.sql.datetime.java8API.enabled is true.
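> For illustration, a sketch of the setup described above (the path is made 
> up); before the fix the date filter would not show up in PushedFilters when 
> the Java 8 time API is enabled:
> {code:java}
> // Sketch: write some dates to Parquet and inspect filter pushdown in the plan.
> spark.conf.set("spark.sql.datetime.java8API.enabled", "true")
> spark.range(3).selectExpr("date_add(DATE'2020-04-20', CAST(id AS INT)) AS d")
>   .write.mode("overwrite").parquet("/tmp/spark-31488-demo")
> spark.read.parquet("/tmp/spark-31488-demo")
>   .where("d > DATE'2020-04-21'")
>   .explain()  // check PushedFilters in the Parquet scan node
> {code}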



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31488) Support `java.time.LocalDate` in Parquet filter pushdown

2020-04-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31488:
---

Assignee: Maxim Gekk

> Support `java.time.LocalDate` in Parquet filter pushdown
> 
>
> Key: SPARK-31488
> URL: https://issues.apache.org/jira/browse/SPARK-31488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently, ParquetFilters supports only java.sql.Date values of DateType, and 
> explicitly casts Any to java.sql.Date, see
> https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176
> So, any filters that refer to date values are not pushed down to Parquet when 
> spark.sql.datetime.java8API.enabled is true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31526) Add a new test suite for ExpressionInfo

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31526:


Assignee: Takeshi Yamamuro

> Add a new test suite for ExpressionInfo
> ---
>
> Key: SPARK-31526
> URL: https://issues.apache.org/jira/browse/SPARK-31526
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31526) Add a new test suite for ExpressionInfo

2020-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31526.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28308
[https://github.com/apache/spark/pull/28308]

> Add a new test suite for ExpressionInfo
> ---
>
> Key: SPARK-31526
> URL: https://issues.apache.org/jira/browse/SPARK-31526
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2020-04-23 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091075#comment-17091075
 ] 

Jungtaek Lim commented on SPARK-26385:
--

The symptoms are mixed up - please clarify where the exception occurs (driver, 
AM, executor, somewhere else?), which mode you use, and which configuration 
you used to try to mitigate it.

Please file a new issue with the above information per case. Adding comments 
for different cases here might emphasize the importance of the issue, but it 
is not helpful for investigating it. Please also note that we need the driver 
/ AM / executor logs because we should check the interaction among them (how 
delegation tokens were passed).

> YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in 
> cache
> ---
>
> Key: SPARK-26385
> URL: https://issues.apache.org/jira/browse/SPARK-26385
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Hadoop 2.6.0, Spark 2.4.0
>Reporter: T M
>Priority: Major
>
>  
> Hello,
>  
> I have a Spark Structured Streaming job which is running on YARN (Hadoop 2.6.0, 
> Spark 2.4.0). After 25-26 hours, my job stops working with the following error:
> {code:java}
> 2018-12-16 22:35:17 ERROR 
> org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query 
> TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = 
> a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, 
> realUser=, issueDate=1544903057122, maxDate=1545507857122, 
> sequenceNumber=10314, masterKeyId=344) can't be found in cache at 
> org.apache.hadoop.ipc.Client.call(Client.java:1470) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1401) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>  at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
>  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at 
> org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
> org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at 
> org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActi

[jira] [Comment Edited] (SPARK-31532) SparkSessionBuilder should not propagate static sql configurations to the existing active/default SparkSession

2020-04-23 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091067#comment-17091067
 ] 

JinxinTang edited comment on SPARK-31532 at 4/24/20, 1:08 AM:
--

Thanks for your issue. The following configs may not be allowed to be modified 
after SparkSession startup, by design:

[spark.sql.codegen.comments, spark.sql.queryExecutionListeners, 
spark.sql.catalogImplementation, spark.sql.subquery.maxThreadThreshold, 
spark.sql.globalTempDatabase, spark.sql.codegen.cache.maxEntries, 
spark.sql.filesourceTableRelationCacheSize, 
spark.sql.streaming.streamingQueryListeners, spark.sql.ui.retainedExecutions, 
spark.sql.hive.thriftServer.singleSession, spark.sql.extensions, 
spark.sql.debug, spark.sql.sources.schemaStringLengthThreshold, 
spark.sql.warehouse.dir] 

So it might not be a bug.


was (Author: jinxintang):
Thanks for your issue, these followings not be modified after sparksession 
startup by design:

[spark.sql.codegen.comments, spark.sql.queryExecutionListeners, 
spark.sql.catalogImplementation, spark.sql.subquery.maxThreadThreshold, 
spark.sql.globalTempDatabase, spark.sql.codegen.cache.maxEntries, 
spark.sql.filesourceTableRelationCacheSize, 
spark.sql.streaming.streamingQueryListeners, spark.sql.ui.retainedExecutions, 
spark.sql.hive.thriftServer.singleSession, spark.sql.extensions, 
spark.sql.debug, spark.sql.sources.schemaStringLengthThreshold, 
spark.sql.warehouse.dir] 

> SparkSessionBuilder should not propagate static sql configurations to the 
> existing active/default SparkSession
> -
>
> Key: SPARK-31532
> URL: https://issues.apache.org/jira/browse/SPARK-31532
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Clearly, this is a bug.
> {code:java}
> scala> spark.sql("set spark.sql.warehouse.dir").show
> +++
> | key|   value|
> +++
> |spark.sql.warehou...|file:/Users/kenty...|
> +++
> scala> spark.sql("set spark.sql.warehouse.dir=2");
> org.apache.spark.sql.AnalysisException: Cannot modify the value of a static 
> config: spark.sql.warehouse.dir;
>   at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154)
>   at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
>   ... 47 elided
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get
> getClass   getOrCreate
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", 
> "xyz").getOrCreate
> 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; 
> some configuration may not take effect.
> res7: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@6403d574
> scala> spark.sql("set spark.sql

[jira] [Commented] (SPARK-31532) SparkSessionBuilder should not propagate static sql configurations to the existing active/default SparkSession

2020-04-23 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091067#comment-17091067
 ] 

JinxinTang commented on SPARK-31532:


Thanks for your issue. The following configs cannot be modified after 
SparkSession startup, by design:

[spark.sql.codegen.comments, spark.sql.queryExecutionListeners, 
spark.sql.catalogImplementation, spark.sql.subquery.maxThreadThreshold, 
spark.sql.globalTempDatabase, spark.sql.codegen.cache.maxEntries, 
spark.sql.filesourceTableRelationCacheSize, 
spark.sql.streaming.streamingQueryListeners, spark.sql.ui.retainedExecutions, 
spark.sql.hive.thriftServer.singleSession, spark.sql.extensions, 
spark.sql.debug, spark.sql.sources.schemaStringLengthThreshold, 
spark.sql.warehouse.dir] 

> SparkSessionBuilder should not propagate static sql configurations to the 
> existing active/default SparkSession
> -
>
> Key: SPARK-31532
> URL: https://issues.apache.org/jira/browse/SPARK-31532
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Clearly, this is a bug.
> {code:java}
> scala> spark.sql("set spark.sql.warehouse.dir").show
> +++
> | key|   value|
> +++
> |spark.sql.warehou...|file:/Users/kenty...|
> +++
> scala> spark.sql("set spark.sql.warehouse.dir=2");
> org.apache.spark.sql.AnalysisException: Cannot modify the value of a static 
> config: spark.sql.warehouse.dir;
>   at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154)
>   at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100)
>   at 
> org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
>   ... 47 elided
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get
> getClass   getOrCreate
> scala> SparkSession.builder.config("spark.sql.warehouse.dir", 
> "xyz").getOrCreate
> 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; 
> some configuration may not take effect.
> res7: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@6403d574
> scala> spark.sql("set spark.sql.warehouse.dir").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.warehou...|  xyz|
> ++-+
> scala>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31542) Backport SPARK-25692 Remove static initialization of worker eventLoop handling chunk fetch requests within TransportContext. This fixes ChunkFetchIntegrationSuite

2020-04-23 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-31542.
--
Resolution: Not A Problem

> Backport SPARK-25692   Remove static initialization of worker eventLoop 
> handling chunk fetch requests within TransportContext. This fixes 
> ChunkFetchIntegrationSuite as well
> 
>
> Key: SPARK-31542
> URL: https://issues.apache.org/jira/browse/SPARK-31542
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25692       Remove static initialization of worker eventLoop 
> handling chunk fetch requests within TransportContext. This fixes 
> ChunkFetchIntegrationSuite as well.
> While the test was only flaky in the 3.0 branch, it seems possible the same 
> code path could be triggered in 2.4, so consider it for backport.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31542) Backport SPARK-25692 Remove static initialization of worker eventLoop handling chunk fetch requests within TransportContext. This fixes ChunkFetchIntegrationSuite

2020-04-23 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091060#comment-17091060
 ] 

Shixiong Zhu commented on SPARK-31542:
--

[~holden] The flaky test was caused by a new improvement in 3.0 (SPARK-24355); it 
doesn't impact branch-2.4.

> Backport SPARK-25692   Remove static initialization of worker eventLoop 
> handling chunk fetch requests within TransportContext. This fixes 
> ChunkFetchIntegrationSuite as well
> 
>
> Key: SPARK-31542
> URL: https://issues.apache.org/jira/browse/SPARK-31542
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25692       Remove static initialization of worker eventLoop 
> handling chunk fetch requests within TransportContext. This fixes 
> ChunkFetchIntegrationSuite as well.
> While the test was only flaky in the 3.0 branch, it seems possible the same 
> code path could be triggered in 2.4, so consider it for backport.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31464) Upgrade Kafka to 2.5.0

2020-04-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091051#comment-17091051
 ] 

Dongjoon Hyun commented on SPARK-31464:
---

Thank you, [~ijuma]! :)

> Upgrade Kafka to 2.5.0
> --
>
> Key: SPARK-31464
> URL: https://issues.apache.org/jira/browse/SPARK-31464
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2020-04-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-27891.
--
Resolution: Cannot Reproduce

SPARK-23361 is in Spark 2.4.0, and the fix will not be backported to 2.3.x since 
2.3.x is EOL. Please reopen if anyone encounters this in 2.4.x.

> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Critical
> Attachments: application_1559242207407_0001.log, 
> spark_2.3.1_failure.log
>
>
> When the Spark job runs on a secured cluster for longer than the time specified 
> in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml, 
> the job fails.
> The following command was used to submit the Spark job:
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
> --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
> /tmp/ff1.txt
>  
> Application Logs attached
>  
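
For context, a minimal sketch of the keytab-based setup that SPARK-23361 relies 
on in 2.4+. The principal and paths are placeholders, and in practice these 
values are normally supplied through the spark-submit --principal/--keytab flags 
shown in the quoted command rather than set programmatically.
{code:java}
import org.apache.spark.sql.SparkSession

// Illustrative sketch only (Spark on YARN, 2.x config key names); with a
// principal and keytab available, the driver can re-obtain HDFS delegation
// tokens instead of failing once the token renew interval elapses.
val spark = SparkSession.builder()
  .appName("long-running-secure-job")
  .config("spark.yarn.principal", "acekrbuser@EXAMPLE.COM")                   // placeholder realm
  .config("spark.yarn.keytab", "/home/acekrbuser/keytabs/acekrbuser.keytab")  // placeholder path
  .getOrCreate()
{code}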



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31547) Upgrade Genjavadoc to 0.16

2020-04-23 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31547:
-

 Summary: Upgrade Genjavadoc to 0.16
 Key: SPARK-31547
 URL: https://issues.apache.org/jira/browse/SPARK-31547
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2020-04-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090996#comment-17090996
 ] 

Dongjoon Hyun commented on SPARK-25075:
---

Thank you for the updates, [~smarter].

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31540) Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+

2020-04-23 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090954#comment-17090954
 ] 

Holden Karau commented on SPARK-31540:
--

I was thinking some folks might build 2.4 with newer JDKs.

> Backport SPARK-27981   Remove `Illegal reflective access` warning for 
> `java.nio.Bits.unaligned()` in JDK9+
> --
>
> Key: SPARK-31540
> URL: https://issues.apache.org/jira/browse/SPARK-31540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> SPARK-27981       Remove `Illegal reflective access` warning for 
> `java.nio.Bits.unaligned()` in JDK9+



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31540) Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+

2020-04-23 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090944#comment-17090944
 ] 

Sean R. Owen commented on SPARK-31540:
--

The backport is probably harmless, but why is it needed for 2.4.x? This helps 
JDK 11 compatibility, but 2.4 won't work with JDK 11.
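
For context, a minimal sketch (illustrative only, not Spark's actual code) of the 
kind of reflective probe that produces this warning: calling setAccessible on a 
non-public member of java.base, such as java.nio.Bits#unaligned(), on JDK 9+.
{code:java}
// Illustrative sketch: reflectively accessing an internal, non-public JDK member.
// On JDK 9+ the setAccessible call below triggers the "Illegal reflective access"
// warning (assuming the default --illegal-access=permit behaviour).
val bitsClass = Class.forName("java.nio.Bits")
val unalignedMethod = bitsClass.getDeclaredMethod("unaligned")
unalignedMethod.setAccessible(true)
val unaligned = unalignedMethod.invoke(null).asInstanceOf[Boolean]
println(s"platform supports unaligned access: $unaligned")
{code}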

> Backport SPARK-27981   Remove `Illegal reflective access` warning for 
> `java.nio.Bits.unaligned()` in JDK9+
> --
>
> Key: SPARK-31540
> URL: https://issues.apache.org/jira/browse/SPARK-31540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> SPARK-27981       Remove `Illegal reflective access` warning for 
> `java.nio.Bits.unaligned()` in JDK9+



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31544) Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint

2020-04-23 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090928#comment-17090928
 ] 

Holden Karau commented on SPARK-31544:
--

Thanks!

> Backport SPARK-30199   Recover `spark.(ui|blockManager).port` from 
> checkpoint
> -
>
> Key: SPARK-31544
> URL: https://issues.apache.org/jira/browse/SPARK-31544
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Backport SPARK-30199       Recover `spark.(ui|blockManager).port` from 
> checkpoint
> cc [~dongjoon] in case you think this is a good candidate



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31544) Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint

2020-04-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090926#comment-17090926
 ] 

Dongjoon Hyun commented on SPARK-31544:
---

I made a PR, [~holden].
- https://github.com/apache/spark/pull/28320

> Backport SPARK-30199   Recover `spark.(ui|blockManager).port` from 
> checkpoint
> -
>
> Key: SPARK-31544
> URL: https://issues.apache.org/jira/browse/SPARK-31544
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Backport SPARK-30199       Recover `spark.(ui|blockManager).port` from 
> checkpoint
> cc [~dongjoon] in case you think this is a good candidate



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31540) Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+

2020-04-23 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090924#comment-17090924
 ] 

Holden Karau commented on SPARK-31540:
--

Gotcha, I'll go through the backport JIRAs and link them.

> Backport SPARK-27981   Remove `Illegal reflective access` warning for 
> `java.nio.Bits.unaligned()` in JDK9+
> --
>
> Key: SPARK-31540
> URL: https://issues.apache.org/jira/browse/SPARK-31540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> SPARK-27981       Remove `Illegal reflective access` warning for 
> `java.nio.Bits.unaligned()` in JDK9+



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31540) Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+

2020-04-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090922#comment-17090922
 ] 

Dongjoon Hyun commented on SPARK-31540:
---

cc [~srowen]

> Backport SPARK-27981   Remove `Illegal reflective access` warning for 
> `java.nio.Bits.unaligned()` in JDK9+
> --
>
> Key: SPARK-31540
> URL: https://issues.apache.org/jira/browse/SPARK-31540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> SPARK-27981       Remove `Illegal reflective access` warning for 
> `java.nio.Bits.unaligned()` in JDK9+



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31540) Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+

2020-04-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090921#comment-17090921
 ] 

Dongjoon Hyun commented on SPARK-31540:
---

[~holden], could you also link the original JIRA? Embedding it in the 
description doesn't provide bi-directional visibility.

> Backport SPARK-27981   Remove `Illegal reflective access` warning for 
> `java.nio.Bits.unaligned()` in JDK9+
> --
>
> Key: SPARK-31540
> URL: https://issues.apache.org/jira/browse/SPARK-31540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> SPARK-27981       Remove `Illegal reflective access` warning for 
> `java.nio.Bits.unaligned()` in JDK9+



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31544) Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint

2020-04-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090917#comment-17090917
 ] 

Dongjoon Hyun commented on SPARK-31544:
---

BTW, I kept the original authorship from the beginning. This will be the same 
for `branch-2.4`.

> Backport SPARK-30199   Recover `spark.(ui|blockManager).port` from 
> checkpoint
> -
>
> Key: SPARK-31544
> URL: https://issues.apache.org/jira/browse/SPARK-31544
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Backport SPARK-30199       Recover `spark.(ui|blockManager).port` from 
> checkpoint
> cc [~dongjoon] in case you think this is a good candidate



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31539) Backport SPARK-27138 Remove AdminUtils calls (fixes deprecation)

2020-04-23 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090915#comment-17090915
 ] 

Holden Karau commented on SPARK-31539:
--

The main reason I'd see to backport the change is that it's only in test code, and 
it might be useful if someone wants to build & test with a newer Kafka library. 
But now that I think about it some more, it's probably not worth it; I'll close 
this as Won't Fix.

> Backport SPARK-27138   Remove AdminUtils calls (fixes deprecation)
> --
>
> Key: SPARK-31539
> URL: https://issues.apache.org/jira/browse/SPARK-31539
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> SPARK-27138       Remove AdminUtils calls (fixes deprecation)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31539) Backport SPARK-27138 Remove AdminUtils calls (fixes deprecation)

2020-04-23 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-31539.
--
Resolution: Won't Fix

> Backport SPARK-27138   Remove AdminUtils calls (fixes deprecation)
> --
>
> Key: SPARK-31539
> URL: https://issues.apache.org/jira/browse/SPARK-31539
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> SPARK-27138       Remove AdminUtils calls (fixes deprecation)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31539) Backport SPARK-27138 Remove AdminUtils calls (fixes deprecation)

2020-04-23 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090914#comment-17090914
 ] 

Holden Karau commented on SPARK-31539:
--

The other Jira is marked as resolved and I want to track the outstanding issues 
for 2.4.6 to make sure we don't leave anything behind.

> Backport SPARK-27138   Remove AdminUtils calls (fixes deprecation)
> --
>
> Key: SPARK-31539
> URL: https://issues.apache.org/jira/browse/SPARK-31539
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> SPARK-27138       Remove AdminUtils calls (fixes deprecation)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31485) Barrier stage can hang if only partial tasks launched

2020-04-23 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-31485:
-
Shepherd: Holden Karau

> Barrier stage can hang if only partial tasks launched
> -
>
> Key: SPARK-31485
> URL: https://issues.apache.org/jira/browse/SPARK-31485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
>
> The issue can be reproduced by the following test:
>  
> {code:java}
> initLocalClusterSparkContext(2)
> val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2)
> val dep = new OneToOneDependency[Int](rdd0)
> val rdd = new MyRDD(sc, 2, List(dep), 
> Seq(Seq("executor_h_0"),Seq("executor_h_0")))
> rdd.barrier().mapPartitions { iter =>
>   BarrierTaskContext.get().barrier()
>   iter
> }.collect()
> {code}
>  
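
The snippet above relies on Spark's internal test helpers 
(initLocalClusterSparkContext, MyRDD), so it only runs inside the Spark test 
suite. For reference, a self-contained sketch of the same barrier-mode API it 
exercises, assuming a local-cluster master with enough slots for every task; 
names and sizing are illustrative.
{code:java}
import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.SparkSession

object BarrierDemo {
  def main(args: Array[String]): Unit = {
    // local-cluster[2,1,1024]: two workers with one core each, so both barrier
    // tasks of the two-partition RDD below can launch at the same time.
    val spark = SparkSession.builder()
      .master("local-cluster[2,1,1024]")
      .appName("barrier-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    val result = sc.parallelize(1 to 4, numSlices = 2)
      .barrier()
      .mapPartitions { iter =>
        // Every task in the barrier stage must reach this call before any of them
        // can proceed; if some tasks never launch (the situation described in this
        // issue), the ones that did launch wait here indefinitely.
        BarrierTaskContext.get().barrier()
        iter
      }
      .collect()

    println(result.mkString(","))
    spark.stop()
  }
}
{code}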



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31485) Barrier stage can hang if only partial tasks launched

2020-04-23 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-31485:
-
Target Version/s: 2.4.6, 3.0.0

> Barrier stage can hang if only partial tasks launched
> -
>
> Key: SPARK-31485
> URL: https://issues.apache.org/jira/browse/SPARK-31485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
>
> The issue can be reproduced by the following test:
>  
> {code:java}
> initLocalClusterSparkContext(2)
> val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2)
> val dep = new OneToOneDependency[Int](rdd0)
> val rdd = new MyRDD(sc, 2, List(dep), 
> Seq(Seq("executor_h_0"),Seq("executor_h_0")))
> rdd.barrier().mapPartitions { iter =>
>   BarrierTaskContext.get().barrier()
>   iter
> }.collect()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31543) Backport SPARK-26306 More memory to de-flake SorterSuite

2020-04-23 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090911#comment-17090911
 ] 

Sean R. Owen commented on SPARK-31543:
--

Does it need to be a new JIRA? If it's a simple test change, though, I think it's 
plausible to back-port if it affects 2.4.x.

> Backport SPARK-26306   More memory to de-flake SorterSuite
> --
>
> Key: SPARK-31543
> URL: https://issues.apache.org/jira/browse/SPARK-31543
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> SPARK-26306       More memory to de-flake SorterSuite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


