[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials
[ https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated SPARK-31551: -- Description: See the current *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
{code:java}
def createSparkUser(): UserGroupInformation = {
  val user = Utils.getCurrentUserName()
  logDebug("creating UGI for user: " + user)
  val ugi = UserGroupInformation.createRemoteUser(user)
  transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
  ugi
}

def transferCredentials(source: UserGroupInformation, dest: UserGroupInformation): Unit = {
  dest.addCredentials(source.getCredentials())
}

def getCurrentUserName(): String = {
  Option(System.getenv("SPARK_USER"))
    .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
}
{code}
The *transferCredentials* function can only transfer Hadoop credentials such as delegation tokens. Other credentials stored in UGI.subject.getPrivateCredentials are lost here, for example:
# Non-Hadoop credentials, such as [Kafka credentials|https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
# Newly supported or third-party Hadoop credentials. For example, to support OAuth/JWT token authentication on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials, because such tokens are not supposed to be managed by Hadoop Credentials (which currently holds only Hadoop secret keys and delegation tokens).
Another issue is that, when *SPARK_USER* is unset, *getCurrentUserName* falls back to UserGroupInformation.getCurrentUser().getShortUserName(), which loses the user's fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka).
We should instead use *getUserName* to obtain the fully qualified user name on the client side, which is aligned with *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*. Related to https://issues.apache.org/jira/browse/SPARK-1051
> createSparkUser lost user's non-Hadoop credentials
> --
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.4, 2.4.5
> Reporter: Yuqi Wang
> Priority: Major
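The proposed fix amounts to copying everything the source Subject holds, not only the Hadoop Credentials object, so that non-Hadoop tokens survive. A minimal sketch using only the JDK's javax.security.auth.Subject; *transferAllCredentials* is a hypothetical helper for illustration, not Spark's or Hadoop's API:

```java
import javax.security.auth.Subject;

public class CredentialTransferSketch {
    // Hypothetical helper: copy *all* credentials held by the source Subject,
    // not only the Hadoop Credentials object, so non-Hadoop tokens
    // (e.g. Kafka OAuth/JWT tokens) are not lost in the transfer.
    public static void transferAllCredentials(Subject source, Subject dest) {
        dest.getPrivateCredentials().addAll(source.getPrivateCredentials());
        dest.getPublicCredentials().addAll(source.getPublicCredentials());
    }

    public static void main(String[] args) {
        Subject source = new Subject();
        // Stand-in for a non-Hadoop private credential, e.g. an OAuth token.
        source.getPrivateCredentials().add("oauth-token-abc123");

        Subject dest = new Subject();
        transferAllCredentials(source, dest);

        // The non-Hadoop credential is now visible to the destination Subject.
        System.out.println(dest.getPrivateCredentials().contains("oauth-token-abc123")); // true
    }
}
```

A real fix would apply this to the Subjects backing the two UserGroupInformation instances in createSparkUser, in addition to the existing addCredentials call.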
[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials
[ https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated SPARK-31551: -- Summary: createSparkUser lost user's non-Hadoop credentials (was: createSparkUser lost user's non-Hadoop credentials and fully qualified user name)
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
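The user-name half of the issue can be illustrated without Hadoop at all. The toy *shortName* below mimics only the common Kerberos case (truncate at the first '/' or '@'); Hadoop's real getShortUserName applies configurable auth_to_local rules, so this is an assumption-laden illustration, not Hadoop's implementation:

```java
public class UserNameSketch {
    // Illustrative only: for a typical Kerberos principal, the short name
    // is the principal truncated at the first '/' or '@'.
    public static String shortName(String fullPrincipal) {
        int cut = fullPrincipal.length();
        for (char c : new char[] {'/', '@'}) {
            int i = fullPrincipal.indexOf(c);
            if (i >= 0 && i < cut) cut = i;
        }
        return fullPrincipal.substring(0, cut);
    }

    public static void main(String[] args) {
        String full = "alice/host1.example.com@EXAMPLE.COM";
        // The short name drops the host and realm, so an RPC server
        // (YARN, HDFS, Kafka) that needs the fully qualified name
        // cannot recover it from the short form.
        System.out.println(shortName(full)); // alice
    }
}
```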
[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091232#comment-17091232 ] Shashanka Balakuntala Srinivasa commented on SPARK-31463: - Hi [~hyukjin.kwon], I will start looking into this. Thanks.
> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Steven Moy
> Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how to improve JSON reading speed. We use Spark to process terabytes of JSON, so we try to find ways to improve JSON parsing speed.
>
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>
> [https://github.com/simdjson/simdjson/issues/93]
>
> Anyone in the open-source community interested in leading this effort to integrate simdjson in the Spark JSON data source API?
[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091230#comment-17091230 ] Hyukjin Kwon commented on SPARK-31463: -- A separate source might be ideal. We can start it as a separate project and gradually move it into Apache Spark once it has proven very useful.
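One way such a separate project could stay mergeable later is to isolate the parser behind a narrow seam. Everything below is hypothetical (neither the interface nor the class exists in Spark); the toy implementation handles only flat string-valued objects and merely marks where a Jackson- or simdjson-backed parser would plug in:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction: a separate-project JSON source could code against
// this seam, starting with a Jackson-backed implementation and later swapping
// in a simdjson-backed one without touching the rest of the reader.
interface JsonRecordParser {
    Map<String, String> parse(String line);
}

public class ParserSeamSketch {
    // Toy implementation for illustration only: handles flat objects with
    // string values, e.g. {"a":"1","b":"2"}. A real backend would be
    // Jackson or simdjson, not this.
    public static final JsonRecordParser TOY = line -> {
        Map<String, String> out = new HashMap<>();
        String body = line.trim();
        body = body.substring(1, body.length() - 1); // strip outer { }
        if (body.isEmpty()) return out;
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":", 2);
            out.put(unquote(kv[0]), unquote(kv[1]));
        }
        return out;
    };

    private static String unquote(String s) {
        s = s.trim();
        return s.substring(1, s.length() - 1); // strip surrounding quotes
    }

    public static void main(String[] args) {
        Map<String, String> rec = TOY.parse("{\"user\":\"steven\",\"bytes\":\"42\"}");
        System.out.println(rec.get("user")); // steven
    }
}
```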
[jira] [Commented] (SPARK-31438) Support JobCleaned Status in SparkListener
[ https://issues.apache.org/jira/browse/SPARK-31438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091225#comment-17091225 ] Hyukjin Kwon commented on SPARK-31438: -- PR https://github.com/apache/spark/pull/28280
> Support JobCleaned Status in SparkListener
> --
>
> Key: SPARK-31438
> URL: https://issues.apache.org/jira/browse/SPARK-31438
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.1.0
> Reporter: Jackey Lee
> Priority: Major
>
> In Spark, we need to run some hooks after a job is cleaned, such as cleaning Hive external temporary paths. This has already been discussed in SPARK-31346 and [GitHub Pull Request #28129.|https://github.com/apache/spark/pull/28129]
> The JobEnd status is not suitable for this. JobEnd marks job completion: once all results have been generated, the job is finished. After that, the scheduler leaves still-running tasks as zombie tasks and deletes abnormal tasks asynchronously.
> Thus, we add a JobCleaned status to let users run hooks after all tasks of a job are cleaned. The JobCleaned status can be derived from the TaskSetManagers, each of which is related to a stage; once all stages of the job have been cleaned, the job is cleaned.
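The proposal above can be sketched with a toy tracker (hypothetical names, not Spark's SparkListener API): a JobCleaned callback fires only once the job has ended *and* every stage's task set has been cleaned, which is strictly later than JobEnd:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the proposed idea, under assumed names: JobCleaned fires
// after JobEnd (results complete) AND after every stage's task set has been
// cleaned (zombie/abnormal tasks removed), so cleanup hooks run safely.
public class JobCleanedSketch {
    public interface Listener { void onJobCleaned(int jobId); }

    public static class JobTracker {
        private final int jobId;
        private final Set<Integer> pendingStages = new HashSet<>();
        private final Listener listener;
        private boolean jobEnded = false;

        public JobTracker(int jobId, Set<Integer> stages, Listener listener) {
            this.jobId = jobId;
            this.pendingStages.addAll(stages);
            this.listener = listener;
        }

        public void onJobEnd() { jobEnded = true; }    // results are complete

        public void onTaskSetCleaned(int stageId) {    // a stage's tasks cleaned
            pendingStages.remove(stageId);
            if (jobEnded && pendingStages.isEmpty()) {
                listener.onJobCleaned(jobId);          // safe to run hooks now
            }
        }
    }

    public static void main(String[] args) {
        JobTracker t = new JobTracker(1, Set.of(10, 11),
                jobId -> System.out.println("job " + jobId + " cleaned"));
        t.onJobEnd();           // JobEnd fires here, but stage 11 still has zombies
        t.onTaskSetCleaned(10);
        t.onTaskSetCleaned(11); // prints "job 1 cleaned"
    }
}
```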
[jira] [Resolved] (SPARK-31453) Error while converting JavaRDD to Dataframe
[ https://issues.apache.org/jira/browse/SPARK-31453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31453. -- Resolution: Duplicate It duplicates SPARK-23862. See SPARK-21255 for the workaround.
> Error while converting JavaRDD to Dataframe
> ---
>
> Key: SPARK-31453
> URL: https://issues.apache.org/jira/browse/SPARK-31453
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 2.4.5
> Reporter: Sachit Sharma
> Priority: Trivial
>
> Please refer to this:
> [https://stackoverflow.com/questions/61172007/error-while-converting-javardd-to-dataframe]
[jira] [Commented] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc
[ https://issues.apache.org/jira/browse/SPARK-31550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091220#comment-17091220 ] Kent Yao commented on SPARK-31550: -- GitHub webhook temporarily down; PR in progress: [https://github.com/apache/spark/pull/28322]
> nondeterministic configurations with general meanings in sql configuration doc
> --
>
> Key: SPARK-31550
> URL: https://issues.apache.org/jira/browse/SPARK-31550
> Project: Spark
> Issue Type: Bug
> Components: Documentation, SQL
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Kent Yao
> Priority: Major
>
> spark.sql.session.timeZone
> spark.sql.warehouse.dir
>
> These two configs are nondeterministic and vary across environments.
[jira] [Commented] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc
[ https://issues.apache.org/jira/browse/SPARK-31550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091218#comment-17091218 ] JinxinTang commented on SPARK-31550: - Try specifying the conf in spark-defaults.conf:
spark.sql.warehouse.dir /tmp
spark.sql.session.timeZone America/New_York
It does not seem to be a bug.
[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name
[ https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated SPARK-31551: -- Description: See current *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*: {code:java} def createSparkUser(): UserGroupInformation = { val user = Utils.getCurrentUserName() logDebug("creating UGI for user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) transferCredentials(UserGroupInformation.getCurrentUser(), ugi) ugi } def transferCredentials(source: UserGroupInformation, dest: UserGroupInformation): Unit = { dest.addCredentials(source.getCredentials()) } def getCurrentUserName(): String = { Option(System.getenv("SPARK_USER")) .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName()) } {code} The *transferCredentials* func can only transfer Hadoop creds such as Delegation Tokens. However, other creds stored in UGI.subject.getPrivateCredentials, will be lost here, such as: # Non-Hadoop creds: Such as, [Kafka creds |https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395] # Newly supported or 3rd party supported Hadoop creds: Such as to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token into UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials (currently it is only for Hadoop secret keys and delegation tokens) Another issue is that the *getCurrentUserName* only returns the getShortUserName of the user, which may lost the user's fully qualified user name that need to be passed to PRC server (such as YARN, HDFS, Kafka). 
We should better use the *getUserName* to get fully qualified user name in our client side, which is aligned to *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*. was: See current *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*: {code:java} def createSparkUser(): UserGroupInformation = { val user = Utils.getCurrentUserName() logDebug("creating UGI for user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) transferCredentials(UserGroupInformation.getCurrentUser(), ugi) ugi } def transferCredentials(source: UserGroupInformation, dest: UserGroupInformation): Unit = { dest.addCredentials(source.getCredentials()) } def getCurrentUserName(): String = { Option(System.getenv("SPARK_USER")) .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName()) } {code} The *transferCredentials* func can only transfer Hadoop creds such as Delegation Tokens. However, other creds stored in UGI.subject.getPrivateCredentials, will be lost here, such as: # Non-Hadoop creds: Such as, [Kafka creds |https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395] # Newly supported or 3rd party supported Hadoop creds: Such as to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token into UGI.subject.getPrivateCredentials. 
However, these tokens are not supposed to be managed by Hadoop Credentials (currently it is only for Hadoop secret keys and delegation tokens) Another issue is that the *getCurrentUserName* only returns the getShortUserName of the user, which may lost the user's fully qualified user name that need to be passed to PRC server (such as YARN, HDFS, Kafka). We should better use the *getUserName* to get fully qualified user name in our client side, which is aligned to *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*. > createSparkUser lost user's non-Hadoop credentials and fully qualified user > name > > > Key: SPARK-31551 > URL: https://issues.apache.org/jira/browse/SPARK-31551 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4, 2.4.5 >Reporter: Yuqi Wang >Priority: Major > > See current > *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/
[jira] [Updated] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name
[ https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated SPARK-31551: -- Description: See current *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*: {code:java} def createSparkUser(): UserGroupInformation = { val user = Utils.getCurrentUserName() logDebug("creating UGI for user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) transferCredentials(UserGroupInformation.getCurrentUser(), ugi) ugi } def transferCredentials(source: UserGroupInformation, dest: UserGroupInformation): Unit = { dest.addCredentials(source.getCredentials()) } def getCurrentUserName(): String = { Option(System.getenv("SPARK_USER")) .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName()) } {code} The *transferCredentials* func can only transfer Hadoop creds such as Delegation Tokens. However, other creds stored in UGI.subject.getPrivateCredentials, will be lost here, such as: # Non-Hadoop creds: Such as, [Kafka creds |https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395] # Newly supported or 3rd party supported Hadoop creds: Such as to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token into UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials (currently it is only for Hadoop secret keys and delegation tokens) Another issue is that the *getCurrentUserName* only returns the getShortUserName of the user, which may lost the user's fully qualified user name that need to be passed to PRC server (such as YARN, HDFS, Kafka). We should better use the *getUserName* to get fully qualified user name in our client side. 
This is aligned to *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]*. was: See current *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*: {code:java} def createSparkUser(): UserGroupInformation = { val user = Utils.getCurrentUserName() logDebug("creating UGI for user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) transferCredentials(UserGroupInformation.getCurrentUser(), ugi) ugi } def transferCredentials(source: UserGroupInformation, dest: UserGroupInformation): Unit = { dest.addCredentials(source.getCredentials()) } def getCurrentUserName(): String = { Option(System.getenv("SPARK_USER")) .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName()) } {code} The *transferCredentials* func can only transfer Hadoop creds such as Delegation Tokens. However, other creds stored in UGI.subject.getPrivateCredentials, will be lost here, such as: # Non-Hadoop creds: Such as, [Kafka creds |https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395] # Newly supported or 3rd party supported Hadoop creds: Such as to support OAuth/JWT token authn on Hadoop, we need to store the OAuth/JWT token into UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials (currently it is only for Hadoop secret keys and delegation tokens) Another issue is that the *getCurrentUserName* only returns the getShortUserName of the user, which may lost the user's fully qualified user name that need to be passed to PRC server (such as YARN, HDFS, Kafka). 
We should better use the *getUserName* to get fully qualified user name in our client side. This is aligned to *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L720]* > createSparkUser lost user's non-Hadoop credentials and fully qualified user > name > > > Key: SPARK-31551 > URL: https://issues.apache.org/jira/browse/SPARK-31551 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4, 2.4.5 >Reporter: Yuqi Wang >Priority: Major > > See current > *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHad
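The credential loss described above can be reproduced with JDK classes alone: `javax.security.auth.Subject` (the class backing UGI's subject) keeps private credentials in a per-subject set, so a freshly built subject — which is effectively what `createRemoteUser` produces — carries none of them unless they are copied explicitly. A minimal sketch, with a placeholder string standing in for a real token object:

```java
import javax.security.auth.Subject;

public class PrivateCredentialSketch {
    public static void main(String[] args) {
        // Source subject holding a non-Hadoop credential, e.g. an OAuth token.
        // The string is a made-up stand-in for a real token object.
        Subject source = new Subject();
        source.getPrivateCredentials().add("oauth-token-placeholder");

        // A freshly built subject starts with an empty private-credential set,
        // which is why only Hadoop Credentials survive transferCredentials.
        Subject dest = new Subject();
        System.out.println(dest.getPrivateCredentials().isEmpty());

        // The direction this issue points at: copy the private credentials too.
        dest.getPrivateCredentials().addAll(source.getPrivateCredentials());
        System.out.println(dest.getPrivateCredentials().contains("oauth-token-placeholder"));
    }
}
```

This is only an illustration of the Subject semantics, not Spark's actual fix; a real patch would operate on the subjects inside the two UserGroupInformation instances.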
[jira] [Created] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials and fully qualified user name
Yuqi Wang created SPARK-31551:
------------------------------

Summary: createSparkUser lost user's non-Hadoop credentials and fully qualified user name
Key: SPARK-31551
URL: https://issues.apache.org/jira/browse/SPARK-31551
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.5, 2.4.4
Reporter: Yuqi Wang

Current createSparkUser: [https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]

{code:java}
def createSparkUser(): UserGroupInformation = {
  val user = Utils.getCurrentUserName()
  logDebug("creating UGI for user: " + user)
  val ugi = UserGroupInformation.createRemoteUser(user)
  transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
  ugi
}

def transferCredentials(source: UserGroupInformation, dest: UserGroupInformation): Unit = {
  dest.addCredentials(source.getCredentials())
}

def getCurrentUserName(): String = {
  Option(System.getenv("SPARK_USER"))
    .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
}
{code}

The transferCredentials function only transfers Hadoop credentials such as delegation tokens. Other credentials stored in UGI.subject.getPrivateCredentials are lost here, such as:
1. Non-Hadoop credentials: e.g. Kafka credentials, https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395
2. Customized Hadoop credentials: e.g., to support OAuth/JWT token authentication on Hadoop, we need to store the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are not supposed to be managed by Hadoop Credentials (currently only for Hadoop secret keys and delegation tokens).

Another issue is that getCurrentUserName only returns the getShortUserName of the user, which may lose the fully qualified user name that needs to be passed to RPC servers (such as YARN, HDFS, Kafka).
We should use getUserName to get the fully qualified user name on the client side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
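The short-name vs. fully-qualified-name distinction can be illustrated without Hadoop on the classpath. For a Kerberos-style principal, the full form is `user@REALM` while the short form keeps only the local part; the helper below is a hypothetical stand-in for that stripping, not Hadoop's actual logic (real short names come from the configurable auth_to_local rules):

```java
public class ShortNameSketch {
    // Hypothetical stand-in for UGI.getShortUserName(): keep only the part
    // before '@'. Hadoop's real mapping is driven by auth_to_local rules.
    static String shortName(String fullName) {
        int at = fullName.indexOf('@');
        return at < 0 ? fullName : fullName.substring(0, at);
    }

    public static void main(String[] args) {
        String fullyQualified = "alice@EXAMPLE.COM"; // the getUserName()-style form
        // The realm is gone after shortening; an RPC server keyed on the fully
        // qualified name cannot recover "EXAMPLE.COM" from the short form,
        // hence the proposal to propagate the full name instead.
        System.out.println(shortName(fullyQualified));
    }
}
```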
[jira] [Updated] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc
[ https://issues.apache.org/jira/browse/SPARK-31550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-31550: - Component/s: Documentation > nondeterministic configurations with general meanings in sql configuration doc > -- > > Key: SPARK-31550 > URL: https://issues.apache.org/jira/browse/SPARK-31550 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > spark.sql.session.timeZone > spark.sql.warehouse.dir > > these 2 configs are nondeterministic and vary with environments -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc
Kent Yao created SPARK-31550: Summary: nondeterministic configurations with general meanings in sql configuration doc Key: SPARK-31550 URL: https://issues.apache.org/jira/browse/SPARK-31550 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao spark.sql.session.timeZone spark.sql.warehouse.dir These two configs are nondeterministic and vary across environments. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
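To see why such defaults are environment-dependent: when `spark.sql.session.timeZone` is unset, Spark falls back to the JVM default time zone, so the value shown in generated docs depends on the machine that built them. A JDK-only illustration:

```java
import java.util.TimeZone;

public class SessionTimeZoneSketch {
    public static void main(String[] args) {
        // The JVM default zone differs from machine to machine, e.g.
        // "America/Los_Angeles" on one host and "UTC" on another, which is
        // what makes the documented default nondeterministic.
        String zone = TimeZone.getDefault().getID();
        System.out.println(zone);
    }
}
```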
[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091205#comment-17091205 ] Steven Moy commented on SPARK-31463: Hi [~hyukjin.kwon], What's Spark's recommended path for introducing C code? I was following SQLite and DuckDB; their approach is to inline the dependency (bring the code in, in the case of a compatible license). Or would it be better to support simdjson as a completely separate DataSourceV2 implementation? simdjson's license is the Apache License as well; [https://github.com/simdjson/simdjson/blob/master/LICENSE] > Enhance JsonDataSource by replacing jackson with simdjson > - > > Key: SPARK-31463 > URL: https://issues.apache.org/jira/browse/SPARK-31463 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Steven Moy >Priority: Minor > > I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how > to improve json reading speed. We use Spark to process terabytes of JSON, so > we try to find ways to improve JSON parsing speed. > > [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/] > > [https://github.com/simdjson/simdjson/issues/93] > > Anyone on the opensource communty interested in leading this effort to > integrate simdjson in spark json data source api? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31461) RLIKE and LIKE expression compiles every time when it used
[ https://issues.apache.org/jira/browse/SPARK-31461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091201#comment-17091201 ] Linpx commented on SPARK-31461: --- I did search and didn't find any. > RLIKE and LIKE expression compiles every time when it used > -- > > Key: SPARK-31461 > URL: https://issues.apache.org/jira/browse/SPARK-31461 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Linpx >Priority: Minor > > org.apache.spark.sql.catalyst.expressions > regexpExpressions.scala > line: 41 > {code:scala} > // try cache the pattern for Literal > private lazy val cache: Pattern = right match { > case x @ Literal(value: String, StringType) => compile(value) > case _ => null > } > {code} > StringType Literal value is UTF8String by default -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
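The caching the quoted snippet intends — compile a literal pattern once, then reuse it — can be sketched with the JDK regex API alone. The report's point is that the `case Literal(value: String, StringType)` arm never matches because the literal actually holds a `UTF8String`, so the cache stays `null` and the pattern is recompiled per row. The memoizing helper below is a hypothetical illustration of the intended behavior, not Spark's code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

public class PatternCacheSketch {
    // Hypothetical memoized compile: each distinct regex is compiled once.
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    static Pattern compileCached(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    public static void main(String[] args) {
        Pattern first = compileCached("a+b");
        Pattern second = compileCached("a+b");
        // Same instance both times: the expensive compile ran only once,
        // which is the behavior the broken match arm fails to deliver.
        System.out.println(first == second);
        System.out.println(first.matcher("aab").matches());
    }
}
```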
[jira] [Resolved] (SPARK-31461) RLIKE and LIKE expression compiles every time when it used
[ https://issues.apache.org/jira/browse/SPARK-31461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31461. -- Resolution: Duplicate > RLIKE and LIKE expression compiles every time when it used > -- > > Key: SPARK-31461 > URL: https://issues.apache.org/jira/browse/SPARK-31461 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Linpx >Priority: Minor > > org.apache.spark.sql.catalyst.expressions > regexpExpressions.scala > line: 41 > {code:scala} > // try cache the pattern for Literal > private lazy val cache: Pattern = right match { > case x @ Literal(value: String, StringType) => compile(value) > case _ => null > } > {code} > StringType Literal value is UTF8String by default -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31461) RLIKE and LIKE expression compiles every time when it used
[ https://issues.apache.org/jira/browse/SPARK-31461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091200#comment-17091200 ] Hyukjin Kwon commented on SPARK-31461: -- Please search existing jiras first before filing it. > RLIKE and LIKE expression compiles every time when it used > -- > > Key: SPARK-31461 > URL: https://issues.apache.org/jira/browse/SPARK-31461 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Linpx >Priority: Minor > > org.apache.spark.sql.catalyst.expressions > regexpExpressions.scala > line: 41 > {code:scala} > // try cache the pattern for Literal > private lazy val cache: Pattern = right match { > case x @ Literal(value: String, StringType) => compile(value) > case _ => null > } > {code} > StringType Literal value is UTF8String by default -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091199#comment-17091199 ] Hyukjin Kwon commented on SPARK-31463: -- So it's about vectorization, right? I think [~maxgekk] talked about vectorization somewhere. My biggest concern is whether it's right to bring a C library into Spark as a dependency. > Enhance JsonDataSource by replacing jackson with simdjson > - > > Key: SPARK-31463 > URL: https://issues.apache.org/jira/browse/SPARK-31463 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Steven Moy >Priority: Minor > > I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how > to improve json reading speed. We use Spark to process terabytes of JSON, so > we try to find ways to improve JSON parsing speed. > > [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/] > > [https://github.com/simdjson/simdjson/issues/93] > > Anyone on the opensource communty interested in leading this effort to > integrate simdjson in spark json data source api? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31463: - Component/s: (was: Spark Core) SQL > Enhance JsonDataSource by replacing jackson with simdjson > - > > Key: SPARK-31463 > URL: https://issues.apache.org/jira/browse/SPARK-31463 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Steven Moy >Priority: Minor > > I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how > to improve json reading speed. We use Spark to process terabytes of JSON, so > we try to find ways to improve JSON parsing speed. > > [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/] > > [https://github.com/simdjson/simdjson/issues/93] > > Anyone on the opensource communty interested in leading this effort to > integrate simdjson in spark json data source api? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31466) String/Int to VarcharType cast not supported in Spark
[ https://issues.apache.org/jira/browse/SPARK-31466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091196#comment-17091196 ] Hyukjin Kwon commented on SPARK-31466: -- Can you read the doc for the class? {quote} * Hive varchar type. Similar to other HiveStringType's, these datatypes should only used for * parsing, and should NOT be used anywhere else. Any instance of these data types should be * replaced by a [[StringType]] before analysis. {quote} It's not supposed to be used as an API. > String/Int to VarcharType cast not supported in Spark > --- > > Key: SPARK-31466 > URL: https://issues.apache.org/jira/browse/SPARK-31466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gourav Choubey >Priority: Major > > While casting a string column to varchar it does not do the casting at all > and column remains string. > > I tried to achieve it through VarcharType as below but it errors : > for(i<-colList) > { Df=Df.withColumn(i,Df(i).cast(VarcharType(1000))) } > *org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`ColName AS > VARCHAR(1000))' due to data type mismatch: cannot cast string to > varchar(1000);;* > > *Also, I tried through selectExpr cast option but no success.* > > While trying to create an empty dataframe with VarcharType also throws an > error > *scala> var empty_df = spark.createDataFrame(sc.emptyRDD[Row], schema_rdd)* > *scala.MatchError: VarcharType(1000) (of class > org.apache.spark.sql.types.VarcharType)* > *at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.org$apache$spark$sql$catalyst$enco* > > Please suggest a way to cast a string to varchar in spark. As a reading > string column in SAS application has performance implications. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31466) String/Int to VarcharType cast not supported in Spark
[ https://issues.apache.org/jira/browse/SPARK-31466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31466. -- Resolution: Invalid > String/Int to VarcharType cast not supported in Spark > ---
[jira] [Resolved] (SPARK-31479) Numbers with thousands separator or locale specific decimal separator not parsed correctly
[ https://issues.apache.org/jira/browse/SPARK-31479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31479. -- Resolution: Duplicate > Numbers with thousands separator or locale specific decimal separator not > parsed correctly > -- > > Key: SPARK-31479 > URL: https://issues.apache.org/jira/browse/SPARK-31479 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Ranjit Iyer >Priority: Major > > CSV files that contain numbers with thousands separator (or locale specific > decimal separators) are not parsed correctly and are reported as {{null.}} > [https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html] > A user in France might expect "10,100" to be parsed as a float while a user > in the US might want Spark to interpret it as an Integer value (10100). > UnivocityParser is not locale aware and must use NumberFormatter to parse > string values to Numbers. > *US Locale* > {{scala>Source.fromFile("/Users/ranjit.iyer/work/data/us.csv").getLines.mkString("\n")}} > {{res28: String =}} > {{"Value"}} > {{"10,000"}} > {{"20,000"}} > {{scala> Locale.setDefault(Locale.US)}} > {{scala> val _schema = StructType(StructField("Value", IntegerType, true) :: > Nil)}} > {{scala> val df = spark.read.format("csv").option("header", > "true").schema(_schema).load("/Users/ranjit.iyer/work/data/us.csv")}} > {{df: org.apache.spark.sql.DataFrame = [Value: int]}} > {{scala> df.show}} > {{+-+}} > {{|Value|}} > {{+-+}} > {{| null|}} > {{| null|}} > {{+-+}} > *French Local* > {{scala> > Source.fromFile("/Users/ranjit.iyer/work/data/fr.csv").getLines.mkString("\n")}} > {{res43: String =}} > {{"Value"}} > {{"10,123"}} > {{"20,456"}} > {{scala> Locale.setDefault(Locale.FRANCE)}} > {{scala> val _schema = StructType(StructField("Value", FloatType, true) :: > Nil)}} > {{scala> val df = spark.read.format("csv").option("header", > "true").schema(_schema).load("/Users/ranjit.iyer/work/data/fr.csv")}} > 
{{df: org.apache.spark.sql.DataFrame = [Value: float]}} > {{scala> df.show}} > {{+-+}} > {{|Value|}} > {{+-+}} > {{| null|}} > {{| null|}} > {{+-+}} > The fix is to use a NumberFormatter and I have it working locally and will > raise a PR for review. > {{NumberFormat.getInstance.parse(_).intValue()}} > Thousands separator are quite commonly found on the internet. My workflow has > been to copy to Excel, export to csv and analyze in Spark. > [https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31479) Numbers with thousands separator or locale specific decimal separator not parsed correctly
[ https://issues.apache.org/jira/browse/SPARK-31479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091193#comment-17091193 ] Hyukjin Kwon commented on SPARK-31479: -- Use locale option. See SPARK-25945 > Numbers with thousands separator or locale specific decimal separator not > parsed correctly > --
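The CSV `locale` option that Hyukjin points to (added in SPARK-25945) rests on locale-aware number parsing of exactly the kind the reporter describes. A standalone JDK sketch (plain `java.text.NumberFormat`, not Spark's CSV reader itself) shows how the same comma is interpreted differently per locale:

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class LocaleParse {
    public static void main(String[] args) throws ParseException {
        // US locale: ',' is a grouping separator, so "10,000" is ten thousand.
        Number us = NumberFormat.getInstance(Locale.US).parse("10,000");
        System.out.println(us.intValue());    // 10000

        // French locale: ',' is the decimal separator, so "10,123" is 10.123.
        Number fr = NumberFormat.getInstance(Locale.FRANCE).parse("10,123");
        System.out.println(fr.doubleValue()); // 10.123
    }
}
```

With Spark's CSV reader the same effect is obtained per read via `.option("locale", "fr-FR")` rather than `Locale.setDefault`, which mutates JVM-global state.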
[jira] [Commented] (SPARK-31493) Optimize InSet to In according partition size at InSubqueryExec
[ https://issues.apache.org/jira/browse/SPARK-31493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091190#comment-17091190 ] Hyukjin Kwon commented on SPARK-31493: -- PR: https://github.com/apache/spark/pull/28269 > Optimize InSet to In according partition size at InSubqueryExec > --- > > Key: SPARK-31493 > URL: https://issues.apache.org/jira/browse/SPARK-31493 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor >
[jira] [Updated] (SPARK-31482) spark.kubernetes.driver.podTemplateFile Configuration not used by the job
[ https://issues.apache.org/jira/browse/SPARK-31482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31482: - Priority: Major (was: Blocker) > spark.kubernetes.driver.podTemplateFile Configuration not used by the job > - > > Key: SPARK-31482 > URL: https://issues.apache.org/jira/browse/SPARK-31482 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Pradeep Misra >Priority: Major > > Spark 3.0 - Running Spark Submit as below and point to a MinKube cluster > {code:java} > bin/spark-submit \ > --master k8s://https://192.168.99.102:8443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.kubernetes.driver.podTemplateFile=../driver_1E.template \ > --conf spark.kubernetes.executor.podTemplateFile=../executor.template \ > --conf spark.kubernetes.container.image=spark:spark3 \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar 1 > {code} > > Spark Binaries - spark-3.0.0-preview2-bin-hadoop2.7.tgz > Driver Template - > {code:java} > apiVersion: v1 > kind: Pod > metadata: > labels: > spark-app-id: my-custom-id > annotations: > spark-driver-cpu: 1 > spark-driver-mem: 1 > spark-executor-cpu: 1 > spark-executor-mem: 1 > spark-executor-count: 1 > spec: > schedulerName: spark-scheduler{code} > Executor Template > > {code:java} > apiVersion: v1 > kind: Pod > metadata: > labels: > spark-app-id: my-custom-id > spec: > schedulerName: spark-scheduler{code} > Kubernetes Pods Launched - Two Executor Pods were launched which was default > {code:java} > spark-pi-e608e7718f11cc69-driver 1/1 Running 0 10s > spark-pi-e608e7718f11cc69-exec-1 1/1 Running 0 5s > spark-pi-e608e7718f11cc69-exec-2 1/1 Running 0 5s{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
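One likely source of the surprise above: in Spark 3.0, pod-template `metadata.annotations` are carried through as plain Kubernetes metadata only; Spark never reads values such as `spark-executor-count` out of them. Executor count and resource sizing still come from Spark confs, which would explain the two default executors. An illustrative sketch of the confs involved (values hypothetical, not taken from the reporter's setup):

```
# Resource sizing and executor count are driven by Spark confs,
# not by pod-template annotations:
--conf spark.executor.instances=1
--conf spark.driver.memory=1g
--conf spark.driver.cores=1
```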
[jira] [Resolved] (SPARK-31496) Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-31496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31496. -- Resolution: Invalid > Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError > > > Key: SPARK-31496 > URL: https://issues.apache.org/jira/browse/SPARK-31496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Windows 10 (1909) > JDK 11.0.6 > spark-3.0.0-preview2-bin-hadoop3.2 > local[1] > > >Reporter: Tomas Shestakov >Priority: Major > Labels: out-of-memory > > Local spark with one core (local[1]) while trying to save Dataset to parquet > local file cause OOM. > {code:java} > SparkSession sparkSession = SparkSession.builder() > .appName("Loader impl test") > .master("local[1]") > .config("spark.ui.enabled", false) > .config("spark.sql.datetime.java8API.enabled", true) > .config("spark.serializer", > "org.apache.spark.serializer.KryoSerializer") > .config("spark.kryoserializer.buffer.max", "1g") > .config("spark.executor.memory", "4g") > .config("spark.driver.memory", "8g") > .getOrCreate(); > {code} > {noformat} > [20-Apr-2020 11:42:27.877] INFO [boundedElastic-2 > o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output > committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter > [20-Apr-2020 11:42:27.877] INFO [boundedElastic-2 > o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output > committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter > [20-Apr-2020 11:42:27.967] INFO [boundedElastic-2 > o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output > Committer Algorithm version is 1 > [20-Apr-2020 11:42:27.969] INFO [boundedElastic-2 > o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using user defined > output committer class org.apache.parquet.hadoop.ParquetOutputCommitter > [20-Apr-2020 11:42:27.970] INFO [boundedElastic-2 > 
o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output > Committer Algorithm version is 1 > [20-Apr-2020 11:42:27.973] INFO [boundedElastic-2 > o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using output committer > class org.apache.parquet.hadoop.ParquetOutputCommitter > [20-Apr-2020 11:42:34.371] INFO [boundedElastic-2 > org.apache.spark.SparkContext:57] q: - Starting job: save at > LoaderImpl.java:305 > [20-Apr-2020 11:42:34.389] INFO [dag-scheduler-event-loop > org.apache.spark.scheduler.DAGScheduler:57] q: - Got job 0 (save at > LoaderImpl.java:305) with 1 output partitions > [20-Apr-2020 11:42:34.390] INFO [dag-scheduler-event-loop > org.apache.spark.scheduler.DAGScheduler:57] q: - Final stage: ResultStage 0 > (save at LoaderImpl.java:305) > [20-Apr-2020 11:42:34.390] INFO [dag-scheduler-event-loop > org.apache.spark.scheduler.DAGScheduler:57] q: - Parents of final stage: > List() > [20-Apr-2020 11:42:34.392] INFO [dag-scheduler-event-loop > org.apache.spark.scheduler.DAGScheduler:57] q: - Missing parents: > List()[20-Apr-2020 11:42:34.398] INFO [dag-scheduler-event-loop > org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting ResultStage 0 > (MapPartitionsRDD[6] at save at LoaderImpl.java:305), which has no missing > parents > [20-Apr-2020 11:42:34.634] INFO [dag-scheduler-event-loop > org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0 stored > as values in memory (estimated size 166.1 KiB, free 18.4 GiB) > [20-Apr-2020 11:42:34.945] INFO [dag-scheduler-event-loop > org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0_piece0 > stored as bytes in memory (estimated size 58.0 KiB, free 18.4 GiB) > [20-Apr-2020 11:42:34.949] INFO [dispatcher-BlockManagerMaster > org.apache.spark.storage.BlockManagerInfo:57] q: - Added broadcast_0_piece0 > in memory on DESKTOP-A1:58276 (size: 58.0 KiB, free: 18.4 GiB) > [20-Apr-2020 11:42:34.953] INFO [dag-scheduler-event-loop > org.apache.spark.SparkContext:57] q: - 
Created broadcast 0 from broadcast at > DAGScheduler.scala:1206 > [20-Apr-2020 11:42:34.980] INFO [dag-scheduler-event-loop > org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting 1 missing tasks > from ResultStage 0 (MapPartitionsRDD[6] at save at LoaderImpl.java:305) > (first 15 tasks are for partitions Vector(0)) > [20-Apr-2020 11:42:34.981] INFO [dag-scheduler-event-loop > org.apache.spark.scheduler.TaskSchedulerImpl:57] q: - Adding task set 0.0 > with 1 tasks > Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError at > java.base/java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:125) > at > java.base/java.
[jira] [Commented] (SPARK-31496) Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-31496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091189#comment-17091189 ] Hyukjin Kwon commented on SPARK-31496: -- Is this a regression? Sounds more like a question which should be best asked to mailing list. You could have a better answer there. > Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError >
[jira] [Commented] (SPARK-31502) document identifier in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091188#comment-17091188 ] Hyukjin Kwon commented on SPARK-31502: -- PR: https://github.com/apache/spark/pull/28277 > document identifier in SQL Reference > > > Key: SPARK-31502 > URL: https://issues.apache.org/jira/browse/SPARK-31502 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > document identifier in SQL Reference
[jira] [Updated] (SPARK-31510) Set setwd in R documentation build
[ https://issues.apache.org/jira/browse/SPARK-31510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31510: - Fix Version/s: 3.0.0 > Set setwd in R documentation build > -- > > Key: SPARK-31510 > URL: https://issues.apache.org/jira/browse/SPARK-31510 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 3.0.0 > > > {code} > > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) > Loading required package: usethis > Error: Could not find package root, is your working directory inside a > package? > {code} > Seems like in some environments it fails as above. > https://stackoverflow.com/questions/52670051/how-to-troubleshoot-error-could-not-find-package-root > https://groups.google.com/forum/#!topic/rdevtools/79jjjdc_wjg
[jira] [Resolved] (SPARK-31510) Set setwd in R documentation build
[ https://issues.apache.org/jira/browse/SPARK-31510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31510. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/28285 > Set setwd in R documentation build > --
[jira] [Reopened] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring
[ https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-31530: -- > Spark submit fails if we provide extraJavaOption which contains Xmx as > substring > - > > Key: SPARK-31530 > URL: https://issues.apache.org/jira/browse/SPARK-31530 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Mayank >Priority: Major > Labels: 2.4.0, Spark, Submit > > Spark submit doesn't allow Xmx anywhere in the spark.driver.extraJavaOptions > For eg: > {code:java} > bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] > --conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" > examples\jars\spark-examples_2.11-2.4.4.jar > Error: Not allowed to specify max heap(Xmx) memory settings through java > options (was -DmyKey=MyValueContainsXmx). Use the corresponding > --driver-memory or spark.driver.memory configuration instead.{code} > https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102 > [https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302] > Can we update the above condition to check more specific for eg -Xmx > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring
[ https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank updated SPARK-31530: --- Description: Spark submit doesn't allow Xmx anywhere in the spark.driver.extraJavaOptions For eg: {code:java} bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] --conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" examples\jars\spark-examples_2.11-2.4.4.jar Error: Not allowed to specify max heap(Xmx) memory settings through java options (was -DmyKey=MyValueContainsXmx). Use the corresponding --driver-memory or spark.driver.memory configuration instead.{code} https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102 [https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302] Can we update the above condition to check more specific for eg -Xmx was: Spark submit doesn't allow Xmx anywhere in the spark.driver.extraJavaOptions For eg: {code:java} bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] --conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" examples\jars\spark-examples_2.11-2.4.4.jar Error: Not allowed to specify max heap(Xmx) memory settings through java options (was -DmyKey=MyValueContainsXmx). 
Use the corresponding --driver-memory or spark.driver.memory configuration instead.{code} [https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102 https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302|http://example.com/] Can we update the above condition to check more specific for eg -Xmx
[jira] [Commented] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring
[ https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091184#comment-17091184 ] Hyukjin Kwon commented on SPARK-31530: -- Oh, gotya.
[jira] [Commented] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring
[ https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091182#comment-17091182 ] Mayank commented on SPARK-31530: [~hyukjin.kwon] If you check the description I want to specify some properties in extraJavaOptions which contains Xmx as substring for eg spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx
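The stricter check Mayank asks for can be sketched in plain Java: instead of rejecting any string containing the substring "Xmx", match `-Xmx` only where it begins an option token. This is an illustrative standalone sketch, not the actual launcher code (the method name and regex are hypothetical):

```java
import java.util.regex.Pattern;

public class XmxCheck {
    // Match -Xmx only when it starts an option token (start of string or
    // preceded by whitespace), not when "Xmx" merely appears inside a value.
    private static final Pattern XMX = Pattern.compile("(^|\\s)-Xmx");

    static boolean specifiesMaxHeap(String javaOptions) {
        return XMX.matcher(javaOptions).find();
    }

    public static void main(String[] args) {
        System.out.println(specifiesMaxHeap("-DmyKey=MyValueContainsXmx")); // false
        System.out.println(specifiesMaxHeap("-Xmx4g"));                     // true
        System.out.println(specifiesMaxHeap("-Dfoo=bar -Xmx512m"));         // true
    }
}
```

Under this check the reporter's `-DmyKey=MyValueContainsXmx` passes, while a genuine max-heap setting is still rejected.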
[jira] [Resolved] (SPARK-31514) Kerberos: Spark UGI credentials are not getting passed down to Hive
[ https://issues.apache.org/jira/browse/SPARK-31514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31514. -- Resolution: Invalid > Kerberos: Spark UGI credentials are not getting passed down to Hive > --- > > Key: SPARK-31514 > URL: https://issues.apache.org/jira/browse/SPARK-31514 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.4 >Reporter: Sanchay Javeria >Priority: Major > > I'm using Spark-2.4, I have a Kerberos enabled cluster where I'm trying to > run a query via the {{spark-sql}} shell. > The simplified setup basically looks like this: spark-sql shell running on > one host in a Yarn cluster -> external hive-metastore running one host -> S3 > to store table data. > When I launch the {{spark-sql}} shell with DEBUG logging enabled, this is > what I see in the logs: > {code:java} > > bin/spark-sql --proxy-user proxy_user > ... > DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for > proxy_user against hive/_h...@realm.com at thrift://hive-metastore:9083 > DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_h...@realm.com > (auth:KERBEROS) > from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130){code} > This means that Spark made a call to fetch the delegation token from the Hive > metastore and then added it to the list of credentials for the UGI. [This is > the piece of > code|https://github.com/apache/spark/blob/branch-2.4/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L129] > that does that. I also verified in the metastore logs that the > {{get_delegation_token()}} call was being made. > Now when I run a simple query like {{create table test_table (id int) > location "s3://some/prefix";}} I get hit with an AWS credentials error. 
I > modified the hive metastore code and added this right before the file system > in Hadoop is initialized > ([org/apache/hadoop/hive/metastore/Warehouse.java|#L116]): > {code:java} > public static FileSystem getFs(Path f, Configuration conf) throws > MetaException { > try { > UserGroupInformation ugi = UserGroupInformation.getCurrentUser(); > LOG.info("UGI information: " + ugi); > Collection<Token<? extends TokenIdentifier>> tokens = > ugi.getCredentials().getAllTokens(); > for (Token<? extends TokenIdentifier> token : tokens) { > LOG.info(token); > } > } catch (IOException e) { > e.printStackTrace(); > } > ... > {code} > In the metastore logs, this does print the correct UGI information: > {code:java} > UGI information: proxy_user (auth:PROXY) via hive/hive-metast...@realm.com > (auth:KERBEROS){code} > but there are no tokens present in the UGI. Looks like [Spark > code|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala#L101] > adds it with the alias {{hive.server2.delegation.token}} but I don't see it > in the UGI. This makes me suspect that somehow the UGI scope is isolated and > not being shared between spark-sql and hive metastore. How do I go about > solving this? Any help will be really appreciated!
[jira] [Commented] (SPARK-31514) Kerberos: Spark UGI credentials are not getting passed down to Hive
[ https://issues.apache.org/jira/browse/SPARK-31514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091181#comment-17091181 ] Hyukjin Kwon commented on SPARK-31514: -- It should be best to ask questions into mailing list before filing it as an issue. I guess you could have a better answer there. > Kerberos: Spark UGI credentials are not getting passed down to Hive > --- > > Key: SPARK-31514 > URL: https://issues.apache.org/jira/browse/SPARK-31514 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.4 >Reporter: Sanchay Javeria >Priority: Major > > I'm using Spark-2.4, I have a Kerberos enabled cluster where I'm trying to > run a query via the {{spark-sql}} shell. > The simplified setup basically looks like this: spark-sql shell running on > one host in a Yarn cluster -> external hive-metastore running one host -> S3 > to store table data. > When I launch the {{spark-sql}} shell with DEBUG logging enabled, this is > what I see in the logs: > {code:java} > > bin/spark-sql --proxy-user proxy_user > ... > DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for > proxy_user against hive/_h...@realm.com at thrift://hive-metastore:9083 > DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_h...@realm.com > (auth:KERBEROS) > from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130){code} > This means that Spark made a call to fetch the delegation token from the Hive > metastore and then added it to the list of credentials for the UGI. [This is > the piece of > code|https://github.com/apache/spark/blob/branch-2.4/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L129] > that does that. I also verified in the metastore logs that the > {{get_delegation_token()}} call was being made. 
> Now when I run a simple query like {{create table test_table (id int) > location "s3://some/prefix";}} I get hit with an AWS credentials error. I > modified the hive metastore code and added this right before the file system > in Hadoop is initialized > ([org/apache/hadoop/hive/metastore/Warehouse.java|#L116]): > {code:java} > public static FileSystem getFs(Path f, Configuration conf) throws > MetaException { > try { > UserGroupInformation ugi = UserGroupInformation.getCurrentUser(); > LOG.info("UGI information: " + ugi); > Collection<Token<? extends TokenIdentifier>> tokens = > ugi.getCredentials().getAllTokens(); > for (Token<? extends TokenIdentifier> token : tokens) { > LOG.info(token); > } > } catch (IOException e) { > e.printStackTrace(); > } > ... > {code} > In the metastore logs, this does print the correct UGI information: > {code:java} > UGI information: proxy_user (auth:PROXY) via hive/hive-metast...@realm.com > (auth:KERBEROS){code} > but there are no tokens present in the UGI. Looks like [Spark > code|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala#L101] > adds it with the alias {{hive.server2.delegation.token}} but I don't see it > in the UGI. This makes me suspect that somehow the UGI scope is isolated and > not being shared between spark-sql and hive metastore. How do I go about > solving this? Any help will be really appreciated!
[jira] [Commented] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091179#comment-17091179 ] Hyukjin Kwon commented on SPARK-31519: -- PR: https://github.com/apache/spark/pull/28294 > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+--+ > | b| fake| > +---+--+ > | 2|2020-01-01| > +---+--+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---++ > | b|fake| > +---++ > +---++ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggregate operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate.
This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts the next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns.
[jira] [Commented] (SPARK-31521) The fetch size is not correct when merging blocks into a merged block
[ https://issues.apache.org/jira/browse/SPARK-31521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091178#comment-17091178 ] Hyukjin Kwon commented on SPARK-31521: -- PR: https://github.com/apache/spark/pull/28301 > The fetch size is not correct when merging blocks into a merged block > - > > Key: SPARK-31521 > URL: https://issues.apache.org/jira/browse/SPARK-31521 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > When merging blocks into a merged block, we should count the size of that > merged block as well.
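The accounting issue can be reduced to a tiny sketch (hypothetical names, not the actual ShuffleBlockFetcherIterator code): when contiguous blocks are coalesced into one merged block, the bytes-to-fetch counter must include the merged block's size, i.e. the sum of its parts:

```java
public class FetchSizeAccounting {
    static long sum(long[] sizes) {
        long total = 0;
        for (long s : sizes) {
            total += s;
        }
        return total;
    }

    // Total bytes to fetch: the plain (unmerged) blocks plus the merged
    // block, whose size is the sum of the blocks folded into it. Dropping
    // the second term is the kind of undercount described in the ticket.
    static long totalFetchSize(long[] plainBlocks, long[] mergedParts) {
        return sum(plainBlocks) + sum(mergedParts);
    }

    public static void main(String[] args) {
        // One 5-byte plain block plus a merged block built from 10 + 20 bytes.
        System.out.println(totalFetchSize(new long[]{5}, new long[]{10, 20})); // 35
    }
}
```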
[jira] [Commented] (SPARK-31524) Add metric for the split number of skewed partitions when enabling AQE
[ https://issues.apache.org/jira/browse/SPARK-31524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091175#comment-17091175 ] Hyukjin Kwon commented on SPARK-31524: -- PR: https://github.com/apache/spark/pull/28109 > Add metric for the split number of skewed partitions when enabling AQE > -- > > Key: SPARK-31524 > URL: https://issues.apache.org/jira/browse/SPARK-31524 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Priority: Major > > Add detailed metrics for the split number in skewed partitions when enabling > AQE and skew join optimization.
[jira] [Commented] (SPARK-31523) LogicalPlan doCanonicalize should throw exception if not resolved
[ https://issues.apache.org/jira/browse/SPARK-31523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091176#comment-17091176 ] Hyukjin Kwon commented on SPARK-31523: -- PR: https://github.com/apache/spark/pull/28304 > LogicalPlan doCanonicalize should throw exception if not resolved > - > > Key: SPARK-31523 > URL: https://issues.apache.org/jira/browse/SPARK-31523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor >
[jira] [Commented] (SPARK-31527) date add/subtract interval should only allow day precision in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-31527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091174#comment-17091174 ] Hyukjin Kwon commented on SPARK-31527: -- PR: https://github.com/apache/spark/pull/28310 > date add/subtract interval should only allow day precision in ANSI mode > -- > > Key: SPARK-31527 > URL: https://issues.apache.org/jira/browse/SPARK-31527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Under ANSI mode, we should not allow date add interval with hours, minutes... > microseconds.
[jira] [Commented] (SPARK-31529) Remove extra whitespaces in the formatted explain
[ https://issues.apache.org/jira/browse/SPARK-31529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091172#comment-17091172 ] Hyukjin Kwon commented on SPARK-31529: -- PR: https://github.com/apache/spark/pull/28315 > Remove extra whitespaces in the formatted explain > - > > Key: SPARK-31529 > URL: https://issues.apache.org/jira/browse/SPARK-31529 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > The formatted explain includes extra whitespaces. Even the number of > spaces differs between master and branch-3.0, which leads to failing > explain tests if we backport to branch-3.0.
[jira] [Commented] (SPARK-31528) Remove millennium, century, decade from trunc/date_trunc functions
[ https://issues.apache.org/jira/browse/SPARK-31528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091173#comment-17091173 ] Hyukjin Kwon commented on SPARK-31528: -- PR: https://github.com/apache/spark/pull/28313 > Remove millennium, century, decade from trunc/date_trunc functions > --- > > Key: SPARK-31528 > URL: https://issues.apache.org/jira/browse/SPARK-31528 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Same as #SPARK-31507, millennium, century, and decade are not commonly used > in most modern platforms. > For example: > Negative: > https://docs.snowflake.com/en/sql-reference/functions-date-time.html#supported-date-and-time-parts > https://prestodb.io/docs/current/functions/datetime.html#date_trunc > https://teradata.github.io/presto/docs/148t/functions/datetime.html#date_trunc > https://www.oracletutorial.com/oracle-date-functions/oracle-trunc/ > Positive: > https://docs.aws.amazon.com/redshift/latest/dg/r_Dateparts_for_datetime_functions.html > https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC
[jira] [Resolved] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring
[ https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31530. -- Resolution: Won't Fix > Spark submit fails if we provide extraJavaOption which contains Xmx as > substring > - > > Key: SPARK-31530 > URL: https://issues.apache.org/jira/browse/SPARK-31530 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Mayank >Priority: Major > Labels: 2.4.0, Spark, Submit > > Spark submit doesn't allow Xmx anywhere in the spark.driver.extraJavaOptions > For eg: > {code:java} > bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] > --conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" > examples\jars\spark-examples_2.11-2.4.4.jar > Error: Not allowed to specify max heap(Xmx) memory settings through java > options (was -DmyKey=MyValueContainsXmx). Use the corresponding > --driver-memory or spark.driver.memory configuration instead.{code} > [https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102 > https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302|http://example.com/] > Can we update the above condition to check more specific for eg -Xmx > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31530) Spark submit fails if we provide extraJavaOption which contains Xmx as substring
[ https://issues.apache.org/jira/browse/SPARK-31530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091171#comment-17091171 ] Hyukjin Kwon commented on SPARK-31530: -- Why don't you follow the guide and use {{spark.driver.memory}}? > Spark submit fails if we provide extraJavaOption which contains Xmx as > substring > - > > Key: SPARK-31530 > URL: https://issues.apache.org/jira/browse/SPARK-31530 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Mayank >Priority: Major > Labels: 2.4.0, Spark, Submit > > Spark submit doesn't allow Xmx anywhere in the spark.driver.extraJavaOptions > For eg: > {code:java} > bin\spark-submit --class org.apache.spark.examples.SparkPi --master local[*] > --conf "spark.driver.extraJavaOptions=-DmyKey=MyValueContainsXmx" > examples\jars\spark-examples_2.11-2.4.4.jar > Error: Not allowed to specify max heap(Xmx) memory settings through java > options (was -DmyKey=MyValueContainsXmx). Use the corresponding > --driver-memory or spark.driver.memory configuration instead.{code} > [https://github.com/apache/spark/blob/v2.4.4/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L102 > https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L302|http://example.com/] > Can we update the above condition to check more specific for eg -Xmx > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31531) sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner() method not found during spark-submit
[ https://issues.apache.org/jira/browse/SPARK-31531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31531. -- Resolution: Duplicate > sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner() method not found during > spark-submit > --- > > Key: SPARK-31531 > URL: https://issues.apache.org/jira/browse/SPARK-31531 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.5 >Reporter: shayoni Halder >Priority: Major > Attachments: error.PNG > > > I am trying to run the following Spark submit from a VM using Yarn cluster > mode. > ./spark-submit --master yarn --deploy-mode client test_spark_yarn.py > The VM has java version 11 and spark-2.4.5 while the yarn cluster has java 8 and > spark-2.4.0. I am getting the error below: > !error.PNG!
[jira] [Commented] (SPARK-31531) sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner() method not found during spark-submit
[ https://issues.apache.org/jira/browse/SPARK-31531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091170#comment-17091170 ] Hyukjin Kwon commented on SPARK-31531: -- Spark 2.x does not support Java 11. It will be supported in Spark 3. > sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner() method not found during > spark-submit > --- > > Key: SPARK-31531 > URL: https://issues.apache.org/jira/browse/SPARK-31531 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.5 >Reporter: shayoni Halder >Priority: Major > Attachments: error.PNG > > > I am trying to run the following Spark submit from a VM using Yarn cluster > mode. > ./spark-submit --master yarn --deploy-mode client test_spark_yarn.py > The VM has java version 11 and spark-2.4.5 while the yarn cluster has java 8 and > spark-2.4.0. I am getting the error below: > !error.PNG!
[jira] [Commented] (SPARK-31532) SparkSessionBuilder should not propagate static sql configurations to the existing active/default SparkSession
[ https://issues.apache.org/jira/browse/SPARK-31532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091169#comment-17091169 ] Hyukjin Kwon commented on SPARK-31532: -- PR is in progress at https://github.com/apache/spark/pull/28316 > SparkSessionBuilder shoud not propagate static sql configurations to the > existing active/default SparkSession > - > > Key: SPARK-31532 > URL: https://issues.apache.org/jira/browse/SPARK-31532 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Clearly, this is a bug. > {code:java} > scala> spark.sql("set spark.sql.warehouse.dir").show > +++ > | key| value| > +++ > |spark.sql.warehou...|file:/Users/kenty...| > +++ > scala> spark.sql("set spark.sql.warehouse.dir=2"); > org.apache.spark.sql.AnalysisException: Cannot modify the value of a static > config: spark.sql.warehouse.dir; > at > org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154) > at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42) > at > org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100) > at > org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > ... 47 elided > scala> import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.SparkSession > scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get > getClass getOrCreate > scala> SparkSession.builder.config("spark.sql.warehouse.dir", > "xyz").getOrCreate > 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; > some configuration may not take effect. > res7: org.apache.spark.sql.SparkSession = > org.apache.spark.sql.SparkSession@6403d574 > scala> spark.sql("set spark.sql.warehouse.dir").show > ++-+ > | key|value| > ++-+ > |spark.sql.warehou...| xyz| > ++-+ > scala> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
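The transcript above can be modelled with a minimal toy (hypothetical classes, not the real SparkSession API): getOrCreate on a builder that carries static options should not write those options into an already-existing session, otherwise a config that SET refuses to change is changed anyway:

```java
import java.util.HashMap;
import java.util.Map;

public class StaticConfDemo {
    static class Session {
        final Map<String, String> staticConf = new HashMap<>();
    }

    // Stand-in for the one active/default session.
    static final Session existing = new Session();

    // Buggy getOrCreate: propagates static options into the existing session.
    static Session getOrCreateBuggy(Map<String, String> opts) {
        existing.staticConf.putAll(opts); // must not touch static confs here
        return existing;
    }

    // Fixed getOrCreate: static options only take effect when a brand-new
    // session (and its underlying context) is actually created.
    static Session getOrCreateFixed(Map<String, String> opts) {
        return existing;
    }

    public static void main(String[] args) {
        Map<String, String> opts = new HashMap<>();
        opts.put("spark.sql.warehouse.dir", "xyz");
        getOrCreateBuggy(opts);
        // The existing session's static conf was silently changed:
        System.out.println(existing.staticConf.get("spark.sql.warehouse.dir")); // xyz
    }
}
```

This mirrors the transcript: SET rejects modifying spark.sql.warehouse.dir at runtime, yet the builder path changes it on the same session.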
[jira] [Commented] (SPARK-31544) Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-31544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091163#comment-17091163 ] Dongjoon Hyun commented on SPARK-31544: --- This is resolved via https://github.com/apache/spark/pull/28320 . > Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from > checkpoint > - > > Key: SPARK-31544 > URL: https://issues.apache.org/jira/browse/SPARK-31544 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.6 >Reporter: Holden Karau >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.6 > > > Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from > checkpoint > cc [~dongjoon] to see if you think this is a good candidate
[jira] [Updated] (SPARK-30199) Recover spark.ui.port and spark.blockManager.port from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-30199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30199: -- Issue Type: Bug (was: Improvement) > Recover spark.ui.port and spark.blockManager.port from checkpoint > - > > Key: SPARK-30199 > URL: https://issues.apache.org/jira/browse/SPARK-30199 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Aaruna Godthi >Priority: Major > Fix For: 2.4.6, 3.0.0 > >
[jira] [Updated] (SPARK-30199) Recover spark.ui.port and spark.blockManager.port from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-30199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30199: -- Fix Version/s: 2.4.6 > Recover spark.ui.port and spark.blockManager.port from checkpoint > - > > Key: SPARK-30199 > URL: https://issues.apache.org/jira/browse/SPARK-30199 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Aaruna Godthi >Priority: Major > Fix For: 2.4.6, 3.0.0 > >
[jira] [Resolved] (SPARK-31544) Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-31544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31544. --- Fix Version/s: 2.4.6 Resolution: Fixed > Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from > checkpoint > - > > Key: SPARK-31544 > URL: https://issues.apache.org/jira/browse/SPARK-31544 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.6 >Reporter: Holden Karau >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.6 > > > Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from > checkpoint > cc [~dongjoon] to see if you think this is a good candidate
[jira] [Commented] (SPARK-31532) SparkSessionBuilder should not propagate static sql configurations to the existing active/default SparkSession
[ https://issues.apache.org/jira/browse/SPARK-31532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091158#comment-17091158 ] Hyukjin Kwon commented on SPARK-31532: -- The problem is that the static configuration was changed during runtime. > SparkSessionBuilder shoud not propagate static sql configurations to the > existing active/default SparkSession > - > > Key: SPARK-31532 > URL: https://issues.apache.org/jira/browse/SPARK-31532 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Clearly, this is a bug. > {code:java} > scala> spark.sql("set spark.sql.warehouse.dir").show > +++ > | key| value| > +++ > |spark.sql.warehou...|file:/Users/kenty...| > +++ > scala> spark.sql("set spark.sql.warehouse.dir=2"); > org.apache.spark.sql.AnalysisException: Cannot modify the value of a static > config: spark.sql.warehouse.dir; > at > org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154) > at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42) > at > org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100) > at > org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > ... 47 elided > scala> import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.SparkSession > scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get > getClass getOrCreate > scala> SparkSession.builder.config("spark.sql.warehouse.dir", > "xyz").getOrCreate > 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; > some configuration may not take effect. > res7: org.apache.spark.sql.SparkSession = > org.apache.spark.sql.SparkSession@6403d574 > scala> spark.sql("set spark.sql.warehouse.dir").show > ++-+ > | key|value| > ++-+ > |spark.sql.warehou...| xyz| > ++-+ > scala> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31547) Upgrade Genjavadoc to 0.16
[ https://issues.apache.org/jira/browse/SPARK-31547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31547: Assignee: Dongjoon Hyun > Upgrade Genjavadoc to 0.16 > -- > > Key: SPARK-31547 > URL: https://issues.apache.org/jira/browse/SPARK-31547 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > >
[jira] [Resolved] (SPARK-31547) Upgrade Genjavadoc to 0.16
[ https://issues.apache.org/jira/browse/SPARK-31547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31547. -- Fix Version/s: 3.1.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/28321 > Upgrade Genjavadoc to 0.16 > -- > > Key: SPARK-31547 > URL: https://issues.apache.org/jira/browse/SPARK-31547 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > >
[jira] [Commented] (SPARK-31545) Backport SPARK-27676 InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles
[ https://issues.apache.org/jira/browse/SPARK-31545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091152#comment-17091152 ] Hyukjin Kwon commented on SPARK-31545: -- I think not; it causes a behaviour change which can be pretty critical in SS cases (see the updated migration guide). > Backport SPARK-27676 InMemoryFileIndex should respect > spark.sql.files.ignoreMissingFiles > -- > > Key: SPARK-31545 > URL: https://issues.apache.org/jira/browse/SPARK-31545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-27676 InMemoryFileIndex should respect > spark.sql.files.ignoreMissingFiles > cc [~joshrosen] I think backporting this has been asked in the original > ticket, do you have any objections?
[jira] [Commented] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases
[ https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091150#comment-17091150 ] Hyukjin Kwon commented on SPARK-31538: -- Hm, I wonder why we should backport this. This was just a test-only cleanup. BTW, do we need to file a JIRA for each backport? I think you can just use the existing JIRA, backport, and fix the Fix Version. > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases > -- > > Key: SPARK-31538 > URL: https://issues.apache.org/jira/browse/SPARK-31538 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases
[jira] [Commented] (SPARK-31537) Backport SPARK-25559 Remove the unsupported predicates in Parquet when possible
[ https://issues.apache.org/jira/browse/SPARK-31537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091149#comment-17091149 ] Hyukjin Kwon commented on SPARK-31537: -- I wouldn't port this back, per the guidelines in our versioning policy (https://spark.apache.org/versioning-policy.html). Improvements are usually not ported back. > Backport SPARK-25559 Remove the unsupported predicates in Parquet when > possible > > > Key: SPARK-31537 > URL: https://issues.apache.org/jira/browse/SPARK-31537 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Holden Karau >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.6 > > > Consider backporting SPARK-25559 Remove the unsupported predicates in > Parquet when possible to 2.4.6
[jira] [Updated] (SPARK-31537) Backport SPARK-25559 Remove the unsupported predicates in Parquet when possible
[ https://issues.apache.org/jira/browse/SPARK-31537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31537: - Target Version/s: 2.4.6 > Backport SPARK-25559 Remove the unsupported predicates in Parquet when > possible > > > Key: SPARK-31537 > URL: https://issues.apache.org/jira/browse/SPARK-31537 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Holden Karau >Assignee: DB Tsai >Priority: Major > > Consider backporting SPARK-25559 Remove the unsupported predicates in > Parquet when possible to 2.4.6
[jira] [Updated] (SPARK-31537) Backport SPARK-25559 Remove the unsupported predicates in Parquet when possible
[ https://issues.apache.org/jira/browse/SPARK-31537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31537: - Fix Version/s: (was: 2.4.6) > Backport SPARK-25559 Remove the unsupported predicates in Parquet when > possible > > > Key: SPARK-31537 > URL: https://issues.apache.org/jira/browse/SPARK-31537 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Holden Karau >Assignee: DB Tsai >Priority: Major > > Consider backporting SPARK-25559 Remove the unsupported predicates in > Parquet when possible to 2.4.6
[jira] [Commented] (SPARK-31536) Backport SPARK-25407 Allow nested access for non-existent field for Parquet file when nested pruning is enabled
[ https://issues.apache.org/jira/browse/SPARK-31536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091148#comment-17091148 ] Hyukjin Kwon commented on SPARK-31536: -- To backport this, we should port SPARK-31116 together. I tend to think we shouldn't backport this either, given that this feature is disabled by default in Spark 2.4; the change also affects the code paths used when the option is disabled. > Backport SPARK-25407 Allow nested access for non-existent field for > Parquet file when nested pruning is enabled > - > > Key: SPARK-31536 > URL: https://issues.apache.org/jira/browse/SPARK-31536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Consider backporting SPARK-25407 Allow nested access for non-existent > field for Parquet file when nested pruning is enabled to 2.4.6
[jira] [Commented] (SPARK-31546) Backport SPARK-25595 Ignore corrupt Avro file if flag IGNORE_CORRUPT_FILES enabled
[ https://issues.apache.org/jira/browse/SPARK-31546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091146#comment-17091146 ] Hyukjin Kwon commented on SPARK-31546: -- I think it's fine to port back. > Backport SPARK-25595 Ignore corrupt Avro file if flag > IGNORE_CORRUPT_FILES enabled > > > Key: SPARK-31546 > URL: https://issues.apache.org/jira/browse/SPARK-31546 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-25595 Ignore corrupt Avro file if flag > IGNORE_CORRUPT_FILES enabled > cc [~Gengliang.Wang] & [~hyukjin.kwon] for comments
[jira] [Assigned] (SPARK-30804) Measure and log elapsed time for "compact" operation in CompactibleFileStreamLog
[ https://issues.apache.org/jira/browse/SPARK-30804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30804: Assignee: Jungtaek Lim > Measure and log elapsed time for "compact" operation in > CompactibleFileStreamLog > > > Key: SPARK-30804 > URL: https://issues.apache.org/jira/browse/SPARK-30804 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > The "compact" operation in FileStreamSourceLog and FileStreamSinkLog was > introduced to solve the "small files" problem, but it introduces non-trivial latency, > which is another headache in long-running queries. > There are a bunch of reports from the community about the same issue (see SPARK-24295, > SPARK-29995, SPARK-30462) - before trying to solve the problem, it would be > better to measure the latency (elapsed time) and log it, to help indicate the > issue when the additional latency becomes a concern.
[jira] [Resolved] (SPARK-30804) Measure and log elapsed time for "compact" operation in CompactibleFileStreamLog
[ https://issues.apache.org/jira/browse/SPARK-30804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30804. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27557 [https://github.com/apache/spark/pull/27557] > Measure and log elapsed time for "compact" operation in > CompactibleFileStreamLog > > > Key: SPARK-30804 > URL: https://issues.apache.org/jira/browse/SPARK-30804 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > The "compact" operation in FileStreamSourceLog and FileStreamSinkLog was > introduced to solve the "small files" problem, but it introduces non-trivial latency, > which is another headache in long-running queries. > There are a bunch of reports from the community about the same issue (see SPARK-24295, > SPARK-29995, SPARK-30462) - before trying to solve the problem, it would be > better to measure the latency (elapsed time) and log it, to help indicate the > issue when the additional latency becomes a concern.
[jira] [Updated] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly
[ https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-31549: --- Issue Type: Bug (was: Improvement) > Pyspark SparkContext.cancelJobGroup do not work correctly > - > > Key: SPARK-31549 > URL: https://issues.apache.org/jira/browse/SPARK-31549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Weichen Xu >Priority: Critical > > Pyspark SparkContext.cancelJobGroup does not work correctly. This issue has > existed for a long time, because PySpark threads are not pinned to JVM threads > when invoking Java-side methods, so every PySpark API that relies on Java > thread-local variables does not work correctly (including `sc.setLocalProperty`, > `sc.cancelJobGroup`, `sc.setJobDescription`, and so on). > This is a serious issue. There is an experimental PySpark 'PIN_THREAD' mode > added in Spark 3.0 which addresses it, but the 'PIN_THREAD' mode has two > issues: > * It is disabled by default; we need to set an additional environment variable > to enable it. > * It has a memory leak issue which hasn't been addressed. > A series of projects like hyperopt-spark and spark-joblib rely on the > `sc.cancelJobGroup` API (they use it to stop running jobs in their code), so it > is critical to address this issue, and we hope to make it work under the > default PySpark mode. An optional approach is implementing methods like > `rdd.setGroupAndCollect`.
[jira] [Created] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly
Weichen Xu created SPARK-31549: -- Summary: Pyspark SparkContext.cancelJobGroup do not work correctly Key: SPARK-31549 URL: https://issues.apache.org/jira/browse/SPARK-31549 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.5, 3.0.0 Reporter: Weichen Xu Pyspark SparkContext.cancelJobGroup does not work correctly. This issue has existed for a long time, because PySpark threads are not pinned to JVM threads when invoking Java-side methods, so every PySpark API that relies on Java thread-local variables does not work correctly (including `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription`, and so on). This is a serious issue. There is an experimental PySpark 'PIN_THREAD' mode added in Spark 3.0 which addresses it, but the 'PIN_THREAD' mode has two issues: * It is disabled by default; we need to set an additional environment variable to enable it. * It has a memory leak issue which hasn't been addressed. A series of projects like hyperopt-spark and spark-joblib rely on the `sc.cancelJobGroup` API (they use it to stop running jobs in their code), so it is critical to address this issue, and we hope to make it work under the default PySpark mode. An optional approach is implementing methods like `rdd.setGroupAndCollect`.
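The thread-locality problem described above can be illustrated without Spark at all: a property stored in a thread-local by one thread is invisible to any other thread, which is what happens when a PySpark call is served by a different JVM thread than the one that set the job group. A minimal pure-Python sketch (the names `set_job_group` and `get_job_group` are hypothetical stand-ins for the JVM-side local properties, not Spark's API):

```python
import threading

# Stand-in for the JVM side: each thread keeps its own "local properties"
# (e.g. the job group used by cancelJobGroup).
jvm_local = threading.local()

def set_job_group(group):
    jvm_local.group = group

def get_job_group():
    return getattr(jvm_local, "group", None)

results = {}

def worker():
    # Without thread pinning, a Python call may be served by a *different*
    # thread, which never saw the property set below.
    results["group_seen"] = get_job_group()

set_job_group("my-group")          # set on the main thread
t = threading.Thread(target=worker)
t.start()
t.join()

print(get_job_group())             # the setting thread sees "my-group"
print(results["group_seen"])       # the other thread sees None
```

This is why `PIN_THREAD` mode fixes the behaviour: it dedicates one JVM thread per Python thread, so the thread-local set by `setJobGroup` is the one later read by `cancelJobGroup`.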
[jira] [Created] (SPARK-31548) Refactor pyspark code for common methods in JavaParams and Pipeline/OneVsRest
Weichen Xu created SPARK-31548: -- Summary: Refactor pyspark code for common methods in JavaParams and Pipeline/OneVsRest Key: SPARK-31548 URL: https://issues.apache.org/jira/browse/SPARK-31548 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.0 Reporter: Weichen Xu Background: See discussion here https://github.com/apache/spark/pull/28273#discussion_r411462216 and https://github.com/apache/spark/pull/28279#discussion_r412699397
[jira] [Resolved] (SPARK-31488) Support `java.time.LocalDate` in Parquet filter pushdown
[ https://issues.apache.org/jira/browse/SPARK-31488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31488. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28259 [https://github.com/apache/spark/pull/28259] > Support `java.time.LocalDate` in Parquet filter pushdown > > > Key: SPARK-31488 > URL: https://issues.apache.org/jira/browse/SPARK-31488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Currently, ParquetFilters supports only java.sql.Date values of DateType, and > explicitly casts Any to java.sql.Date, see > https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176 > So, any filters that refer to date values are not pushed down to Parquet when > spark.sql.datetime.java8API.enabled is true.
[jira] [Assigned] (SPARK-31488) Support `java.time.LocalDate` in Parquet filter pushdown
[ https://issues.apache.org/jira/browse/SPARK-31488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31488: --- Assignee: Maxim Gekk > Support `java.time.LocalDate` in Parquet filter pushdown > > > Key: SPARK-31488 > URL: https://issues.apache.org/jira/browse/SPARK-31488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Currently, ParquetFilters supports only java.sql.Date values of DateType, and > explicitly casts Any to java.sql.Date, see > https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176 > So, any filters that refer to date values are not pushed down to Parquet when > spark.sql.datetime.java8API.enabled is true.
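The essence of the fix described above is type dispatch on the filter value instead of an unconditional cast to one concrete date class. A rough Python analogue of that idea (the helper `date_to_days` is illustrative only, not Spark's implementation; an ISO string stands in for the second Java date representation):

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)

def date_to_days(value):
    """Normalize a date filter value to days since the Unix epoch.

    Instead of casting every value to a single concrete date type (the bug),
    dispatch on the value's actual type, accepting both representations.
    """
    if isinstance(value, datetime.date):
        return (value - EPOCH).days
    if isinstance(value, str):
        # Stand-in for the second supported representation.
        return (datetime.date.fromisoformat(value) - EPOCH).days
    raise TypeError(f"unsupported date value: {value!r}")

print(date_to_days(datetime.date(1970, 1, 2)))   # 1
print(date_to_days("1970-01-01"))                # 0
```

With only the unconditional cast, the second representation raises instead of being pushed down, which mirrors why `java.time.LocalDate` filters were silently skipped.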
[jira] [Assigned] (SPARK-31526) Add a new test suite for ExperssionInfo
[ https://issues.apache.org/jira/browse/SPARK-31526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31526: Assignee: Takeshi Yamamuro > Add a new test suite for ExperssionInfo > --- > > Key: SPARK-31526 > URL: https://issues.apache.org/jira/browse/SPARK-31526 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor >
[jira] [Resolved] (SPARK-31526) Add a new test suite for ExperssionInfo
[ https://issues.apache.org/jira/browse/SPARK-31526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31526. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28308 [https://github.com/apache/spark/pull/28308] > Add a new test suite for ExperssionInfo > --- > > Key: SPARK-31526 > URL: https://issues.apache.org/jira/browse/SPARK-31526 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.0.0 > >
[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache
[ https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091075#comment-17091075 ] Jungtaek Lim commented on SPARK-26385: -- The symptoms are mixed up - please clarify where the exception occurs (driver, AM, executor, somewhere else?), which mode you use, and which configuration you used to try to mitigate it. Please file a new issue with the above information per case. Adding comments describing different cases here might emphasize the importance of the issue, but it is not helpful for investigating it. Please also note that we need the driver / AM / executor logs, because we should check the interaction among them (how delegation tokens were passed). > YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in > cache > --- > > Key: SPARK-26385 > URL: https://issues.apache.org/jira/browse/SPARK-26385 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Hadoop 2.6.0, Spark 2.4.0 >Reporter: T M >Priority: Major > > > Hello, > > I have a Spark Structured Streaming job which is running on YARN (Hadoop 2.6.0, > Spark 2.4.0).
After 25-26 hours, my job stops working with following error: > {code:java} > 2018-12-16 22:35:17 ERROR > org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query > TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = > a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, > realUser=, issueDate=1544903057122, maxDate=1545507857122, > sequenceNumber=10314, masterKeyId=344) can't be found in cache at > org.apache.hadoop.ipc.Client.call(Client.java:1470) at > org.apache.hadoop.ipc.Client.call(Client.java:1401) at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752) > at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at > org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at > org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at > org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at > org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at > org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at > org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at > org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at > 
org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActi
[jira] [Comment Edited] (SPARK-31532) SparkSessionBuilder shoud not propagate static sql configurations to the existing active/default SparkSession
[ https://issues.apache.org/jira/browse/SPARK-31532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091067#comment-17091067 ] JinxinTang edited comment on SPARK-31532 at 4/24/20, 1:08 AM: -- Thanks for your issue. The following configurations may not be modified after SparkSession startup, by design: [spark.sql.codegen.comments, spark.sql.queryExecutionListeners, spark.sql.catalogImplementation, spark.sql.subquery.maxThreadThreshold, spark.sql.globalTempDatabase, spark.sql.codegen.cache.maxEntries, spark.sql.filesourceTableRelationCacheSize, spark.sql.streaming.streamingQueryListeners, spark.sql.ui.retainedExecutions, spark.sql.hive.thriftServer.singleSession, spark.sql.extensions, spark.sql.debug, spark.sql.sources.schemaStringLengthThreshold, spark.sql.warehouse.dir] So this might not be a bug. was (Author: jinxintang): Thanks for your issue, these followings not be modified after sparksession startup by design: [spark.sql.codegen.comments, spark.sql.queryExecutionListeners, spark.sql.catalogImplementation, spark.sql.subquery.maxThreadThreshold, spark.sql.globalTempDatabase, spark.sql.codegen.cache.maxEntries, spark.sql.filesourceTableRelationCacheSize, spark.sql.streaming.streamingQueryListeners, spark.sql.ui.retainedExecutions, spark.sql.hive.thriftServer.singleSession, spark.sql.extensions, spark.sql.debug, spark.sql.sources.schemaStringLengthThreshold, spark.sql.warehouse.dir] > SparkSessionBuilder shoud not propagate static sql configurations to the > existing active/default SparkSession > - > > Key: SPARK-31532 > URL: https://issues.apache.org/jira/browse/SPARK-31532 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Clearly, this is a bug.
> {code:java} > scala> spark.sql("set spark.sql.warehouse.dir").show > +++ > | key| value| > +++ > |spark.sql.warehou...|file:/Users/kenty...| > +++ > scala> spark.sql("set spark.sql.warehouse.dir=2"); > org.apache.spark.sql.AnalysisException: Cannot modify the value of a static > config: spark.sql.warehouse.dir; > at > org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154) > at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42) > at > org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100) > at > org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > ... 47 elided > scala> import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.SparkSession > scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get > getClass getOrCreate > scala> SparkSession.builder.config("spark.sql.warehouse.dir", > "xyz").getOrCreate > 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; > some configuration may not take effect. > res7: org.apache.spark.sql.SparkSession = > org.apache.spark.sql.SparkSession@6403d574 > scala> spark.sql("set spark.sql
[jira] [Commented] (SPARK-31532) SparkSessionBuilder shoud not propagate static sql configurations to the existing active/default SparkSession
[ https://issues.apache.org/jira/browse/SPARK-31532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091067#comment-17091067 ] JinxinTang commented on SPARK-31532: Thanks for your issue. The following configurations may not be modified after SparkSession startup, by design: [spark.sql.codegen.comments, spark.sql.queryExecutionListeners, spark.sql.catalogImplementation, spark.sql.subquery.maxThreadThreshold, spark.sql.globalTempDatabase, spark.sql.codegen.cache.maxEntries, spark.sql.filesourceTableRelationCacheSize, spark.sql.streaming.streamingQueryListeners, spark.sql.ui.retainedExecutions, spark.sql.hive.thriftServer.singleSession, spark.sql.extensions, spark.sql.debug, spark.sql.sources.schemaStringLengthThreshold, spark.sql.warehouse.dir] > SparkSessionBuilder shoud not propagate static sql configurations to the > existing active/default SparkSession > - > > Key: SPARK-31532 > URL: https://issues.apache.org/jira/browse/SPARK-31532 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Clearly, this is a bug.
> {code:java} > scala> spark.sql("set spark.sql.warehouse.dir").show > +++ > | key| value| > +++ > |spark.sql.warehou...|file:/Users/kenty...| > +++ > scala> spark.sql("set spark.sql.warehouse.dir=2"); > org.apache.spark.sql.AnalysisException: Cannot modify the value of a static > config: spark.sql.warehouse.dir; > at > org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154) > at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42) > at > org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100) > at > org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > ... 47 elided > scala> import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.SparkSession > scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get > getClass getOrCreate > scala> SparkSession.builder.config("spark.sql.warehouse.dir", > "xyz").getOrCreate > 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; > some configuration may not take effect. > res7: org.apache.spark.sql.SparkSession = > org.apache.spark.sql.SparkSession@6403d574 > scala> spark.sql("set spark.sql.warehouse.dir").show > ++-+ > | key|value| > ++-+ > |spark.sql.warehou...| xyz| > ++-+ > scala> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
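The behaviour the reporter expects can be sketched as a tiny `getOrCreate` that refuses to fold static configs into an already-active session and only warns, as the 3.0 builder's warning message suggests. The `Session`/`Builder` classes and `STATIC_CONFS` set below are simplified stand-ins, not Spark's implementation:

```python
import warnings

# Two of the static SQL confs listed in the comment above.
STATIC_CONFS = {"spark.sql.warehouse.dir", "spark.sql.catalogImplementation"}

class Session:
    _active = None  # singleton, like the active SparkSession

    def __init__(self, conf):
        self.conf = dict(conf)

class Builder:
    def __init__(self):
        self._options = {}

    def config(self, key, value):
        self._options[key] = value
        return self

    def get_or_create(self):
        if Session._active is None:
            Session._active = Session(self._options)
            return Session._active
        sess = Session._active
        for key, value in self._options.items():
            if key in STATIC_CONFS:
                # Expected behaviour: never propagate a static conf to an
                # existing session; warn instead of silently applying it.
                warnings.warn(f"static config {key} ignored for existing session")
            else:
                sess.conf[key] = value
        return sess

first = Builder().config("spark.sql.warehouse.dir", "file:/tmp/original").get_or_create()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    second = Builder().config("spark.sql.warehouse.dir", "xyz").get_or_create()
print(second.conf["spark.sql.warehouse.dir"])  # still "file:/tmp/original", not "xyz"
```

The bug in the transcript is the opposite: the second builder's static `spark.sql.warehouse.dir` leaked into the existing session, even though `SET` on the same conf is rejected as a static config.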
[jira] [Resolved] (SPARK-31542) Backport SPARK-25692 Remove static initialization of worker eventLoop handling chunk fetch requests within TransportContext. This fixes ChunkFetchIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-31542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-31542.
----------------------------------
    Resolution: Not A Problem

> Backport SPARK-25692 Remove static initialization of worker eventLoop handling chunk fetch requests within TransportContext. This fixes ChunkFetchIntegrationSuite as well
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-31542
> URL: https://issues.apache.org/jira/browse/SPARK-31542
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Priority: Major
>
> Backport SPARK-25692 Remove static initialization of worker eventLoop handling chunk fetch requests within TransportContext. This fixes ChunkFetchIntegrationSuite as well.
> While the test was only flaky in the 3.0 branch, it seems possible the same code path could be triggered in 2.4 so consider for backport.
[jira] [Commented] (SPARK-31542) Backport SPARK-25692 Remove static initialization of worker eventLoop handling chunk fetch requests within TransportContext. This fixes ChunkFetchIntegrationSuit
[ https://issues.apache.org/jira/browse/SPARK-31542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091060#comment-17091060 ]

Shixiong Zhu commented on SPARK-31542:
--------------------------------------

[~holden] The flaky test was caused by a new improvement in 3.0: SPARK-24355. It doesn't impact branch-2.4.

> Backport SPARK-25692 Remove static initialization of worker eventLoop handling chunk fetch requests within TransportContext. This fixes ChunkFetchIntegrationSuite as well
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-31542
> URL: https://issues.apache.org/jira/browse/SPARK-31542
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Priority: Major
>
> Backport SPARK-25692 Remove static initialization of worker eventLoop handling chunk fetch requests within TransportContext. This fixes ChunkFetchIntegrationSuite as well.
> While the test was only flaky in the 3.0 branch, it seems possible the same code path could be triggered in 2.4 so consider for backport.
[jira] [Commented] (SPARK-31464) Upgrade Kafka to 2.5.0
[ https://issues.apache.org/jira/browse/SPARK-31464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091051#comment-17091051 ]

Dongjoon Hyun commented on SPARK-31464:
---------------------------------------

Thank you, [~ijuma]! :)

> Upgrade Kafka to 2.5.0
> ----------------------
>
> Key: SPARK-31464
> URL: https://issues.apache.org/jira/browse/SPARK-31464
> Project: Spark
> Issue Type: Improvement
> Components: Build, Structured Streaming
> Affects Versions: 3.1.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Fix For: 3.1.0
[jira] [Resolved] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-27891.
----------------------------------
    Resolution: Cannot Reproduce

SPARK-23361 is in Spark 2.4.0, and the fix is not going to be backported to 2.3.x as 2.3.x is EOL - please reopen if anyone encounters this in 2.4.x.

> Long running spark jobs fail because of HDFS delegation token expires
> ---------------------------------------------------------------------
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
> Issue Type: Bug
> Components: Security
> Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
> Reporter: hemshankar sahu
> Priority: Critical
> Attachments: application_1559242207407_0001.log, spark_2.3.1_failure.log
>
> When a Spark job runs on a secured cluster for longer than the time set in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml, the job fails.
> The following command was used to submit the Spark job:
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py /tmp/ff1.txt
> Application logs attached.
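For long-running jobs on a Kerberized cluster, the remedy that SPARK-23361 automates behind `--principal`/`--keytab` is keytab-based login plus periodic re-login, so fresh delegation tokens can always be obtained. A hedged sketch of the underlying Hadoop mechanism (the principal and keytab path are placeholders, and real deployments should rely on spark-submit's flags rather than hand-rolling this):

{code:java}
import org.apache.hadoop.security.UserGroupInformation

// Placeholder principal/keytab; illustrates the UGI calls only.
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "acekrbuser@EXAMPLE.COM", "/home/user/keytabs/acekrbuser.keytab")

// Invoke periodically (e.g. from a scheduled thread): re-logs in from the
// keytab when the Kerberos TGT nears expiry, so token renewal keeps working.
ugi.checkTGTAndReloginFromKeytab()
{code}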
[jira] [Created] (SPARK-31547) Upgrade Genjavadoc to 0.16
Dongjoon Hyun created SPARK-31547:
-------------------------------------

Summary: Upgrade Genjavadoc to 0.16
Key: SPARK-31547
URL: https://issues.apache.org/jira/browse/SPARK-31547
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090996#comment-17090996 ]

Dongjoon Hyun commented on SPARK-25075:
---------------------------------------

Thank you for the updates, [~smarter].

> Build and test Spark against Scala 2.13
> ---------------------------------------
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
> Issue Type: Umbrella
> Components: Build, MLlib, Project Infra, Spark Core, SQL
> Affects Versions: 3.0.0
> Reporter: Guillaume Massé
> Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark against the current Scala 2.13 milestone.
[jira] [Commented] (SPARK-31540) Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-31540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090954#comment-17090954 ]

Holden Karau commented on SPARK-31540:
--------------------------------------

I was thinking some folks might build 2.4 with newer JDKs.

> Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-31540
> URL: https://issues.apache.org/jira/browse/SPARK-31540
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Priority: Major
>
> SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
[jira] [Commented] (SPARK-31540) Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-31540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090944#comment-17090944 ]

Sean R. Owen commented on SPARK-31540:
--------------------------------------

The backport is probably harmless, but why is it needed for 2.4.x? This helps JDK 11 compatibility, but 2.4 won't work with JDK 11.

> Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-31540
> URL: https://issues.apache.org/jira/browse/SPARK-31540
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Priority: Major
>
> SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
[jira] [Commented] (SPARK-31544) Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-31544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090928#comment-17090928 ]

Holden Karau commented on SPARK-31544:
--------------------------------------

Thanks!

> Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint
> ---------------------------------------------------------------------------
>
> Key: SPARK-31544
> URL: https://issues.apache.org/jira/browse/SPARK-31544
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Assignee: Dongjoon Hyun
> Priority: Major
>
> Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint
> cc [~dongjoon] for if you think this is a good candidate
[jira] [Commented] (SPARK-31544) Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-31544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090926#comment-17090926 ]

Dongjoon Hyun commented on SPARK-31544:
---------------------------------------

I made a PR, [~holden].
- https://github.com/apache/spark/pull/28320

> Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint
> ---------------------------------------------------------------------------
>
> Key: SPARK-31544
> URL: https://issues.apache.org/jira/browse/SPARK-31544
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Assignee: Dongjoon Hyun
> Priority: Major
>
> Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint
> cc [~dongjoon] for if you think this is a good candidate
[jira] [Commented] (SPARK-31540) Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-31540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090924#comment-17090924 ]

Holden Karau commented on SPARK-31540:
--------------------------------------

Gotcha, I'll go through the backport JIRAs and link.

> Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-31540
> URL: https://issues.apache.org/jira/browse/SPARK-31540
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Priority: Major
>
> SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
[jira] [Commented] (SPARK-31540) Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-31540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090922#comment-17090922 ]

Dongjoon Hyun commented on SPARK-31540:
---------------------------------------

cc [~srowen]

> Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-31540
> URL: https://issues.apache.org/jira/browse/SPARK-31540
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Priority: Major
>
> SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
[jira] [Commented] (SPARK-31540) Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-31540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090921#comment-17090921 ]

Dongjoon Hyun commented on SPARK-31540:
---------------------------------------

[~holden]. Could you link the original JIRA additionally? Embedding it in the description doesn't provide bi-directional visibility.

> Backport SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-31540
> URL: https://issues.apache.org/jira/browse/SPARK-31540
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Priority: Major
>
> SPARK-27981 Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` in JDK9+
[jira] [Commented] (SPARK-31544) Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-31544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090917#comment-17090917 ]

Dongjoon Hyun commented on SPARK-31544:
---------------------------------------

BTW, I kept the original authorship from the beginning. This will be the same for `branch-2.4`.

> Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint
> ---------------------------------------------------------------------------
>
> Key: SPARK-31544
> URL: https://issues.apache.org/jira/browse/SPARK-31544
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Assignee: Dongjoon Hyun
> Priority: Major
>
> Backport SPARK-30199 Recover `spark.(ui|blockManager).port` from checkpoint
> cc [~dongjoon] for if you think this is a good candidate
[jira] [Commented] (SPARK-31539) Backport SPARK-27138 Remove AdminUtils calls (fixes deprecation)
[ https://issues.apache.org/jira/browse/SPARK-31539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090915#comment-17090915 ]

Holden Karau commented on SPARK-31539:
--------------------------------------

The main reason I'd see to backport the change is that it's test-only, and it might be useful if someone wants to build & test with a newer Kafka library. But now that I think about it some more, it's probably not worth it; I'll close as Won't Fix.

> Backport SPARK-27138 Remove AdminUtils calls (fixes deprecation)
> ----------------------------------------------------------------
>
> Key: SPARK-31539
> URL: https://issues.apache.org/jira/browse/SPARK-31539
> Project: Spark
> Issue Type: Improvement
> Components: Tests
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Priority: Major
>
> SPARK-27138 Remove AdminUtils calls (fixes deprecation)
[jira] [Resolved] (SPARK-31539) Backport SPARK-27138 Remove AdminUtils calls (fixes deprecation)
[ https://issues.apache.org/jira/browse/SPARK-31539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Holden Karau resolved SPARK-31539.
----------------------------------
    Resolution: Won't Fix

> Backport SPARK-27138 Remove AdminUtils calls (fixes deprecation)
> ----------------------------------------------------------------
>
> Key: SPARK-31539
> URL: https://issues.apache.org/jira/browse/SPARK-31539
> Project: Spark
> Issue Type: Improvement
> Components: Tests
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Priority: Major
>
> SPARK-27138 Remove AdminUtils calls (fixes deprecation)
[jira] [Commented] (SPARK-31539) Backport SPARK-27138 Remove AdminUtils calls (fixes deprecation)
[ https://issues.apache.org/jira/browse/SPARK-31539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090914#comment-17090914 ]

Holden Karau commented on SPARK-31539:
--------------------------------------

The other Jira is marked as resolved and I want to track the outstanding issues for 2.4.6 to make sure we don't leave anything behind.

> Backport SPARK-27138 Remove AdminUtils calls (fixes deprecation)
> ----------------------------------------------------------------
>
> Key: SPARK-31539
> URL: https://issues.apache.org/jira/browse/SPARK-31539
> Project: Spark
> Issue Type: Improvement
> Components: Tests
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Priority: Major
>
> SPARK-27138 Remove AdminUtils calls (fixes deprecation)
[jira] [Updated] (SPARK-31485) Barrier stage can hang if only partial tasks launched
[ https://issues.apache.org/jira/browse/SPARK-31485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Holden Karau updated SPARK-31485:
---------------------------------
    Shepherd: Holden Karau

> Barrier stage can hang if only partial tasks launched
> -----------------------------------------------------
>
> Key: SPARK-31485
> URL: https://issues.apache.org/jira/browse/SPARK-31485
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: wuyi
> Priority: Major
>
> The issue can be reproduced by the following test:
>
> {code:java}
> initLocalClusterSparkContext(2)
> val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2)
> val dep = new OneToOneDependency[Int](rdd0)
> val rdd = new MyRDD(sc, 2, List(dep), Seq(Seq("executor_h_0"), Seq("executor_h_0")))
> rdd.barrier().mapPartitions { iter =>
>   BarrierTaskContext.get().barrier()
>   iter
> }.collect()
> {code}
[jira] [Updated] (SPARK-31485) Barrier stage can hang if only partial tasks launched
[ https://issues.apache.org/jira/browse/SPARK-31485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Holden Karau updated SPARK-31485:
---------------------------------
    Target Version/s: 2.4.6, 3.0.0

> Barrier stage can hang if only partial tasks launched
> -----------------------------------------------------
>
> Key: SPARK-31485
> URL: https://issues.apache.org/jira/browse/SPARK-31485
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: wuyi
> Priority: Major
>
> The issue can be reproduced by the following test:
>
> {code:java}
> initLocalClusterSparkContext(2)
> val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2)
> val dep = new OneToOneDependency[Int](rdd0)
> val rdd = new MyRDD(sc, 2, List(dep), Seq(Seq("executor_h_0"), Seq("executor_h_0")))
> rdd.barrier().mapPartitions { iter =>
>   BarrierTaskContext.get().barrier()
>   iter
> }.collect()
> {code}
[jira] [Commented] (SPARK-31543) Backport SPARK-26306 More memory to de-flake SorterSuite
[ https://issues.apache.org/jira/browse/SPARK-31543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090911#comment-17090911 ]

Sean R. Owen commented on SPARK-31543:
--------------------------------------

Does it need to be a new JIRA? But if it's a simple test change, I think it's plausible to back-port, if it affects 2.4.x.

> Backport SPARK-26306 More memory to de-flake SorterSuite
> --------------------------------------------------------
>
> Key: SPARK-31543
> URL: https://issues.apache.org/jira/browse/SPARK-31543
> Project: Spark
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.4.6
> Reporter: Holden Karau
> Priority: Major
>
> SPARK-26306 More memory to de-flake SorterSuite