[jira] [Commented] (SPARK-36539) trimNonTopLevelAlias should not change StructType inner alias
[ https://issues.apache.org/jira/browse/SPARK-36539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400818#comment-17400818 ]

Apache Spark commented on SPARK-36539:
--------------------------------------

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33778

> trimNonTopLevelAlias should not change StructType inner alias
> -------------------------------------------------------------
>
>                 Key: SPARK-36539
>                 URL: https://issues.apache.org/jira/browse/SPARK-36539
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: angerszhu
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36539) trimNonTopLevelAlias should not change StructType inner alias
[ https://issues.apache.org/jira/browse/SPARK-36539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36539:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-36539) trimNonTopLevelAlias should not change StructType inner alias
[ https://issues.apache.org/jira/browse/SPARK-36539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36539:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-36539) trimNonTopLevelAlias should not change StructType inner alias
[ https://issues.apache.org/jira/browse/SPARK-36539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400816#comment-17400816 ]

Apache Spark commented on SPARK-36539:
--------------------------------------

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33778
[jira] [Created] (SPARK-36539) trimNonTopLevelAlias should not change StructType inner alias
angerszhu created SPARK-36539:
---------------------------------

             Summary: trimNonTopLevelAlias should not change StructType inner alias
                 Key: SPARK-36539
                 URL: https://issues.apache.org/jira/browse/SPARK-36539
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: angerszhu
[jira] [Commented] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400805#comment-17400805 ]

L. C. Hsieh commented on SPARK-36465:
-------------------------------------

Thanks [~Gengliang.Wang]!

> Dynamic gap duration in session window
> --------------------------------------
>
>                 Key: SPARK-36465
>                 URL: https://issues.apache.org/jira/browse/SPARK-36465
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 3.2.0
>            Reporter: L. C. Hsieh
>            Assignee: L. C. Hsieh
>            Priority: Major
>             Fix For: 3.2.0
>
>
> The gap duration used in session windows is currently a static value. To support more complex usage, it is better to support a dynamic gap duration that is determined by looking at the current data.
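[Editor's note] The dynamic gap semantics described in SPARK-36465 can be sketched outside Spark. In Spark itself this is exposed through `session_window` with a column-valued gap expression; the pure-Python sketch below is only an illustration of the semantics (the function and event names are hypothetical, not Spark API):

```python
from datetime import datetime, timedelta

def sessionize(events, gap_for):
    """Group (timestamp, payload) events into session windows, where the
    allowed silence after each event is computed from that event's data."""
    sessions = []
    current = []
    deadline = None  # latest timestamp still belonging to the open session
    for ts, payload in sorted(events):
        if deadline is not None and ts > deadline:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append((ts, payload))
        deadline = ts + gap_for(payload)  # gap depends on the current row
    if current:
        sessions.append(current)
    return sessions

t0 = datetime(2021, 8, 18, 12, 0, 0)
events = [
    (t0, "click"),                             # short gap: 5 s
    (t0 + timedelta(seconds=3), "click"),      # within 5 s -> same session
    (t0 + timedelta(seconds=20), "purchase"),  # new session, long gap: 60 s
    (t0 + timedelta(seconds=50), "click"),     # within 60 s -> same session
]
gaps = {"click": timedelta(seconds=5), "purchase": timedelta(seconds=60)}
result = sessionize(events, lambda payload: gaps[payload])
```

Here a "purchase" extends the session deadline far more than a "click" does, so the last two events land in one session even though they are 30 seconds apart, which a static 5-second gap could not express.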
[jira] [Comment Edited] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400803#comment-17400803 ]

Gengliang Wang edited comment on SPARK-36465 at 8/18/21, 5:51 AM:
------------------------------------------------------------------

[~viirya] [~kabhwan] FYI I converted this one as a sub-task of SPARK-10816.

was (Author: gengliang.wang):
[~viirya][~kabhwan]FYI I converted this one as a sub-task of SPARK-10816.
[jira] [Commented] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400803#comment-17400803 ]

Gengliang Wang commented on SPARK-36465:
----------------------------------------

[~viirya][~kabhwan]FYI I converted this one as a sub-task of SPARK-10816.
[jira] [Updated] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang updated SPARK-36465:
-----------------------------------
        Parent: SPARK-10816
    Issue Type: Sub-task  (was: Improvement)
[jira] [Commented] (SPARK-36538) Environment variables part in config doc isn't properly documented.
[ https://issues.apache.org/jira/browse/SPARK-36538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400791#comment-17400791 ]

Apache Spark commented on SPARK-36538:
--------------------------------------

User 'yutoacts' has created a pull request for this issue:
https://github.com/apache/spark/pull/33777

> Environment variables part in config doc isn't properly documented.
> -------------------------------------------------------------------
>
>                 Key: SPARK-36538
>                 URL: https://issues.apache.org/jira/browse/SPARK-36538
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 3.1.2
>            Reporter: Yuto Akutsu
>            Priority: Major
>
> It says environment variables are not reflected through spark-env.sh in YARN cluster mode but I believe they are. I think this part of the document should be removed.
> [https://spark.apache.org/docs/latest/configuration.html#environment-variables]
[jira] [Assigned] (SPARK-36538) Environment variables part in config doc isn't properly documented.
[ https://issues.apache.org/jira/browse/SPARK-36538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36538:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-36538) Environment variables part in config doc isn't properly documented.
[ https://issues.apache.org/jira/browse/SPARK-36538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36538:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-36386) Fix DataFrame groupby-expanding to follow pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400777#comment-17400777 ]

Apache Spark commented on SPARK-36386:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33776

> Fix DataFrame groupby-expanding to follow pandas 1.3
> ----------------------------------------------------
>
>                 Key: SPARK-36386
>                 URL: https://issues.apache.org/jira/browse/SPARK-36386
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Takuya Ueshin
>            Assignee: Haejoon Lee
>            Priority: Major
>             Fix For: 3.3.0
>
[jira] [Commented] (SPARK-36388) Fix DataFrame groupby-rolling to follow pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400774#comment-17400774 ]

Apache Spark commented on SPARK-36388:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33776

> Fix DataFrame groupby-rolling to follow pandas 1.3
> --------------------------------------------------
>
>                 Key: SPARK-36388
>                 URL: https://issues.apache.org/jira/browse/SPARK-36388
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Takuya Ueshin
>            Assignee: Haejoon Lee
>            Priority: Major
>             Fix For: 3.3.0
>
[jira] [Commented] (SPARK-36386) Fix DataFrame groupby-expanding to follow pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400775#comment-17400775 ]

Apache Spark commented on SPARK-36386:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33776
[jira] [Resolved] (SPARK-36398) Redact sensitive information in Spark Thrift Server log
[ https://issues.apache.org/jira/browse/SPARK-36398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta resolved SPARK-36398.
------------------------------------
    Fix Version/s: 3.1.3
                   3.2.0
         Assignee: Kousuke Saruta
       Resolution: Fixed

> Redact sensitive information in Spark Thrift Server log
> -------------------------------------------------------
>
>                 Key: SPARK-36398
>                 URL: https://issues.apache.org/jira/browse/SPARK-36398
>             Project: Spark
>          Issue Type: Bug
>          Components: Security, SQL
>    Affects Versions: 3.1.2
>            Reporter: Denis Krivenko
>            Assignee: Kousuke Saruta
>            Priority: Major
>             Fix For: 3.2.0, 3.1.3
>
> Spark Thrift Server logs the query without redacting sensitive information in [org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.scala|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L188]
> {code:scala}
> override def runInternal(): Unit = {
>   setState(OperationState.PENDING)
>   logInfo(s"Submitting query '$statement' with $statementId")
> {code}
> Logs:
> {code:sh}
> 21/08/03 20:49:46 INFO SparkExecuteStatementOperation: Submitting query
> 'CREATE OR REPLACE TEMPORARY VIEW test_view
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url="jdbc:mysql://example.com:3306",
>   driver="com.mysql.jdbc.Driver",
>   dbtable="example.test",
>   user="my_username",
>   password="my_password"
> )' with 37e5d2cb-aa96-407e-b589-7cb212324100
> 21/08/03 20:49:46 INFO SparkExecuteStatementOperation: Running query with 37e5d2cb-aa96-407e-b589-7cb212324100
> {code}
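[Editor's note] The fix for SPARK-36398 redacts sensitive values before the statement reaches the log. As a rough illustration of the idea only (not Spark's actual implementation, which is driven by the `spark.sql.redaction.string.regex` configuration), a regex-based redactor could look like this; the key list here is hypothetical:

```python
import re

# Mask the values of options whose key looks sensitive; which keys count
# as sensitive is an assumption for this sketch.
SENSITIVE = re.compile(r'(?i)\b(password|secret|token|user)\s*=\s*"[^"]*"')

def redact(statement: str) -> str:
    """Replace sensitive key="value" pairs in a SQL statement with a
    placeholder before the statement is logged."""
    return SENSITIVE.sub(
        lambda m: f'{m.group(1)}="*********(redacted)"', statement
    )

query = 'OPTIONS (url="jdbc:mysql://example.com:3306", user="my_username", password="my_password")'
safe = redact(query)
```

Logging `safe` instead of `query` keeps the connection URL visible for debugging while hiding the credentials.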
[jira] [Resolved] (SPARK-36400) Redact sensitive information in Spark Thrift Server UI
[ https://issues.apache.org/jira/browse/SPARK-36400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta resolved SPARK-36400.
------------------------------------
    Fix Version/s: 3.1.3
                   3.2.0
         Assignee: Kousuke Saruta
       Resolution: Fixed

> Redact sensitive information in Spark Thrift Server UI
> ------------------------------------------------------
>
>                 Key: SPARK-36400
>                 URL: https://issues.apache.org/jira/browse/SPARK-36400
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Web UI
>    Affects Versions: 3.1.2
>            Reporter: Denis Krivenko
>            Assignee: Kousuke Saruta
>            Priority: Major
>             Fix For: 3.2.0, 3.1.3
>
>         Attachments: SQL Statistics.png
>
> Spark UI displays sensitive information on the "JDBC/ODBC Server" tab.
> The cause of the issue is in the [org.apache.spark.sql.hive.thriftserver.ui.SqlStatsPagedTable|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L166] class, [here|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L266-L268]:
> {code:scala}
> {info.statement}
> {code}
[jira] [Updated] (SPARK-36538) Environment variables part in config doc isn't properly documented.
[ https://issues.apache.org/jira/browse/SPARK-36538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuto Akutsu updated SPARK-36538:
--------------------------------
    Description: 
It says environment variables are not reflected through spark-env.sh in YARN cluster mode but I believe they are. I think this part of the document should be removed.
[https://spark.apache.org/docs/latest/configuration.html#environment-variables]

  was:
It says environment variables are not reflected through spark-env.sh in YARN cluster mode although they are. I think this part of the document should be removed.
https://spark.apache.org/docs/latest/configuration.html#environment-variables
[jira] [Commented] (SPARK-36428) the 'seconds' parameter of 'make_timestamp' should accept integer type
[ https://issues.apache.org/jira/browse/SPARK-36428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400740#comment-17400740 ]

Apache Spark commented on SPARK-36428:
--------------------------------------

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/33775

> the 'seconds' parameter of 'make_timestamp' should accept integer type
> ----------------------------------------------------------------------
>
>                 Key: SPARK-36428
>                 URL: https://issues.apache.org/jira/browse/SPARK-36428
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Wenchen Fan
>            Assignee: Apache Spark
>            Priority: Major
>             Fix For: 3.2.0
>
> With ANSI mode, {{SELECT make_timestamp(1, 1, 1, 1, 1, 1)}} fails, because the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be implicitly cast to DECIMAL(8,6) under ANSI mode.
> We should update the function {{make_timestamp}} to allow an integer-typed 'seconds' parameter.
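[Editor's note] DECIMAL(8,6) fits values with up to 2 integral and 6 fractional digits, which covers 0 <= seconds < 60 at microsecond precision. The hypothetical helper below sketches the intended behavior change — widening an integral 'seconds' to that representation — and is illustrative only, not Spark's actual coercion code:

```python
from decimal import Decimal

def make_timestamp_seconds(seconds):
    """Coerce the 'seconds' argument: accept an integral value as well as
    a DECIMAL(8,6)-style fractional value (sketch, not Spark code)."""
    if isinstance(seconds, int):
        seconds = Decimal(seconds)  # widen INT -> DECIMAL, the requested fix
    if not isinstance(seconds, Decimal):
        raise TypeError("seconds must be INT or DECIMAL(8,6)")
    if not (0 <= seconds < 60):
        raise ValueError("seconds out of range for a timestamp")
    # Normalize to 6 fractional digits, matching DECIMAL(8,6) scale.
    return seconds.quantize(Decimal("0.000001"))
```

With this widening in place, an integral argument such as `1` behaves like `1.000000`, so `make_timestamp(1, 1, 1, 1, 1, 1)` no longer needs an implicit INT-to-DECIMAL cast that ANSI mode forbids.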
[jira] [Created] (SPARK-36538) Environment variables part in config doc isn't properly documented.
Yuto Akutsu created SPARK-36538:
-----------------------------------

             Summary: Environment variables part in config doc isn't properly documented.
                 Key: SPARK-36538
                 URL: https://issues.apache.org/jira/browse/SPARK-36538
             Project: Spark
          Issue Type: Documentation
          Components: Documentation
    Affects Versions: 3.1.2
            Reporter: Yuto Akutsu

It says environment variables are not reflected through spark-env.sh in YARN cluster mode although they are. I think this part of the document should be removed.
https://spark.apache.org/docs/latest/configuration.html#environment-variables
[jira] [Commented] (SPARK-36519) Store the RocksDB format in the checkpoint for a streaming query
[ https://issues.apache.org/jira/browse/SPARK-36519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400736#comment-17400736 ]

Gengliang Wang commented on SPARK-36519:
----------------------------------------

[~zsxwing] FYI I am converting this one as a sub-task of SPARK-34198

> Store the RocksDB format in the checkpoint for a streaming query
> ----------------------------------------------------------------
>
>                 Key: SPARK-36519
>                 URL: https://issues.apache.org/jira/browse/SPARK-36519
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 3.2.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>            Priority: Major
>
> RocksDB provides backward compatibility but it doesn't always provide forward compatibility. It's better to store the RocksDB format version in the checkpoint, so that it gives us more information to provide the rollback guarantee when we upgrade to a RocksDB version that may introduce an incompatible change in a new Spark version.
> A typical case is when a user upgrades their query to a new Spark version, and this new Spark version has a new RocksDB version which may use a new format. But the user hits some bug and decides to roll back. In the old Spark version, the old RocksDB version cannot read the new format.
> In order to handle this case, we will write the RocksDB format version to the checkpoint. When restarting from a checkpoint, we will force RocksDB to use the format version stored in the checkpoint. This ensures the user can roll back their Spark version if needed.
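[Editor's note] The rollback scheme described in SPARK-36519 boils down to pinning a format version when the checkpoint is first created and forcing that version on every restart. A minimal sketch, assuming a hypothetical `state_format.json` file inside the checkpoint directory (the real implementation lives in Spark's RocksDB state store provider and uses a different layout):

```python
import json
from pathlib import Path

def write_format_version(checkpoint_dir: str, version: int) -> None:
    """Record the state-store file format version in the checkpoint so a
    restarted query keeps writing the format it started with."""
    meta = Path(checkpoint_dir) / "state_format.json"
    meta.write_text(json.dumps({"rocksdbFormatVersion": version}))

def read_format_version(checkpoint_dir: str, default: int) -> int:
    """On restart, force the version stored in the checkpoint; fall back
    to the engine default only for brand-new checkpoints."""
    meta = Path(checkpoint_dir) / "state_format.json"
    if meta.exists():
        return json.loads(meta.read_text())["rocksdbFormatVersion"]
    return default
```

Because the pinned version always wins over the engine default, a query started on an old Spark version keeps producing old-format files even after an upgrade, so rolling the Spark version back remains safe.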
[jira] [Updated] (SPARK-36519) Store the RocksDB format in the checkpoint for a streaming query
[ https://issues.apache.org/jira/browse/SPARK-36519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang updated SPARK-36519:
-----------------------------------
        Parent: SPARK-34198
    Issue Type: Sub-task  (was: Improvement)
[jira] [Commented] (SPARK-36303) Refactor fourteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400717#comment-17400717 ]

Apache Spark commented on SPARK-36303:
--------------------------------------

User 'dgd-contributor' has created a pull request for this issue:
https://github.com/apache/spark/pull/33774

> Refactor fourteenth set of 20 query execution errors to use error classes
> -------------------------------------------------------------------------
>
>                 Key: SPARK-36303
>                 URL: https://issues.apache.org/jira/browse/SPARK-36303
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core, SQL
>    Affects Versions: 3.2.0
>            Reporter: Karen Feng
>            Priority: Major
>
> Refactor some exceptions in [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on the fourteenth set of 20:
> {code:java}
> cannotGetEventTimeWatermarkError
> cannotSetTimeoutTimestampError
> batchMetadataFileNotFoundError
> multiStreamingQueriesUsingPathConcurrentlyError
> addFilesWithAbsolutePathUnsupportedError
> microBatchUnsupportedByDataSourceError
> cannotExecuteStreamingRelationExecError
> invalidStreamingOutputModeError
> catalogPluginClassNotFoundError
> catalogPluginClassNotImplementedError
> catalogPluginClassNotFoundForCatalogError
> catalogFailToFindPublicNoArgConstructorError
> catalogFailToCallPublicNoArgConstructorError
> cannotInstantiateAbstractCatalogPluginClassError
> failedToInstantiateConstructorForCatalogError
> noSuchElementExceptionError
> noSuchElementExceptionError
> cannotMutateReadOnlySQLConfError
> cannotCloneOrCopyReadOnlySQLConfError
> cannotGetSQLConfInSchedulerEventLoopThreadError
> {code}
> For more detail, see the parent ticket SPARK-36094.
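[Editor's note] The error-class pattern this refactoring moves toward identifies each error by a named class plus a parameterized message template, instead of an ad-hoc hard-coded string per call site (in Spark the templates live in a JSON resource file). A minimal sketch of the mechanism, with hypothetical class names and templates:

```python
# Registry of error classes -> message templates; the entries here are
# illustrative, not the templates Spark actually ships.
ERROR_CLASSES = {
    "CANNOT_MUTATE_READ_ONLY_SQL_CONF": "Cannot mutate ReadOnlySQLConf.",
    "BATCH_METADATA_FILE_NOT_FOUND": "Batch metadata file {path} not found.",
}

class SparkError(Exception):
    """Exception carrying a stable error-class identifier plus a message
    rendered from that class's template and the given parameters."""

    def __init__(self, error_class: str, **params: str):
        self.error_class = error_class
        message = ERROR_CLASSES[error_class].format(**params)
        super().__init__(f"[{error_class}] {message}")

err = SparkError("BATCH_METADATA_FILE_NOT_FOUND", path="/tmp/offsets/0")
```

The stable identifier is what makes this refactoring worthwhile: callers and tests can match on `error_class` while the human-readable template can evolve independently.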
[jira] [Assigned] (SPARK-36303) Refactor fourteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36303:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-36303) Refactor fourteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36303:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-36303) Refactor fourteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400716#comment-17400716 ]

Apache Spark commented on SPARK-36303:
--------------------------------------

User 'dgd-contributor' has created a pull request for this issue:
https://github.com/apache/spark/pull/33774
[jira] [Commented] (SPARK-34309) Use Caffeine instead of Guava Cache
[ https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400696#comment-17400696 ] Apache Spark commented on SPARK-34309: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/33772 > Use Caffeine instead of Guava Cache > --- > > Key: SPARK-34309 > URL: https://issues.apache.org/jira/browse/SPARK-34309 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png > > > Caffeine is a high-performance, near-optimal caching library based on Java 8. It is used in a similar way to Guava Cache, but with better performance; comparison results are available on the [Caffeine benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page. > At the same time, Caffeine has been used in some open source projects such as Cassandra, HBase, Neo4j, Druid, Spring, and so on. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34415) Use randomization as a possibly better technique than grid search in optimizing hyperparameters
[ https://issues.apache.org/jira/browse/SPARK-34415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400655#comment-17400655 ] Sean R. Owen commented on SPARK-34415: -- I agree; I think it was mostly because it makes it simple to extend and reuse the param grid builder rather than reimplement a fair bit more code that uses it. It isn't as useful as generating random samples each time. Hm, on a second look though, couldn't the new class override build() to generate a bunch of actually randomly-sampled combinations? That part is easy, I think, but then the question is: how many combinations to return? That would need a new API somewhere. You could argue this is a bit misleading, as the caller may expect it to generate random samples, not randomly generate the grid. Hm, I'm retroactively on the fence about it. Is it worth trying to redesign quickly for 3.2.0? Maybe a small impl and API change can support what this might be expected to do. Leave it? Revert? > Use randomization as a possibly better technique than grid search in > optimizing hyperparameters > --- > > Key: SPARK-34415 > URL: https://issues.apache.org/jira/browse/SPARK-34415 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 3.0.1 >Reporter: Phillip Henry >Assignee: Phillip Henry >Priority: Minor > Labels: pull-request-available > Fix For: 3.2.0 > > > Randomization can be a more effective technique than a grid search in > finding optimal hyperparameters, since min/max points can fall between the > grid lines and never be found. Randomisation is not so restricted, although > the probability of finding minima/maxima is dependent on the number of > attempts.
> Alice Zheng has an accessible description on how this technique works at > [https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html] > (Note that I have a PR for this work outstanding at > [https://github.com/apache/spark/pull/31535] ) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34415) Use randomization as a possibly better technique than grid search in optimizing hyperparameters
[ https://issues.apache.org/jira/browse/SPARK-34415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400644#comment-17400644 ] Xiangrui Meng commented on SPARK-34415: --- [~phenry] [~srowen] The implementation doesn't do uniform sampling of the hyper-parameter search space. Instead, it samples per param and then constructs the cartesian product of all combinations. I think this would significantly reduce the effectiveness of the random search. Was it already discussed? > Use randomization as a possibly better technique than grid search in > optimizing hyperparameters > --- > > Key: SPARK-34415 > URL: https://issues.apache.org/jira/browse/SPARK-34415 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 3.0.1 >Reporter: Phillip Henry >Assignee: Phillip Henry >Priority: Minor > Labels: pull-request-available > Fix For: 3.2.0 > > > Randomization can be a more effective technique than a grid search in > finding optimal hyperparameters, since min/max points can fall between the > grid lines and never be found. Randomisation is not so restricted, although > the probability of finding minima/maxima is dependent on the number of > attempts. > Alice Zheng has an accessible description of how this technique works at > [https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html] > (Note that I have a PR for this work outstanding at > [https://github.com/apache/spark/pull/31535] ) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
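The distinction Xiangrui Meng raises can be made concrete with a small, self-contained Python sketch (not Spark code; the parameter names are illustrative): sampling a few values per parameter and taking the cartesian product confines every candidate to a small grid, whereas true random search draws each candidate independently from the full space.

```python
import itertools
import random

def per_param_then_grid(param_space, seed=0):
    """The criticized approach: sample 2 values per parameter, then take the
    cartesian product. All candidates reuse the same few per-param values."""
    rng = random.Random(seed)
    sampled = {name: [rng.uniform(lo, hi) for _ in range(2)]
               for name, (lo, hi) in param_space.items()}
    names = list(sampled)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(sampled[n] for n in names))]

def uniform_random_search(param_space, n, seed=0):
    """True random search: every candidate is an independent uniform draw
    from the full search space."""
    rng = random.Random(seed)
    return [{name: rng.uniform(lo, hi) for name, (lo, hi) in param_space.items()}
            for _ in range(n)]

space = {"regParam": (0.0, 1.0), "elasticNetParam": (0.0, 1.0)}
grid = per_param_then_grid(space)          # 4 candidates on a 2x2 grid
rand = uniform_random_search(space, n=4)   # 4 candidates spread over the space
```

In the grid variant only 2 distinct values of each parameter ever appear across all 4 candidates, which is why the randomly-generated grid explores far less of the space than 4 independent draws.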
[jira] [Updated] (SPARK-36493) Skip Retrieving keytab with SparkFiles.get if keytab found in the CWD of Yarn Container
[ https://issues.apache.org/jira/browse/SPARK-36493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zikun updated SPARK-36493: -- Description: Currently we have the logic to deal with the JDBC keytab provided by the "--files" option:

if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty) {
  val result = SparkFiles.get(keytabParam)
  logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result")
  result
} else {
  logDebug("Keytab path found, assuming manual upload")
  keytabParam
}

Spark has already created a soft link for any file submitted by the "--files" option. Here is an example.

testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab

So there is no need to call SparkFiles.get to get the absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path. Moreover, SparkFiles.get will return a wrong keytab path for the driver in cluster mode. In cluster mode, the keytab is available at the following location for both the driver and executors:

/var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/container_1628584679772_0030_01_01/testusera1.keytab

but SparkFiles.get brings the following wrong location for the driver:

/var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab

was: (previous description; identical except it said the keytab "is distributed to", rather than "is available at", the container location)

> Skip Retrieving keytab with SparkFiles.get if keytab found in the CWD of Yarn Container
> ---
>
> Key: SPARK-36493
> URL: https://issues.apache.org/jira/browse/SPARK-36493
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.0, 3.1.2
> Reporter: Zikun
> Priority: Major
> Fix For: 3.1.3
>
> Currently we have the logic to deal with the JDBC keytab provided by the "--files" option:
>
> if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty) {
>   val result = SparkFiles.get(keytabParam)
>   logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result")
>   result
> } else {
>   logDebug("Keytab path found, assuming manual upload")
>   keytabParam
> }
>
> Spark has already created a soft link for any file submitted by the "--files" option. Here is an example.
>
> testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab
>
> So there is no need to call SparkFiles.get to get the absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path.
>
> Moreover, SparkFiles.get will return a wrong keytab path for the driver in cluster mode. In cluster mode, the keytab is available at the following location for both the driver and executors:
>
> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/container_1628584679772_0030_01_01/testusera1.keytab
>
> bu
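The resolution order the ticket proposes can be sketched in a few lines of Python (illustrative only; `resolve_keytab` is a hypothetical helper, not Spark's code): if the option value is a bare file name that already exists in the container's working directory — where YARN symlinks every `--files` entry — use it directly rather than asking SparkFiles.get for a path.

```python
import os

def resolve_keytab(keytab_param, cwd):
    """Hypothetical sketch of the proposed check: prefer the YARN-localized
    soft link in the container's CWD over a SparkFiles.get lookup."""
    if keytab_param is None:
        return None
    if not os.path.dirname(keytab_param):
        # Bare file name: YARN localization symlinks --files entries here.
        candidate = os.path.join(cwd, keytab_param)
        if os.path.exists(candidate):
            return candidate
    # An explicit path was given (manual upload): use it as-is.
    return keytab_param
```

This sidesteps the cluster-mode discrepancy described above, because the container CWD path is the same for the driver and the executors.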
[jira] [Resolved] (SPARK-36535) refine the sql reference doc
[ https://issues.apache.org/jira/browse/SPARK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36535. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33767 [https://github.com/apache/spark/pull/33767] > refine the sql reference doc > > > Key: SPARK-36535 > URL: https://issues.apache.org/jira/browse/SPARK-36535 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36535) refine the sql reference doc
[ https://issues.apache.org/jira/browse/SPARK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36535: - Assignee: Wenchen Fan > refine the sql reference doc > > > Key: SPARK-36535 > URL: https://issues.apache.org/jira/browse/SPARK-36535 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36537) Take care of other tests disabled related to inplace updates with CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-36537: -- Description: There are some more tests disabled related to inplace updates with CategoricalDtype. They seem like pandas' bugs or not maintained anymore because inplace updates with CategoricalDtype are deprecated. was:There are some more tests disabled with a marker {{TODO(SPARK-36367)}}. > Take care of other tests disabled related to inplace updates with > CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled related to inplace updates with > CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace updates > with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36537) Take care of other tests disabled related to inplace updates with CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-36537: -- Summary: Take care of other tests disabled related to inplace updates with CategoricalDtype. (was: Take care of other tests disabled.) > Take care of other tests disabled related to inplace updates with > CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled with a marker {{TODO(SPARK-36367)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36537) Take care of other tests disabled.
Takuya Ueshin created SPARK-36537: - Summary: Take care of other tests disabled. Key: SPARK-36537 URL: https://issues.apache.org/jira/browse/SPARK-36537 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin There are some more tests disabled with a marker {{TODO(SPARK-36367)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35011) Avoid Block Manager registrations when StopExecutor msg is in-flight.
[ https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400579#comment-17400579 ] Apache Spark commented on SPARK-35011: -- User 'sumeetgajjar' has created a pull request for this issue: https://github.com/apache/spark/pull/33771 > Avoid Block Manager registrations when StopExecutor msg is in-flight. > -- > > Key: SPARK-35011 > URL: https://issues.apache.org/jira/browse/SPARK-35011 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.1, 3.2.0 >Reporter: Sumeet >Assignee: Sumeet >Priority: Major > Labels: BlockManager, core > Fix For: 3.2.0 > > > *Note:* This is a follow-up on SPARK-34949: even after the heartbeat fix, the > driver reports dead executors as alive. > *Problem:* > I was testing Dynamic Allocation on K8s with about 300 executors. While doing > so, when the executors were torn down due to > "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor > pods being removed from K8s; however, under the "Executors" tab in SparkUI, I > could still see some executors listed as alive. > [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100] > also returned a value greater than 1. > > *Cause:* > * "CoarseGrainedSchedulerBackend" issues an async "StopExecutor" on the > executorEndpoint > * "CoarseGrainedSchedulerBackend" removes that executor from the Driver's > internal data structures and publishes "SparkListenerExecutorRemoved" on the > "listenerBus" > * The Executor has still not processed "StopExecutor" from the Driver > * The Driver receives a heartbeat from the Executor; since it cannot find the > "executorId" in its data structures, it responds with > "HeartbeatResponse(reregisterBlockManager = true)" > * The "BlockManager" on the Executor re-registers with the "BlockManagerMaster", > and "SparkListenerBlockManagerAdded" is published on the "listenerBus" > * The Executor starts processing the "StopExecutor" and exits > * "AppStatusListener" picks up the "SparkListenerBlockManagerAdded" event and > updates "AppStatusStore" > * "statusTracker.getExecutorInfos" consults "AppStatusStore" to get the list > of executors, which returns the dead executor as alive. > > *Proposed Solution:* > Maintain a cache of recently removed executors on the Driver. During registration in > BlockManagerMasterEndpoint, if the BlockManager belongs to a > recently removed executor, return None, indicating the registration is ignored, > since the executor will be shutting down soon. > On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed > executor, return true, indicating the driver knows about it, thereby > preventing re-registration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
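The proposed driver-side cache can be sketched as follows. This is a minimal Python illustration of the idea, not Spark's implementation; the class, method names, and TTL-based expiry are assumptions for the sketch (Spark would use its own cache and messaging types).

```python
import time

class RecentlyRemovedExecutors:
    """Sketch of the proposed fix: remember executors the driver removed
    for a bounded window, and ignore block-manager registrations from them."""
    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._removed = {}  # executor_id -> removal timestamp

    def mark_removed(self, executor_id):
        self._removed[executor_id] = self._clock()

    def was_recently_removed(self, executor_id):
        ts = self._removed.get(executor_id)
        if ts is None:
            return False
        if self._clock() - ts > self._ttl:
            del self._removed[executor_id]  # lazy expiry after the TTL
            return False
        return True

    def register_block_manager(self, executor_id):
        # None signals "registration ignored": the executor is shutting down.
        if self.was_recently_removed(executor_id):
            return None
        return f"bm-{executor_id}"
```

A heartbeat handler would apply the same check, answering "known" for recently removed executors so they do not re-register.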
[jira] [Commented] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down
[ https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400577#comment-17400577 ] Apache Spark commented on SPARK-34949: -- User 'sumeetgajjar' has created a pull request for this issue: https://github.com/apache/spark/pull/33770 > Executor.reportHeartBeat reregisters blockManager even when Executor is > shutting down > - > > Key: SPARK-34949 > URL: https://issues.apache.org/jira/browse/SPARK-34949 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.1, 3.2.0 > Environment: Resource Manager: K8s >Reporter: Sumeet >Assignee: Sumeet >Priority: Major > Labels: Executor, heartbeat > Fix For: 3.1.2, 3.2.0 > > > *Problem:* > I was testing Dynamic Allocation on K8s with about 300 executors. While doing > so, when the executors were torn down due to > "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor > pods being removed from K8s, however, under the "Executors" tab in SparkUI, I > could see some executors listed as alive. > [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100] > also returned a value greater than 1. > > *Cause:* > * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on a > "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the > "listenerBus" > * "CoarseGrainedExecutorBackend" starts the executor shutdown > * "HeartbeatReceiver" picks the "SparkListenerExecutorRemoved" event and > removes the executor from "executorLastSeen" > * In the meantime, the executor reports a Heartbeat. 
Now "HeartbeatReceiver" > cannot find the "executorId" in "executorLastSeen" and hence responds with > "HeartbeatResponse(reregisterBlockManager = true)" > * The Executor now calls "env.blockManager.reregister()" and reregisters > itself thus creating inconsistency > > *Proposed Solution:* > The "reportHeartBeat" method is not aware of the fact that Executor is > shutting down, it should check "executorShutdown" before reregistering. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
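The fix direction described above — have the heartbeat path consult a shutdown flag before honoring a re-registration request — can be shown with a tiny sketch. The field and method names are illustrative, not Spark's actual `Executor` members.

```python
class ExecutorSketch:
    """Illustration of the SPARK-34949 fix idea: reportHeartBeat should
    check a shutdown flag before re-registering the block manager."""
    def __init__(self):
        self.executor_shutdown = False  # set when shutdown begins
        self.reregistered = False

    def handle_heartbeat_response(self, reregister_block_manager):
        # Only re-register if we are NOT already shutting down; otherwise a
        # dying executor would reappear as alive in the driver's state.
        if reregister_block_manager and not self.executor_shutdown:
            self.reregistered = True  # stands in for env.blockManager.reregister()
        return self.reregistered
```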
[jira] [Assigned] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
[ https://issues.apache.org/jira/browse/SPARK-36536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36536: Assignee: Apache Spark (was: Max Gekk) > Split the JSON/CSV option of datetime format to in read and in write > > > Key: SPARK-36536 > URL: https://issues.apache.org/jira/browse/SPARK-36536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > This is a follow up of https://issues.apache.org/jira/browse/SPARK-36418. > Need to split JSON and CSV options *dateFormat* and *timestampFormat*. In > write, should be the same but in read the option shouldn't be set to a > default value. In this way, DateFormatter and TimestampFormatter will use the > CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
[ https://issues.apache.org/jira/browse/SPARK-36536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36536: Assignee: Max Gekk (was: Apache Spark) > Split the JSON/CSV option of datetime format to in read and in write > > > Key: SPARK-36536 > URL: https://issues.apache.org/jira/browse/SPARK-36536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > This is a follow up of https://issues.apache.org/jira/browse/SPARK-36418. > Need to split JSON and CSV options *dateFormat* and *timestampFormat*. In > write, should be the same but in read the option shouldn't be set to a > default value. In this way, DateFormatter and TimestampFormatter will use the > CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
[ https://issues.apache.org/jira/browse/SPARK-36536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400574#comment-17400574 ] Apache Spark commented on SPARK-36536: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33769 > Split the JSON/CSV option of datetime format to in read and in write > > > Key: SPARK-36536 > URL: https://issues.apache.org/jira/browse/SPARK-36536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > This is a follow up of https://issues.apache.org/jira/browse/SPARK-36418. > Need to split JSON and CSV options *dateFormat* and *timestampFormat*. In > write, should be the same but in read the option shouldn't be set to a > default value. In this way, DateFormatter and TimestampFormatter will use the > CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
Max Gekk created SPARK-36536: Summary: Split the JSON/CSV option of datetime format to in read and in write Key: SPARK-36536 URL: https://issues.apache.org/jira/browse/SPARK-36536 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk Assignee: Max Gekk This is a follow up of https://issues.apache.org/jira/browse/SPARK-36418. Need to split JSON and CSV options *dateFormat* and *timestampFormat*. In write, should be the same but in read the option shouldn't be set to a default value. In this way, DateFormatter and TimestampFormatter will use the CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
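The read/write asymmetry described in the ticket can be sketched outside Spark. This Python illustration uses `strftime`/`strptime` codes as stand-ins for Spark's pattern letters; the fallback list approximating CAST-style lenient parsing is an assumption for the sketch, not Spark's actual logic.

```python
from datetime import datetime, date

def write_date(d, fmt=None):
    """On write, a default pattern is always applied (yyyy-MM-dd analogue)."""
    return d.strftime(fmt if fmt is not None else "%Y-%m-%d")

def read_date(s, fmt=None):
    """On read, no default pattern: without an explicit format, fall back to
    lenient, CAST-like parsing instead of forcing a single fixed pattern."""
    if fmt is not None:
        return datetime.strptime(s, fmt).date()
    for candidate in ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y/%m/%d"):
        try:
            return datetime.strptime(s, candidate).date()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {s}")
```

Splitting the option this way keeps output stable while letting readers accept the wider range of inputs that CAST already handles.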
[jira] [Commented] (SPARK-36370) Avoid using SelectionMixin._builtin_table which is removed in pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400544#comment-17400544 ] Apache Spark commented on SPARK-36370: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/33768 > Avoid using SelectionMixin._builtin_table which is removed in pandas 1.3 > > > Key: SPARK-36370 > URL: https://issues.apache.org/jira/browse/SPARK-36370 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36387) Fix Series.astype from datetime to nullable string
[ https://issues.apache.org/jira/browse/SPARK-36387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36387. --- Fix Version/s: 3.3.0 Assignee: Haejoon Lee Resolution: Fixed Issue resolved by pull request 33735 https://github.com/apache/spark/pull/33735 > Fix Series.astype from datetime to nullable string > -- > > Key: SPARK-36387 > URL: https://issues.apache.org/jira/browse/SPARK-36387 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.3.0 > > Attachments: image-2021-08-12-14-24-31-321.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36535) refine the sql reference doc
[ https://issues.apache.org/jira/browse/SPARK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36535: Assignee: (was: Apache Spark) > refine the sql reference doc > > > Key: SPARK-36535 > URL: https://issues.apache.org/jira/browse/SPARK-36535 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36535) refine the sql reference doc
[ https://issues.apache.org/jira/browse/SPARK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400499#comment-17400499 ] Apache Spark commented on SPARK-36535: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/33767 > refine the sql reference doc > > > Key: SPARK-36535 > URL: https://issues.apache.org/jira/browse/SPARK-36535 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36535) refine the sql reference doc
[ https://issues.apache.org/jira/browse/SPARK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36535: Assignee: Apache Spark > refine the sql reference doc > > > Key: SPARK-36535 > URL: https://issues.apache.org/jira/browse/SPARK-36535 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36535) refine the sql reference doc
Wenchen Fan created SPARK-36535: --- Summary: refine the sql reference doc Key: SPARK-36535 URL: https://issues.apache.org/jira/browse/SPARK-36535 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27442) ParquetFileFormat fails to read column named with invalid characters
[ https://issues.apache.org/jira/browse/SPARK-27442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400478#comment-17400478 ] Dror Speiser commented on SPARK-27442: -- Hey, I'm going over the parquet format specification (github page and thrift file), and I don't see any mention of valid or invalid characters for field names in schema elements. Was this a restriction in earlier format specifications? > ParquetFileFormat fails to read column named with invalid characters > > > Key: SPARK-27442 > URL: https://issues.apache.org/jira/browse/SPARK-27442 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0, 2.4.1 >Reporter: Jan Vršovský >Priority: Minor > > When reading a parquet file which contains characters considered invalid, the > reader fails with exception: > Name: org.apache.spark.sql.AnalysisException > Message: Attribute name "..." contains invalid character(s) among " > ,;{}()\n\t=". Please use alias to rename it. > Spark should not be able to write such files, but it should be able to read > it (and allow the user to correct it). However, possible workarounds (such as > using alias to rename the column, or forcing another schema) do not work, > since the check is done on the input. > (Possible fix: remove superficial > {{ParquetWriteSupport.setSchema(requiredSchema, hadoopConf)}} from > {{buildReaderWithPartitionValues}} ?) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
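The character set in the error message quoted above (" ,;{}()\n\t=") is the whole restriction being discussed. A minimal pure-Python sketch of that validation — the function name and structure are illustrative, not Spark's actual implementation (which performs this check while setting up the Parquet write support):

```python
# Characters rejected by Spark's Parquet attribute-name check, taken verbatim
# from the AnalysisException message quoted in the issue above.
INVALID_CHARS = set(" ,;{}()\n\t=")

def column_name_is_valid(name: str) -> bool:
    """Return True if the column name contains none of the rejected characters."""
    return not any(ch in INVALID_CHARS for ch in name)

assert column_name_is_valid("price_usd")
assert not column_name_is_valid("price (usd)")  # space and parentheses are rejected
```

The bug report's point is that this check runs on the *read* path too, so a file written by another system with such a column name cannot even be opened to rename the offending column.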
[jira] [Commented] (SPARK-36352) Spark should check result plan's output schema name
[ https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400445#comment-17400445 ] Apache Spark commented on SPARK-36352: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33764 > Spark should check result plan's output schema name > --- > > Key: SPARK-36352 > URL: https://issues.apache.org/jira/browse/SPARK-36352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36052) Introduce pending pod limit for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-36052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36052: -- Labels: releasenotes (was: ) > Introduce pending pod limit for Spark on K8s > > > Key: SPARK-36052 > URL: https://issues.apache.org/jira/browse/SPARK-36052 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Labels: releasenotes > Fix For: 3.2.0, 3.3.0 > > > Introduce a new configuration to limit the number of pending PODs for Spark > on K8S as the K8S scheduler could be overloaded with requests which slows > down the resource allocations (especially in case of dynamic allocation). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
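The idea behind the new configuration can be sketched in a few lines: when deciding how many new executor pods to request, cap the request by the remaining headroom under the pending-pod limit instead of asking for everything at once. This is an illustrative model only (function and parameter names are hypothetical, not Spark's internals):

```python
def executors_to_request(target_total: int, running: int, pending: int,
                         max_pending: int) -> int:
    """Illustrative sketch: request only as many new executor pods as fit
    under the pending-pod cap, so the K8s scheduler is not flooded with
    requests during aggressive (e.g. dynamic-allocation) scale-up."""
    wanted = max(0, target_total - running - pending)
    headroom = max(0, max_pending - pending)
    return min(wanted, headroom)

# With 2 running, 3 already pending, and a cap of 5 pending pods,
# only 2 of the 5 still-wanted executors are requested this round.
assert executors_to_request(10, 2, 3, 5) == 2
```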
[jira] [Commented] (SPARK-23693) SQL function uuid()
[ https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400411#comment-17400411 ] Jean Georges Perrin commented on SPARK-23693: - [~rxin] - You could require a parameter to the function this should make it deterministic. > SQL function uuid() > --- > > Key: SPARK-23693 > URL: https://issues.apache.org/jira/browse/SPARK-23693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Arseniy Tashoyan >Priority: Minor > > Add function uuid() to org.apache.spark.sql.functions that returns > [Universally Unique > ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. > Sometimes it is necessary to uniquely identify each row in a DataFrame. > Currently the following ways are available: > * monotonically_increasing_id() function > * row_number() function over some window > * convert the DataFrame to RDD and zipWithIndex() > All these approaches do not work when appending this DataFrame to another > DataFrame (union). Collisions may occur - two rows in different DataFrames > may have the same ID. Re-generating IDs on the resulting DataFrame is not an > option, because some data in some other system may already refer to old IDs. > The proposed solution is to add new function: > {code:scala} > def uuid(): Column > {code} > that returns String representation of UUID. > UUID is represented as a 128-bit number (two long numbers). Such numbers are > not supported in Scala or Java. In addition, some storage systems do not > support 128-bit numbers (Parquet's largest numeric type is INT96). This is > the reason for the uuid() function to return String. > I already have a simple implementation based on > [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I > can share it as a PR. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
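The proposal's core claim — random UUID strings stay unique across a union of DataFrames, unlike monotonically_increasing_id() — can be illustrated outside Spark with a plain-Python sketch (the helper name is hypothetical):

```python
import uuid

def add_uuid_column(rows):
    """Attach a random (version 4) UUID string to each row, mirroring the
    proposed uuid() column function; UUIDs are returned as strings because
    128-bit integers are not representable in Scala/Java or in Parquet."""
    return [{**row, "id": str(uuid.uuid4())} for row in rows]

df1 = add_uuid_column([{"v": 1}, {"v": 2}])
df2 = add_uuid_column([{"v": 3}])
union = df1 + df2
# Unlike sequence-based IDs, independently generated UUIDs do not collide
# (up to negligible probability), so the union keeps one distinct ID per row.
assert len({r["id"] for r in union}) == len(union)
```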
[jira] [Commented] (SPARK-36533) Allow streaming queries with Trigger.Once run in multiple batches
[ https://issues.apache.org/jira/browse/SPARK-36533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400382#comment-17400382 ] Apache Spark commented on SPARK-36533: -- User 'bozhang2820' has created a pull request for this issue: https://github.com/apache/spark/pull/33763 > Allow streaming queries with Trigger.Once run in multiple batches > - > > Key: SPARK-36533 > URL: https://issues.apache.org/jira/browse/SPARK-36533 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Bo Zhang >Priority: Major > > Currently streaming queries with Trigger.Once will always load all of the > available data in a single batch. Because of this, the amount of data the > queries can process is limited, or Spark driver will be out of memory. > We should allow streaming queries with Trigger.Once run in multiple batches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36533) Allow streaming queries with Trigger.Once run in multiple batches
[ https://issues.apache.org/jira/browse/SPARK-36533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36533: Assignee: Apache Spark > Allow streaming queries with Trigger.Once run in multiple batches > - > > Key: SPARK-36533 > URL: https://issues.apache.org/jira/browse/SPARK-36533 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Bo Zhang >Assignee: Apache Spark >Priority: Major > > Currently streaming queries with Trigger.Once will always load all of the > available data in a single batch. Because of this, the amount of data the > queries can process is limited, or Spark driver will be out of memory. > We should allow streaming queries with Trigger.Once run in multiple batches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36533) Allow streaming queries with Trigger.Once run in multiple batches
[ https://issues.apache.org/jira/browse/SPARK-36533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36533: Assignee: (was: Apache Spark) > Allow streaming queries with Trigger.Once run in multiple batches > - > > Key: SPARK-36533 > URL: https://issues.apache.org/jira/browse/SPARK-36533 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Bo Zhang >Priority: Major > > Currently streaming queries with Trigger.Once will always load all of the > available data in a single batch. Because of this, the amount of data the > queries can process is limited, or Spark driver will be out of memory. > We should allow streaming queries with Trigger.Once run in multiple batches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
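The feature request above boils down to batch planning: rather than one batch covering all available data (which can exhaust driver memory), split the backlog into bounded batches. A language-agnostic sketch of that splitting, with an illustrative per-batch cap (Spark's actual mechanism would operate on source offsets, not materialized records):

```python
def plan_batches(available_records, max_records_per_batch):
    """Illustrative only: instead of one batch over everything (the current
    Trigger.Once behaviour), cap each batch so no single batch has to cover
    all available data at once."""
    batches = []
    for start in range(0, len(available_records), max_records_per_batch):
        batches.append(available_records[start:start + max_records_per_batch])
    return batches

# 10 backlogged records with a cap of 4 per batch yield three batches.
assert plan_batches(list(range(10)), 4) == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```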
[jira] [Updated] (SPARK-36493) Skip Retrieving keytab with SparkFiles.get if keytab found in the CWD of Yarn Container
[ https://issues.apache.org/jira/browse/SPARK-36493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zikun updated SPARK-36493: -- Summary: Skip Retrieving keytab with SparkFiles.get if keytab found in the CWD of Yarn Container (was: SparkFiles.get is not needed for the JDBC keytab provided by the "--files" option) > Skip Retrieving keytab with SparkFiles.get if keytab found in the CWD of Yarn > Container > --- > > Key: SPARK-36493 > URL: https://issues.apache.org/jira/browse/SPARK-36493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.2 >Reporter: Zikun >Priority: Major > Fix For: 3.1.3 > > > Currently we have the logic to deal with the JDBC keytab provided by the > "--files" option > {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} > \{{{}} > {{}}{{val result = SparkFiles.get(keytabParam)}} > {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on > executor: $result")}} > {{}}{{result}} > {{}}} {{else {}} > {{}}{{logDebug("Keytab path found, assuming manual upload")}} > {{}}{{keytabParam}} > {{}}} > Spark has already created the soft link for any file submitted by the > "--files" option. Here is an example. > testusera1.keytab -> > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab > > So there is no need to call the SparkFiles.get to absolute path of the keytab > file. We can directly use the variable `keytabParam` as the keytab file path. > > Moreover, SparkFiles.get will get a wrong path of keytab for the driver in > cluster mode. 
In cluster mode, the keytab is distributed to the following > location for both the driver and executors > {{/var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/container_1628584679772_0030_01_01/testusera1.keytab}} > but SparkFiles.get brings the following wrong location for the driver > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
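The branching logic quoted (in Jira markup) in the issue above is: if the keytab parameter has no directory component, assume it was shipped via --files and resolve it through SparkFiles.get; otherwise use it verbatim. A pure-Python sketch of that decision, with `spark_files_get` standing in for SparkFiles.get and `os.path.dirname` approximating FilenameUtils.getPath:

```python
import os

def resolve_keytab(keytab_param, spark_files_get):
    """Sketch of the current logic: a bare file name is assumed to come from
    --files and is resolved through SparkFiles.get; a path with a directory
    component is assumed to be a manual upload and is used as-is.
    The issue argues the first branch is unnecessary (and wrong for the driver
    in cluster mode) when the keytab is already linked into the container CWD."""
    if keytab_param is not None and os.path.dirname(keytab_param) == "":
        return spark_files_get(keytab_param)
    return keytab_param

# Bare name -> resolved; absolute path -> untouched.
assert resolve_keytab("user.keytab", lambda n: "/tmp/files/" + n) == "/tmp/files/user.keytab"
assert resolve_keytab("/etc/sec/user.keytab", lambda n: "/tmp/files/" + n) == "/etc/sec/user.keytab"
```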
[jira] [Updated] (SPARK-36493) SparkFiles.get is not needed for the JDBC keytab provided by the "--files" option
[ https://issues.apache.org/jira/browse/SPARK-36493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zikun updated SPARK-36493: -- Description: Currently we have the logic to deal with the JDBC keytab provided by the "--files" option {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} \{{{}} {{}}{{val result = SparkFiles.get(keytabParam)}} {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result")}} {{}}{{result}} {{}}} {{else {}} {{}}{{logDebug("Keytab path found, assuming manual upload")}} {{}}{{keytabParam}} {{}}} Spark has already created the soft link for any file submitted by the "--files" option. Here is an example. testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab So there is no need to call the SparkFiles.get to absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path. Moreover, SparkFiles.get will get a wrong path of keytab for the driver in cluster mode. 
In cluster mode, the keytab is distributed to the following location for both the driver and executors {{/var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/container_1628584679772_0030_01_01/testusera1.keytab}} but SparkFiles.get brings the following wrong location for the driver /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab was: Currently we have the logic to deal with the JDBC keytab provided by the "--files" option {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} {{{}} {{}}{{val result = SparkFiles.get(keytabParam)}} {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result")}} {{}}{{result}} {{}}} {{else {}} {{}}{{logDebug("Keytab path found, assuming manual upload")}} {{}}{{keytabParam}} {{}}} Spark has already created the soft link for any file submitted by the "--files" option. Here is an example. testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab So there is no need to call the SparkFiles.get to absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path. Moreover, SparkFiles.get will get a wrong path of keytab for the driver in cluster mode. 
In cluster mode, the keytab is distributed to the following location for both the driver and executors /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab but SparkFiles.get brings the following wrong location for the driver /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab > SparkFiles.get is not needed for the JDBC keytab provided by the "--files" > option > - > > Key: SPARK-36493 > URL: https://issues.apache.org/jira/browse/SPARK-36493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.2 >Reporter: Zikun >Priority: Major > Fix For: 3.1.3 > > > Currently we have the logic to deal with the JDBC keytab provided by the > "--files" option > {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} > \{{{}} > {{}}{{val result = SparkFiles.get(keytabParam)}} > {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on > executor: $result")}} > {{}}{{result}} > {{}}} {{else {}} > {{}}{{logDebug("Keytab path found, assuming manual upload")}} > {{}}{{keytabParam}} > {{}}} > Spark has already created the soft link for any file submitted by the > "--files" option. Here is an example. > testusera1.keytab -> > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab > > So there is no need to call the SparkFiles.get to absolute path of the keytab > file. We can directly use the variable `keytabParam` as the keytab file path. > > Moreover, SparkFiles.get will get a wrong path of keytab for the driver in > cluster mode. 
In cluster mode, the keytab is distributed to the following > location for both the driver and executors > {{/var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/container_1628584679772_0030_01_01/testusera1.keytab}} > but SparkFiles.get brings the following wrong loc
[jira] [Commented] (SPARK-35028) ANSI mode: disallow group by aliases
[ https://issues.apache.org/jira/browse/SPARK-35028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400357#comment-17400357 ] Gengliang Wang commented on SPARK-35028: This is reverted in https://github.com/apache/spark/pull/33758 > ANSI mode: disallow group by aliases > > > Key: SPARK-35028 > URL: https://issues.apache.org/jira/browse/SPARK-35028 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0 > > > As per the ANSI SQL standard secion 7.12 : > bq. Each shall unambiguously reference a column > of the table resulting from the . A column referenced in a > is a grouping column. > By forbidding it, we can avoid ambiguous SQL queries like: > SELECT col + 1 as col FROM t GROUP BY col -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36493) SparkFiles.get is not needed for the JDBC keytab provided by the "--files" option
[ https://issues.apache.org/jira/browse/SPARK-36493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zikun updated SPARK-36493: -- Description: Currently we have the logic to deal with the JDBC keytab provided by the "--files" option {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} {{{}} {{}}{{val result = SparkFiles.get(keytabParam)}} {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result")}} {{}}{{result}} {{}}} {{else {}} {{}}{{logDebug("Keytab path found, assuming manual upload")}} {{}}{{keytabParam}} {{}}} Spark has already created the soft link for any file submitted by the "--files" option. Here is an example. testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab So there is no need to call the SparkFiles.get to absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path. Moreover, SparkFiles.get will get a wrong path of keytab for the driver in cluster mode. In cluster mode, the keytab is distributed to the following location for both the driver and executors /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab but SparkFiles.get brings the following wrong location for the driver /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab was: Currently we have the logic to deal with the JDBC keytab provided by the "--files" option if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty) { val result = SparkFiles.get(keytabParam) logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result") result } Spark has already created the soft link for any file submitted by the "--files" option. 
Here is an example. testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab So there is no need to call the SparkFiles.get to absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path. Moreover, SparkFiles.get will get a wrong path of keytab. In a running Spark cluster, the keytab is distributed to the following location /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab but SparkFiles.get brings the following wrong location /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab > SparkFiles.get is not needed for the JDBC keytab provided by the "--files" > option > - > > Key: SPARK-36493 > URL: https://issues.apache.org/jira/browse/SPARK-36493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.2 >Reporter: Zikun >Priority: Major > Fix For: 3.1.3 > > > Currently we have the logic to deal with the JDBC keytab provided by the > "--files" option > {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} > {{{}} > {{}}{{val result = SparkFiles.get(keytabParam)}} > {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on > executor: $result")}} > {{}}{{result}} > {{}}} {{else {}} > {{}}{{logDebug("Keytab path found, assuming manual upload")}} > {{}}{{keytabParam}} > {{}}} > Spark has already created the soft link for any file submitted by the > "--files" option. Here is an example. > testusera1.keytab -> > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab > > So there is no need to call the SparkFiles.get to absolute path of the keytab > file. 
We can directly use the variable `keytabParam` as the keytab file path. > > Moreover, SparkFiles.get will get a wrong path of keytab for the driver in > cluster mode. In cluster mode, the keytab is distributed to the following > location for both the driver and executors > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab > but SparkFiles.get brings the following wrong location for the driver > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab > > -- This message was sen
[jira] [Updated] (SPARK-36534) No way to check If Spark Session is created successfully with No Exceptions and ready to execute Tasks
[ https://issues.apache.org/jira/browse/SPARK-36534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jahar updated SPARK-36534: -- Description: I am running Spark on Kubernetes in Client mode. Spark driver is spawned programmatically (No Spark-Submit). Below is the dummy code to set SparkSession with KubeApiServer as Master. {code:java} // code placeholder private static SparkSession getSparkSession() { mySparkSessionBuilder = SparkSession.builder() .master("k8s://http://:6443") .appName("spark-K8sDemo") .config("spark.kubernetes.container.image","spark:3.0") .appName("spark-K8sDemo") .config("spark.jars", "/tmp/jt/database-0.0.1-SNAPSHOT-jar-with-dependencies.jar") .config("spark.kubernetes.executor.podTemplateFile","/tmp/jt/sparkExecutorPodTemplate.yaml") .config("spark.kubernetes.container.image.pullPolicy","Always") .config("spark.kubernetes.namespace","my_namespace") .config("spark.driver.host", "spark-driver-example") .config("spark.driver.port", "29413") .config("spark.kubernetes.authenticate.driver.serviceAccountName","spark") .config("spark.extraListeners","K8sPoc.MyHealthCheckListener"); setAditionalConfig(); mySession= mySparkSessionBuilder.getOrCreate(); return mySession; } {code} Now the problem is that, in certain scenarios like if K8s master is not reachable or master URL is incorrect or spark.kubernetes.container.image config is missing then it throws below exceptions (*Exception 1* and *Exception 2* given below). These exceptions are never propagated to Spark Driver program which in turn makes Spark Application in stuck state forever. There should be a way to know via SparkSession or SparkContext object if Session was created successful without any such exceptions and can run SparkTasks?? 
I have looked at SparkSession, SparkContext API documentation and SparkListeners but didn't find any such way to check if SparkSession is ready to run the Tasks or if not then dont keep the Spark Application in hanging state rather return a proper error/warn message to calling API. *Exception 1: (If _spark.kubernetes.container.image_ config is missing:* {code:java} 21/08/16 16:27:07 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 21/08/16 16:27:07 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes. 21/08/16 16:27:07 ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 org.apache.spark.SparkException: Must specify the executor container image at org.apache.spark.deploy.k8s.features.BasicExecutorFeatureStep.$anonfun$executorContainerImage$1(BasicExecutorFeatureStep.scala:41) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.deploy.k8s.features.BasicExecutorFeatureStep.(BasicExecutorFeatureStep.scala:41) at org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:43) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$16(ExecutorPodsAllocator.scala:216) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:208) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1(ExecutorPodsAllocator.scala:82) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1$adapted(ExecutorPodsAllocator.scala:82) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$callSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:110) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber(ExecutorPodsSnapshotsStoreImpl.scala:107) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:71) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} {noformat} {noformat} *Exception 2: (If _K8s master_ is not reachable or w
[jira] [Created] (SPARK-36534) No way to check If Spark Session is created successfully with No Exceptions and ready to execute Tasks
Jahar created SPARK-36534: - Summary: No way to check If Spark Session is created successfully with No Exceptions and ready to execute Tasks Key: SPARK-36534 URL: https://issues.apache.org/jira/browse/SPARK-36534 Project: Spark Issue Type: Improvement Components: Java API, Kubernetes, Scheduler Affects Versions: 3.0.1 Environment: *Spark 3.0.1* Reporter: Jahar I am running Spark on Kubernetes in Client mode. Spark driver is spawned programmatically (No Spark-Submit). Below is the dummy code to set SparkSession with KubeApiServer as Master. {code:java} // code placeholder private static SparkSession getSparkSession() { mySparkSessionBuilder = SparkSession.builder() .master("k8s://http://:6443") .appName("spark-K8sDemo") .config("spark.kubernetes.container.image","spark:3.0") .appName("spark-K8sDemo") .config("spark.jars", "/tmp/jt/database-0.0.1-SNAPSHOT-jar-with-dependencies.jar") .config("spark.kubernetes.executor.podTemplateFile","/tmp/jt/sparkExecutorPodTemplate.yaml") .config("spark.kubernetes.container.image.pullPolicy","Always") .config("spark.kubernetes.namespace","my_namespace") .config("spark.driver.host", "spark-driver-example") .config("spark.driver.port", "29413") .config("spark.kubernetes.authenticate.driver.serviceAccountName","spark") .config("spark.extraListeners","K8sPoc.MyHealthCheckListener"); setAditionalConfig(); mySession= mySparkSessionBuilder.getOrCreate(); return mySession; } {code} Now the problem is that, in certain scenarios like if K8s master is not reachable or master URL is incorrect or spark.kubernetes.container.image config is missing then it throws below exceptions (*Exception 1* and *Exception 2* given below). These exceptions are never propagated to Spark Driver program which in turn makes Spark Application in stuck state forever. There should be a way to know via SparkSession or SparkContext object if Session was created successful without any such exceptions and can run SparkTasks?? 
I have looked at SparkSession, SparkContext API documentation and SparkListeners but didn't find any such way to check if SparkSession is ready to run the Tasks or if not then dont keep the Spark Application in hanging state rather return a proper error/warn message to calling API. *Exception 1: (If _spark.kubernetes.container.image_ config is missing:* {noformat} {noformat} {noformat} 21/08/16 16:27:07 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 21/08/16 16:27:07 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes. 21/08/16 16:27:07 ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 org.apache.spark.SparkException: Must specify the executor container image at org.apache.spark.deploy.k8s.features.BasicExecutorFeatureStep.$anonfun$executorContainerImage$1(BasicExecutorFeatureStep.scala:41) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.deploy.k8s.features.BasicExecutorFeatureStep.(BasicExecutorFeatureStep.scala:41) at org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:43) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$16(ExecutorPodsAllocator.scala:216) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:208) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1(ExecutorPodsAllocator.scala:82) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1$adapted(ExecutorPodsAllocator.scala:82) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$callSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:110) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber(ExecutorPodsSnapshotsStoreImpl.scala:107)
 at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:71)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Schedul
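A readiness check along the lines the reporter asks for can be approximated today by polling from the driver until executors register, failing fast instead of hanging. The sketch below is illustrative and self-contained, not an existing Spark API: `awaitExecutors` and its parameters are assumed names, and in practice the count source would be something like `session.sparkContext().statusTracker().getExecutorInfos().length`.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.IntSupplier;

public class ExecutorWait {
    // Hypothetical helper (not a Spark API): poll an executor-count source,
    // e.g. () -> sc.statusTracker().getExecutorInfos().length, and fail fast
    // with a TimeoutException instead of letting the application hang forever.
    public static void awaitExecutors(IntSupplier executorCount, int minExecutors,
                                      long timeoutMs, long pollMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (executorCount.getAsInt() < minExecutors) {
            if (System.currentTimeMillis() > deadline) {
                throw new TimeoutException(
                    minExecutors + " executors did not register within " + timeoutMs + " ms");
            }
            Thread.sleep(pollMs);
        }
    }
}
```

Calling this right after `getOrCreate()` would surface the "no executors ever came up" scenarios described above as an ordinary exception in the driver program.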
[jira] [Commented] (SPARK-36379) Null at root level of a JSON array causes the parsing failure (w/ permissive mode)
[ https://issues.apache.org/jira/browse/SPARK-36379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400314#comment-17400314 ] Apache Spark commented on SPARK-36379: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33762 > Null at root level of a JSON array causes the parsing failure (w/ permissive > mode) > -- > > Key: SPARK-36379 > URL: https://issues.apache.org/jira/browse/SPARK-36379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.2.0, 3.3.0 > > > {code} > scala> spark.read.json(Seq("""[{"a": "str"}, null, {"a": > "str"}]""").toDS).collect() > {code} > {code} > ... > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1) (172.30.3.20 executor driver): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > {code} > Since the mode (by default) is permissive, we shouldn't just fail like above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
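The permissive mode the report refers to is meant to tolerate malformed records rather than crash. A minimal sketch of that contract (illustrative only, not Spark's actual JSON parser): a null element at the root of the array should become an empty/null row, never a NullPointerException.

```java
import java.util.*;

public class PermissiveRows {
    // Illustrative model of permissive parsing: map each raw record to a row,
    // turning null roots into empty rows instead of dereferencing them.
    public static List<Map<String, String>> toRows(List<Map<String, String>> rawRecords) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (Map<String, String> record : rawRecords) {
            rows.add(record == null ? Collections.<String, String>emptyMap() : record);
        }
        return rows;
    }
}
```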
[jira] [Updated] (SPARK-35426) When addMergerLocation exceeds maxRetainedMergerLocations, we should remove the merger based on merged shuffle data size.
[ https://issues.apache.org/jira/browse/SPARK-35426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated SPARK-35426: --- Description: Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just remove the oldest merger; we should instead remove the merger based on merged shuffle data size. We should remove the mergers with the largest amount of merged shuffle data, so that the remaining mergers have potentially more disk space to store new merged shuffle data. was: Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just remove the oldest merger; we should instead remove the merger based on merged shuffle data size. > When addMergerLocation exceeds maxRetainedMergerLocations, we should > remove the merger based on merged shuffle data size. > - > > Key: SPARK-35426 > URL: https://issues.apache.org/jira/browse/SPARK-35426 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Qi Zhu >Priority: Major > > Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just > remove the oldest merger; we should instead remove the merger based on merged > shuffle data size. > We should remove the mergers with the largest amount of merged shuffle data, so > that the remaining mergers have potentially more disk space to store new > merged shuffle data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35426) When addMergerLocation exceeds maxRetainedMergerLocations, we should remove the merger based on merged shuffle data size.
[ https://issues.apache.org/jira/browse/SPARK-35426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated SPARK-35426: --- Description: Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just remove the oldest merger; we should instead remove the merger based on merged shuffle data size. was: Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just remove the oldest merger; we should instead remove the merger based on merged shuffle data size. The oldest merger may hold a large amount of merged shuffle data, so evicting it is not necessarily a good choice. > When addMergerLocation exceeds maxRetainedMergerLocations, we should > remove the merger based on merged shuffle data size. > - > > Key: SPARK-35426 > URL: https://issues.apache.org/jira/browse/SPARK-35426 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Qi Zhu >Priority: Major > > Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just > remove the oldest merger; we should instead remove the merger based on merged > shuffle data size. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
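The proposed size-based eviction policy can be sketched as a small, self-contained model. This is illustrative only, not Spark's push-based shuffle code; the class and method names are assumptions. On overflow it evicts the merger currently holding the most merged shuffle data, so the survivors have the most free disk for new merged data.

```java
import java.util.*;

public class MergerTracker {
    private final int maxRetained;
    private final Map<String, Long> mergedBytesByHost = new LinkedHashMap<>();

    public MergerTracker(int maxRetained) { this.maxRetained = maxRetained; }

    // Record newly merged shuffle bytes for a known merger location.
    public void recordMergedBytes(String host, long bytes) {
        mergedBytesByHost.merge(host, bytes, Long::sum);
    }

    // Add a merger; if the retained set overflows maxRetained, evict the
    // merger holding the most merged data instead of simply the oldest one.
    public void addMergerLocation(String host) {
        mergedBytesByHost.putIfAbsent(host, 0L);
        if (mergedBytesByHost.size() > maxRetained) {
            String victim = Collections.max(
                mergedBytesByHost.entrySet(), Map.Entry.comparingByValue()).getKey();
            mergedBytesByHost.remove(victim);
        }
    }

    public Set<String> mergers() { return mergedBytesByHost.keySet(); }
}
```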
[jira] [Created] (SPARK-36533) Allow streaming queries with Trigger.Once to run in multiple batches
Bo Zhang created SPARK-36533: Summary: Allow streaming queries with Trigger.Once to run in multiple batches Key: SPARK-36533 URL: https://issues.apache.org/jira/browse/SPARK-36533 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 3.2.0 Reporter: Bo Zhang Currently, streaming queries with Trigger.Once always load all of the available data in a single batch. Because of this, the amount of data a query can process is limited, or the Spark driver may run out of memory. We should allow streaming queries with Trigger.Once to run in multiple batches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
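The requested behavior amounts to planning several bounded batches over the available data instead of one unbounded batch. The sketch below is an illustrative model of that planning step only, not Spark's micro-batch engine; the class name and the flat offset representation are assumptions.

```java
import java.util.*;

public class BatchPlanner {
    // Split the available offset range [startOffset, endOffset) into batches
    // of at most maxPerBatch records, as a run-once query could execute them
    // one after another before terminating, bounding per-batch memory use.
    public static List<long[]> planBatches(long startOffset, long endOffset, long maxPerBatch) {
        List<long[]> batches = new ArrayList<>();
        for (long lo = startOffset; lo < endOffset; lo += maxPerBatch) {
            batches.add(new long[]{lo, Math.min(lo + maxPerBatch, endOffset)});
        }
        return batches;
    }
}
```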
[jira] [Resolved] (SPARK-36524) Add common class/trait for ANSI interval types
[ https://issues.apache.org/jira/browse/SPARK-36524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-36524. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33753 [https://github.com/apache/spark/pull/33753] > Add common class/trait for ANSI interval types > -- > > Key: SPARK-36524 > URL: https://issues.apache.org/jira/browse/SPARK-36524 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.2.0 > > > Currently, there are many places where we check both YearMonthIntervalType > and DayTimeIntervalType in the same match case, like > {code:scala} > case _: YearMonthIntervalType | _: DayTimeIntervalType => false > {code} > We need to add a new trait or abstract class that is extended by both > YearMonthIntervalType and DayTimeIntervalType, so that we can transform the code > above to: > {code:scala} > case _: AnsiIntervalType => false > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
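The refactor described above is the standard marker-supertype pattern. A minimal Java analogue (illustrative only; Spark's actual types are Scala classes in org.apache.spark.sql.types):

```java
public class IntervalTypes {
    // Common supertype, analogous to the proposed AnsiIntervalType trait.
    abstract static class DataType {}
    abstract static class AnsiIntervalType extends DataType {}
    static final class YearMonthIntervalType extends AnsiIntervalType {}
    static final class DayTimeIntervalType extends AnsiIntervalType {}
    static final class StringType extends DataType {}

    // One supertype check replaces matching both interval types explicitly.
    static boolean isAnsiInterval(DataType t) { return t instanceof AnsiIntervalType; }
}
```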
[jira] [Assigned] (SPARK-36518) Spark should support distributing directories to the cluster
[ https://issues.apache.org/jira/browse/SPARK-36518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36518: Assignee: Apache Spark > Spark should support distributing directories to the cluster > > > Key: SPARK-36518 > URL: https://issues.apache.org/jira/browse/SPARK-36518 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.1.0, 3.1.1, 3.1.2 >Reporter: YuanGuanhu >Assignee: Apache Spark >Priority: Major > > Spark currently only supports distributing files to the cluster, but in some > scenarios we need to upload a directory to the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36518) Spark should support distributing directories to the cluster
[ https://issues.apache.org/jira/browse/SPARK-36518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36518: Assignee: (was: Apache Spark) > Spark should support distributing directories to the cluster > > > Key: SPARK-36518 > URL: https://issues.apache.org/jira/browse/SPARK-36518 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.1.0, 3.1.1, 3.1.2 >Reporter: YuanGuanhu >Priority: Major > > Spark currently only supports distributing files to the cluster, but in some > scenarios we need to upload a directory to the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36518) Spark should support distributing directories to the cluster
[ https://issues.apache.org/jira/browse/SPARK-36518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400231#comment-17400231 ] Apache Spark commented on SPARK-36518: -- User 'fhygh' has created a pull request for this issue: https://github.com/apache/spark/pull/33760 > Spark should support distributing directories to the cluster > > > Key: SPARK-36518 > URL: https://issues.apache.org/jira/browse/SPARK-36518 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.1.0, 3.1.1, 3.1.2 >Reporter: YuanGuanhu >Priority: Major > > Spark currently only supports distributing files to the cluster, but in some > scenarios we need to upload a directory to the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
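Until directories are supported directly, a common workaround is to archive the directory and ship the single archive instead of the tree. A self-contained sketch of the archiving step; the helper name is illustrative, and how the archive is then distributed and unpacked is left to the deployment:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class DirZipper {
    // Zip a directory tree into one file that file-only distribution
    // mechanisms can ship; entries keep their paths relative to the root.
    public static void zipDirectory(Path dir, Path zipFile) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile));
             Stream<Path> paths = Files.walk(dir)) {
            for (Path p : (Iterable<Path>) paths.filter(Files::isRegularFile)::iterator) {
                zos.putNextEntry(new ZipEntry(dir.relativize(p).toString().replace('\\', '/')));
                Files.copy(p, zos);
                zos.closeEntry();
            }
        }
    }
}
```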
[jira] [Assigned] (SPARK-36532) Deadlock in CoarseGrainedExecutorBackend.onDisconnected
[ https://issues.apache.org/jira/browse/SPARK-36532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36532: Assignee: (was: Apache Spark) > Deadlock in CoarseGrainedExecutorBackend.onDisconnected > --- > > Key: SPARK-36532 > URL: https://issues.apache.org/jira/browse/SPARK-36532 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > The deadlock has exactly the same root cause as SPARK-14180 but happens > in a different code path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36532) Deadlock in CoarseGrainedExecutorBackend.onDisconnected
[ https://issues.apache.org/jira/browse/SPARK-36532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36532: Assignee: Apache Spark > Deadlock in CoarseGrainedExecutorBackend.onDisconnected > --- > > Key: SPARK-36532 > URL: https://issues.apache.org/jira/browse/SPARK-36532 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > The deadlock has exactly the same root cause as SPARK-14180 but happens > in a different code path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36532) Deadlock in CoarseGrainedExecutorBackend.onDisconnected
[ https://issues.apache.org/jira/browse/SPARK-36532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400188#comment-17400188 ] Apache Spark commented on SPARK-36532: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/33759 > Deadlock in CoarseGrainedExecutorBackend.onDisconnected > --- > > Key: SPARK-36532 > URL: https://issues.apache.org/jira/browse/SPARK-36532 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > The deadlock has exactly the same root cause as SPARK-14180 but happens > in a different code path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
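Without reproducing Spark's actual code, the general shape of this class of bug can be shown with a single-threaded dispatcher: a handler running on the loop blocks on work that itself needs the loop thread, so neither can proceed. The sketch below is illustrative only; the names are assumptions, and `deadlockProne` must never actually be invoked (it would hang).

```java
import java.util.concurrent.*;

public class DispatcherDeadlockDemo {
    private final ExecutorService loop = Executors.newSingleThreadExecutor();

    // BROKEN shape (would hang if called): the task running on the single
    // dispatcher thread blocks on another task queued to the same thread,
    // which can never start -- analogous to a handler like onDisconnected
    // synchronously waiting on work that needs the dispatcher itself.
    public int deadlockProne() throws Exception {
        return loop.submit(() -> loop.submit(() -> 42).get()).get();
    }

    // FIXED shape: hand the blocking work to a different thread, so the
    // dispatcher thread never ends up waiting on itself.
    public int fixed() throws Exception {
        Future<Integer> f = loop.submit(() ->
            CompletableFuture.supplyAsync(() -> 42).join());
        return f.get();
    }

    public void shutdown() { loop.shutdownNow(); }
}
```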