[jira] [Commented] (SPARK-36539) trimNonTopLevelAlias should not change StructType inner alias
[ https://issues.apache.org/jira/browse/SPARK-36539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400818#comment-17400818 ]

Apache Spark commented on SPARK-36539:
--------------------------------------

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33778

> trimNonTopLevelAlias should not change StructType inner alias
> -------------------------------------------------------------
>
>                 Key: SPARK-36539
>                 URL: https://issues.apache.org/jira/browse/SPARK-36539
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: angerszhu
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36539) trimNonTopLevelAlias should not change StructType inner alias
[ https://issues.apache.org/jira/browse/SPARK-36539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36539:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-36539) trimNonTopLevelAlias should not change StructType inner alias
[ https://issues.apache.org/jira/browse/SPARK-36539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36539:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-36539) trimNonTopLevelAlias should not change StructType inner alias
[ https://issues.apache.org/jira/browse/SPARK-36539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400816#comment-17400816 ]

Apache Spark commented on SPARK-36539:
--------------------------------------

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33778
[jira] [Created] (SPARK-36539) trimNonTopLevelAlias should not change StructType inner alias
angerszhu created SPARK-36539:
---------------------------------

             Summary: trimNonTopLevelAlias should not change StructType inner alias
                 Key: SPARK-36539
                 URL: https://issues.apache.org/jira/browse/SPARK-36539
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: angerszhu
[jira] [Commented] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400805#comment-17400805 ]

L. C. Hsieh commented on SPARK-36465:
-------------------------------------

Thanks [~Gengliang.Wang]!

> Dynamic gap duration in session window
> --------------------------------------
>
>                 Key: SPARK-36465
>                 URL: https://issues.apache.org/jira/browse/SPARK-36465
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 3.2.0
>            Reporter: L. C. Hsieh
>            Assignee: L. C. Hsieh
>            Priority: Major
>             Fix For: 3.2.0
>
>
> The gap duration used in session windows is currently a static value. To support more complex usage, it is better to support a dynamic gap duration that is determined by looking at the current data.
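[Editor's note] The dynamic gap semantics described in SPARK-36465 can be sketched outside Spark. In Spark itself this is exposed through `session_window` with a column-valued gap expression; the pure-Python sketch below is only an illustration of the semantics (the function and event names are hypothetical, not Spark API):

```python
from datetime import datetime, timedelta

def sessionize(events, gap_for):
    """Group (timestamp, payload) events into session windows, where the
    allowed silence after each event is computed from that event's data."""
    sessions = []
    current = []
    deadline = None  # latest timestamp still belonging to the open session
    for ts, payload in sorted(events):
        if deadline is not None and ts > deadline:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append((ts, payload))
        deadline = ts + gap_for(payload)  # gap depends on the current row
    if current:
        sessions.append(current)
    return sessions

t0 = datetime(2021, 8, 18, 12, 0, 0)
events = [
    (t0, "click"),                             # short gap: 5 s
    (t0 + timedelta(seconds=3), "click"),      # within 5 s -> same session
    (t0 + timedelta(seconds=20), "purchase"),  # new session, long gap: 60 s
    (t0 + timedelta(seconds=50), "click"),     # within 60 s -> same session
]
gaps = {"click": timedelta(seconds=5), "purchase": timedelta(seconds=60)}
result = sessionize(events, lambda payload: gaps[payload])
```

Here a "purchase" extends the session deadline far more than a "click" does, so the last two events land in one session even though they are 30 seconds apart, which a static 5-second gap could not express.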
[jira] [Comment Edited] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400803#comment-17400803 ]

Gengliang Wang edited comment on SPARK-36465 at 8/18/21, 5:51 AM:
------------------------------------------------------------------

[~viirya] [~kabhwan] FYI I converted this one as a sub-task of SPARK-10816.

was (Author: gengliang.wang):
[~viirya][~kabhwan]FYI I converted this one as a sub-task of SPARK-10816.
[jira] [Commented] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400803#comment-17400803 ]

Gengliang Wang commented on SPARK-36465:
----------------------------------------

[~viirya][~kabhwan]FYI I converted this one as a sub-task of SPARK-10816.
[jira] [Updated] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang updated SPARK-36465:
-----------------------------------
        Parent: SPARK-10816
    Issue Type: Sub-task  (was: Improvement)
[jira] [Commented] (SPARK-36538) Environment variables part in config doc isn't properly documented.
[ https://issues.apache.org/jira/browse/SPARK-36538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400791#comment-17400791 ]

Apache Spark commented on SPARK-36538:
--------------------------------------

User 'yutoacts' has created a pull request for this issue:
https://github.com/apache/spark/pull/33777

> Environment variables part in config doc isn't properly documented.
> -------------------------------------------------------------------
>
>                 Key: SPARK-36538
>                 URL: https://issues.apache.org/jira/browse/SPARK-36538
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 3.1.2
>            Reporter: Yuto Akutsu
>            Priority: Major
>
> It says environment variables are not reflected through spark-env.sh in YARN cluster mode but I believe they are. I think this part of the document should be removed.
> [https://spark.apache.org/docs/latest/configuration.html#environment-variables]
[jira] [Assigned] (SPARK-36538) Environment variables part in config doc isn't properly documented.
[ https://issues.apache.org/jira/browse/SPARK-36538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36538:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-36538) Environment variables part in config doc isn't properly documented.
[ https://issues.apache.org/jira/browse/SPARK-36538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36538:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-36386) Fix DataFrame groupby-expanding to follow pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400777#comment-17400777 ]

Apache Spark commented on SPARK-36386:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33776

> Fix DataFrame groupby-expanding to follow pandas 1.3
> ----------------------------------------------------
>
>                 Key: SPARK-36386
>                 URL: https://issues.apache.org/jira/browse/SPARK-36386
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Takuya Ueshin
>            Assignee: Haejoon Lee
>            Priority: Major
>             Fix For: 3.3.0
>
[jira] [Commented] (SPARK-36388) Fix DataFrame groupby-rolling to follow pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400774#comment-17400774 ]

Apache Spark commented on SPARK-36388:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33776

> Fix DataFrame groupby-rolling to follow pandas 1.3
> --------------------------------------------------
>
>                 Key: SPARK-36388
>                 URL: https://issues.apache.org/jira/browse/SPARK-36388
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Takuya Ueshin
>            Assignee: Haejoon Lee
>            Priority: Major
>             Fix For: 3.3.0
>
[jira] [Commented] (SPARK-36386) Fix DataFrame groupby-expanding to follow pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400775#comment-17400775 ]

Apache Spark commented on SPARK-36386:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33776
[jira] [Resolved] (SPARK-36398) Redact sensitive information in Spark Thrift Server log
[ https://issues.apache.org/jira/browse/SPARK-36398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta resolved SPARK-36398.
------------------------------------
    Fix Version/s: 3.1.3
                   3.2.0
         Assignee: Kousuke Saruta
       Resolution: Fixed

> Redact sensitive information in Spark Thrift Server log
> -------------------------------------------------------
>
>                 Key: SPARK-36398
>                 URL: https://issues.apache.org/jira/browse/SPARK-36398
>             Project: Spark
>          Issue Type: Bug
>          Components: Security, SQL
>    Affects Versions: 3.1.2
>            Reporter: Denis Krivenko
>            Assignee: Kousuke Saruta
>            Priority: Major
>             Fix For: 3.2.0, 3.1.3
>
> Spark Thrift Server logs the query without redacting sensitive information in [org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.scala|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L188]
> {code:scala}
> override def runInternal(): Unit = {
>   setState(OperationState.PENDING)
>   logInfo(s"Submitting query '$statement' with $statementId")
> {code}
> Logs:
> {code:sh}
> 21/08/03 20:49:46 INFO SparkExecuteStatementOperation: Submitting query
> 'CREATE OR REPLACE TEMPORARY VIEW test_view
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url="jdbc:mysql://example.com:3306",
>   driver="com.mysql.jdbc.Driver",
>   dbtable="example.test",
>   user="my_username",
>   password="my_password"
> )' with 37e5d2cb-aa96-407e-b589-7cb212324100
> 21/08/03 20:49:46 INFO SparkExecuteStatementOperation: Running query with 37e5d2cb-aa96-407e-b589-7cb212324100
> {code}
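[Editor's note] The fix for SPARK-36398 redacts sensitive values before the statement reaches the log. As a rough illustration of the idea only (not Spark's actual implementation, which is driven by the `spark.sql.redaction.string.regex` configuration), a regex-based redactor could look like this; the key list here is hypothetical:

```python
import re

# Mask the values of options whose key looks sensitive; which keys count
# as sensitive is an assumption for this sketch.
SENSITIVE = re.compile(r'(?i)\b(password|secret|token|user)\s*=\s*"[^"]*"')

def redact(statement: str) -> str:
    """Replace sensitive key="value" pairs in a SQL statement with a
    placeholder before the statement is logged."""
    return SENSITIVE.sub(
        lambda m: f'{m.group(1)}="*********(redacted)"', statement
    )

query = 'OPTIONS (url="jdbc:mysql://example.com:3306", user="my_username", password="my_password")'
safe = redact(query)
```

Logging `safe` instead of `query` keeps the connection URL visible for debugging while hiding the credentials.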
[jira] [Resolved] (SPARK-36400) Redact sensitive information in Spark Thrift Server UI
[ https://issues.apache.org/jira/browse/SPARK-36400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta resolved SPARK-36400.
------------------------------------
    Fix Version/s: 3.1.3
                   3.2.0
         Assignee: Kousuke Saruta
       Resolution: Fixed

> Redact sensitive information in Spark Thrift Server UI
> ------------------------------------------------------
>
>                 Key: SPARK-36400
>                 URL: https://issues.apache.org/jira/browse/SPARK-36400
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Web UI
>    Affects Versions: 3.1.2
>            Reporter: Denis Krivenko
>            Assignee: Kousuke Saruta
>            Priority: Major
>             Fix For: 3.2.0, 3.1.3
>
>         Attachments: SQL Statistics.png
>
> Spark UI displays sensitive information on the "JDBC/ODBC Server" tab.
> The cause of the issue is in the [org.apache.spark.sql.hive.thriftserver.ui.SqlStatsPagedTable|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L166] class, [here|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L266-L268]:
> {code:scala}
> {info.statement}
> {code}
[jira] [Updated] (SPARK-36538) Environment variables part in config doc isn't properly documented.
[ https://issues.apache.org/jira/browse/SPARK-36538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuto Akutsu updated SPARK-36538:
--------------------------------
    Description: 
It says environment variables are not reflected through spark-env.sh in YARN cluster mode but I believe they are. I think this part of the document should be removed.
[https://spark.apache.org/docs/latest/configuration.html#environment-variables]

  was:
It says environment variables are not reflected through spark-env.sh in YARN cluster mode although they are. I think this part of the document should be removed.
https://spark.apache.org/docs/latest/configuration.html#environment-variables
[jira] [Commented] (SPARK-36428) the 'seconds' parameter of 'make_timestamp' should accept integer type
[ https://issues.apache.org/jira/browse/SPARK-36428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400740#comment-17400740 ]

Apache Spark commented on SPARK-36428:
--------------------------------------

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/33775

> the 'seconds' parameter of 'make_timestamp' should accept integer type
> ----------------------------------------------------------------------
>
>                 Key: SPARK-36428
>                 URL: https://issues.apache.org/jira/browse/SPARK-36428
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Wenchen Fan
>            Assignee: Apache Spark
>            Priority: Major
>             Fix For: 3.2.0
>
> With ANSI mode, {{SELECT make_timestamp(1, 1, 1, 1, 1, 1)}} fails, because the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be implicitly cast to DECIMAL(8,6) under ANSI mode.
> We should update the function {{make_timestamp}} to allow an integer-typed 'seconds' parameter.
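[Editor's note] DECIMAL(8,6) fits values with up to 2 integral and 6 fractional digits, which covers 0 <= seconds < 60 at microsecond precision. The hypothetical helper below sketches the intended behavior change — widening an integral 'seconds' to that representation — and is illustrative only, not Spark's actual coercion code:

```python
from decimal import Decimal

def make_timestamp_seconds(seconds):
    """Coerce the 'seconds' argument: accept an integral value as well as
    a DECIMAL(8,6)-style fractional value (sketch, not Spark code)."""
    if isinstance(seconds, int):
        seconds = Decimal(seconds)  # widen INT -> DECIMAL, the requested fix
    if not isinstance(seconds, Decimal):
        raise TypeError("seconds must be INT or DECIMAL(8,6)")
    if not (0 <= seconds < 60):
        raise ValueError("seconds out of range for a timestamp")
    # Normalize to 6 fractional digits, matching DECIMAL(8,6) scale.
    return seconds.quantize(Decimal("0.000001"))
```

With this widening in place, an integral argument such as `1` behaves like `1.000000`, so `make_timestamp(1, 1, 1, 1, 1, 1)` no longer needs an implicit INT-to-DECIMAL cast that ANSI mode forbids.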
[jira] [Created] (SPARK-36538) Environment variables part in config doc isn't properly documented.
Yuto Akutsu created SPARK-36538:
-----------------------------------

             Summary: Environment variables part in config doc isn't properly documented.
                 Key: SPARK-36538
                 URL: https://issues.apache.org/jira/browse/SPARK-36538
             Project: Spark
          Issue Type: Documentation
          Components: Documentation
    Affects Versions: 3.1.2
            Reporter: Yuto Akutsu

It says environment variables are not reflected through spark-env.sh in YARN cluster mode although they are. I think this part of the document should be removed.
https://spark.apache.org/docs/latest/configuration.html#environment-variables
[jira] [Commented] (SPARK-36519) Store the RocksDB format in the checkpoint for a streaming query
[ https://issues.apache.org/jira/browse/SPARK-36519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400736#comment-17400736 ]

Gengliang Wang commented on SPARK-36519:
----------------------------------------

[~zsxwing] FYI I am converting this one as a sub-task of SPARK-34198

> Store the RocksDB format in the checkpoint for a streaming query
> ----------------------------------------------------------------
>
>                 Key: SPARK-36519
>                 URL: https://issues.apache.org/jira/browse/SPARK-36519
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 3.2.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>            Priority: Major
>
> RocksDB provides backward compatibility but it doesn't always provide forward compatibility. It's better to store the RocksDB format version in the checkpoint, so that it gives us more information to provide the rollback guarantee when we upgrade to a RocksDB version that may introduce an incompatible change in a new Spark version.
> A typical case is when a user upgrades their query to a new Spark version, and this new Spark version has a new RocksDB version which may use a new format. But the user hits some bug and decides to roll back. In the old Spark version, the old RocksDB version cannot read the new format.
> In order to handle this case, we will write the RocksDB format version to the checkpoint. When restarting from a checkpoint, we will force RocksDB to use the format version stored in the checkpoint. This ensures the user can roll back their Spark version if needed.
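[Editor's note] The rollback scheme described in SPARK-36519 boils down to pinning a format version when the checkpoint is first created and forcing that version on every restart. A minimal sketch, assuming a hypothetical `state_format.json` file inside the checkpoint directory (the real implementation lives in Spark's RocksDB state store provider and uses a different layout):

```python
import json
from pathlib import Path

def write_format_version(checkpoint_dir: str, version: int) -> None:
    """Record the state-store file format version in the checkpoint so a
    restarted query keeps writing the format it started with."""
    meta = Path(checkpoint_dir) / "state_format.json"
    meta.write_text(json.dumps({"rocksdbFormatVersion": version}))

def read_format_version(checkpoint_dir: str, default: int) -> int:
    """On restart, force the version stored in the checkpoint; fall back
    to the engine default only for brand-new checkpoints."""
    meta = Path(checkpoint_dir) / "state_format.json"
    if meta.exists():
        return json.loads(meta.read_text())["rocksdbFormatVersion"]
    return default
```

Because the pinned version always wins over the engine default, a query started on an old Spark version keeps producing old-format files even after an upgrade, so rolling the Spark version back remains safe.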
[jira] [Updated] (SPARK-36519) Store the RocksDB format in the checkpoint for a streaming query
[ https://issues.apache.org/jira/browse/SPARK-36519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang updated SPARK-36519:
-----------------------------------
        Parent: SPARK-34198
    Issue Type: Sub-task  (was: Improvement)
[jira] [Commented] (SPARK-36303) Refactor fourteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400717#comment-17400717 ]

Apache Spark commented on SPARK-36303:
--------------------------------------

User 'dgd-contributor' has created a pull request for this issue:
https://github.com/apache/spark/pull/33774

> Refactor fourteenth set of 20 query execution errors to use error classes
> -------------------------------------------------------------------------
>
>                 Key: SPARK-36303
>                 URL: https://issues.apache.org/jira/browse/SPARK-36303
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core, SQL
>    Affects Versions: 3.2.0
>            Reporter: Karen Feng
>            Priority: Major
>
> Refactor some exceptions in [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on the fourteenth set of 20:
> {code:java}
> cannotGetEventTimeWatermarkError
> cannotSetTimeoutTimestampError
> batchMetadataFileNotFoundError
> multiStreamingQueriesUsingPathConcurrentlyError
> addFilesWithAbsolutePathUnsupportedError
> microBatchUnsupportedByDataSourceError
> cannotExecuteStreamingRelationExecError
> invalidStreamingOutputModeError
> catalogPluginClassNotFoundError
> catalogPluginClassNotImplementedError
> catalogPluginClassNotFoundForCatalogError
> catalogFailToFindPublicNoArgConstructorError
> catalogFailToCallPublicNoArgConstructorError
> cannotInstantiateAbstractCatalogPluginClassError
> failedToInstantiateConstructorForCatalogError
> noSuchElementExceptionError
> noSuchElementExceptionError
> cannotMutateReadOnlySQLConfError
> cannotCloneOrCopyReadOnlySQLConfError
> cannotGetSQLConfInSchedulerEventLoopThreadError
> {code}
> For more detail, see the parent ticket SPARK-36094.
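[Editor's note] The error-class pattern this refactoring moves toward identifies each error by a named class plus a parameterized message template, instead of an ad-hoc hard-coded string per call site (in Spark the templates live in a JSON resource file). A minimal sketch of the mechanism, with hypothetical class names and templates:

```python
# Registry of error classes -> message templates; the entries here are
# illustrative, not the templates Spark actually ships.
ERROR_CLASSES = {
    "CANNOT_MUTATE_READ_ONLY_SQL_CONF": "Cannot mutate ReadOnlySQLConf.",
    "BATCH_METADATA_FILE_NOT_FOUND": "Batch metadata file {path} not found.",
}

class SparkError(Exception):
    """Exception carrying a stable error-class identifier plus a message
    rendered from that class's template and the given parameters."""

    def __init__(self, error_class: str, **params: str):
        self.error_class = error_class
        message = ERROR_CLASSES[error_class].format(**params)
        super().__init__(f"[{error_class}] {message}")

err = SparkError("BATCH_METADATA_FILE_NOT_FOUND", path="/tmp/offsets/0")
```

The stable identifier is what makes this refactoring worthwhile: callers and tests can match on `error_class` while the human-readable template can evolve independently.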
[jira] [Assigned] (SPARK-36303) Refactor fourteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36303:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-36303) Refactor fourteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36303:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-36303) Refactor fourteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400716#comment-17400716 ]

Apache Spark commented on SPARK-36303:
--------------------------------------

User 'dgd-contributor' has created a pull request for this issue:
https://github.com/apache/spark/pull/33774
[jira] [Commented] (SPARK-34309) Use Caffeine instead of Guava Cache
[ https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400696#comment-17400696 ] Apache Spark commented on SPARK-34309: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/33772 > Use Caffeine instead of Guava Cache > --- > > Key: SPARK-34309 > URL: https://issues.apache.org/jira/browse/SPARK-34309 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png > > > Caffeine is a high-performance, near-optimal caching library based on Java 8. It is used in a similar way to Guava Cache, but with better performance; comparison results are available on the [Caffeine benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page. > At the same time, Caffeine has been used in some open source projects such as Cassandra, HBase, Neo4j, Druid, Spring, and so on. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34415) Use randomization as a possibly better technique than grid search in optimizing hyperparameters
[ https://issues.apache.org/jira/browse/SPARK-34415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400655#comment-17400655 ] Sean R. Owen commented on SPARK-34415: -- I agree; I think it was mostly because it makes it simple to extend and reuse the param grid builder rather than reimplement a fair bit more code that uses it. It isn't as useful as generating random samples each time. Hm, on a second look though, couldn't the new class override build() to generate a bunch of actually randomly-sampled combinations? That part is easy, I think, but then the question is: how many combinations to return? That would need a new API somewhere. You could argue this is a bit misleading, as the caller may expect it to generate random samples, not randomly generate the grid. Hm, I'm retroactively on the fence about it. Is it worth trying to redesign quickly for 3.2.0? Maybe a small impl and API change can support what this might be expected to do. Leave it? Revert? > Use randomization as a possibly better technique than grid search in > optimizing hyperparameters > --- > > Key: SPARK-34415 > URL: https://issues.apache.org/jira/browse/SPARK-34415 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 3.0.1 >Reporter: Phillip Henry >Assignee: Phillip Henry >Priority: Minor > Labels: pull-request-available > Fix For: 3.2.0 > > > Randomization can be a more effective technique than a grid search in > finding optimal hyperparameters, since min/max points can fall between the > grid lines and never be found. Randomisation is not so restricted, although > the probability of finding minima/maxima is dependent on the number of > attempts.
> Alice Zheng has an accessible description on how this technique works at > [https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html] > (Note that I have a PR for this work outstanding at > [https://github.com/apache/spark/pull/31535] ) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34415) Use randomization as a possibly better technique than grid search in optimizing hyperparameters
[ https://issues.apache.org/jira/browse/SPARK-34415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400644#comment-17400644 ] Xiangrui Meng commented on SPARK-34415: --- [~phenry] [~srowen] The implementation doesn't do uniform sampling of the hyper-parameter search space. Instead, it samples per param and then constructs the cartesian product of all combinations. I think this would significantly reduce the effectiveness of the random search. Was it already discussed? > Use randomization as a possibly better technique than grid search in > optimizing hyperparameters > --- > > Key: SPARK-34415 > URL: https://issues.apache.org/jira/browse/SPARK-34415 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 3.0.1 >Reporter: Phillip Henry >Assignee: Phillip Henry >Priority: Minor > Labels: pull-request-available > Fix For: 3.2.0 > > > Randomization can be a more effective technique than a grid search in > finding optimal hyperparameters, since min/max points can fall between the > grid lines and never be found. Randomisation is not so restricted, although > the probability of finding minima/maxima is dependent on the number of > attempts. > Alice Zheng has an accessible description of how this technique works at > [https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html] > (Note that I have a PR for this work outstanding at > [https://github.com/apache/spark/pull/31535] ) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
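The distinction Xiangrui Meng raises can be made concrete with a small, self-contained Python sketch (not Spark code; the parameter names are illustrative): sampling a few values per parameter and taking the cartesian product confines every candidate to a small grid, whereas true random search draws each candidate independently from the full space.

```python
import itertools
import random

def per_param_then_grid(param_space, seed=0):
    """The criticized approach: sample 2 values per parameter, then take the
    cartesian product. All candidates reuse the same few per-param values."""
    rng = random.Random(seed)
    sampled = {name: [rng.uniform(lo, hi) for _ in range(2)]
               for name, (lo, hi) in param_space.items()}
    names = list(sampled)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(sampled[n] for n in names))]

def uniform_random_search(param_space, n, seed=0):
    """True random search: every candidate is an independent uniform draw
    from the full search space."""
    rng = random.Random(seed)
    return [{name: rng.uniform(lo, hi) for name, (lo, hi) in param_space.items()}
            for _ in range(n)]

space = {"regParam": (0.0, 1.0), "elasticNetParam": (0.0, 1.0)}
grid = per_param_then_grid(space)          # 4 candidates on a 2x2 grid
rand = uniform_random_search(space, n=4)   # 4 candidates spread over the space
```

In the grid variant only 2 distinct values of each parameter ever appear across all 4 candidates, which is why the randomly-generated grid explores far less of the space than 4 independent draws.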
[jira] [Updated] (SPARK-36493) Skip Retrieving keytab with SparkFiles.get if keytab found in the CWD of Yarn Container
[ https://issues.apache.org/jira/browse/SPARK-36493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zikun updated SPARK-36493: -- Description: Currently we have the logic to deal with the JDBC keytab provided by the "--files" option:

if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty) {
  val result = SparkFiles.get(keytabParam)
  logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result")
  result
} else {
  logDebug("Keytab path found, assuming manual upload")
  keytabParam
}

Spark has already created a soft link for any file submitted by the "--files" option. Here is an example.

testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab

So there is no need to call SparkFiles.get to get the absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path. Moreover, SparkFiles.get will return a wrong keytab path for the driver in cluster mode. In cluster mode, the keytab is available at the following location for both the driver and executors:

/var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/container_1628584679772_0030_01_01/testusera1.keytab

but SparkFiles.get brings the following wrong location for the driver:

/var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab

was: (previous description; identical except it said the keytab "is distributed to", rather than "is available at", the container location)

> Skip Retrieving keytab with SparkFiles.get if keytab found in the CWD of Yarn Container
> ---
>
> Key: SPARK-36493
> URL: https://issues.apache.org/jira/browse/SPARK-36493
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.0, 3.1.2
> Reporter: Zikun
> Priority: Major
> Fix For: 3.1.3
>
> Currently we have the logic to deal with the JDBC keytab provided by the "--files" option:
>
> if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty) {
>   val result = SparkFiles.get(keytabParam)
>   logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result")
>   result
> } else {
>   logDebug("Keytab path found, assuming manual upload")
>   keytabParam
> }
>
> Spark has already created a soft link for any file submitted by the "--files" option. Here is an example.
>
> testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab
>
> So there is no need to call SparkFiles.get to get the absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path.
>
> Moreover, SparkFiles.get will return a wrong keytab path for the driver in cluster mode. In cluster mode, the keytab is available at the following location for both the driver and executors:
>
> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/container_1628584679772_0030_01_01/testusera1.keytab
>
> bu
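The resolution order the ticket proposes can be sketched in a few lines of Python (illustrative only; `resolve_keytab` is a hypothetical helper, not Spark's code): if the option value is a bare file name that already exists in the container's working directory — where YARN symlinks every `--files` entry — use it directly rather than asking SparkFiles.get for a path.

```python
import os

def resolve_keytab(keytab_param, cwd):
    """Hypothetical sketch of the proposed check: prefer the YARN-localized
    soft link in the container's CWD over a SparkFiles.get lookup."""
    if keytab_param is None:
        return None
    if not os.path.dirname(keytab_param):
        # Bare file name: YARN localization symlinks --files entries here.
        candidate = os.path.join(cwd, keytab_param)
        if os.path.exists(candidate):
            return candidate
    # An explicit path was given (manual upload): use it as-is.
    return keytab_param
```

This sidesteps the cluster-mode discrepancy described above, because the container CWD path is the same for the driver and the executors.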
[jira] [Resolved] (SPARK-36535) refine the sql reference doc
[ https://issues.apache.org/jira/browse/SPARK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36535. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33767 [https://github.com/apache/spark/pull/33767] > refine the sql reference doc > > > Key: SPARK-36535 > URL: https://issues.apache.org/jira/browse/SPARK-36535 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36535) refine the sql reference doc
[ https://issues.apache.org/jira/browse/SPARK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36535: - Assignee: Wenchen Fan > refine the sql reference doc > > > Key: SPARK-36535 > URL: https://issues.apache.org/jira/browse/SPARK-36535 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36537) Take care of other tests disabled related to inplace updates with CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-36537: -- Description: There are some more tests disabled related to inplace updates with CategoricalDtype. They seem like pandas' bugs or not maintained anymore because inplace updates with CategoricalDtype are deprecated. was:There are some more tests disabled with a marker {{TODO(SPARK-36367)}}. > Take care of other tests disabled related to inplace updates with > CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled related to inplace updates with > CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace updates > with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36537) Take care of other tests disabled related to inplace updates with CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-36537: -- Summary: Take care of other tests disabled related to inplace updates with CategoricalDtype. (was: Take care of other tests disabled.) > Take care of other tests disabled related to inplace updates with > CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled with a marker {{TODO(SPARK-36367)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36537) Take care of other tests disabled.
Takuya Ueshin created SPARK-36537: - Summary: Take care of other tests disabled. Key: SPARK-36537 URL: https://issues.apache.org/jira/browse/SPARK-36537 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin There are some more tests disabled with a marker {{TODO(SPARK-36367)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35011) Avoid Block Manager registrations when StopExecutor msg is in-flight.
[ https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400579#comment-17400579 ] Apache Spark commented on SPARK-35011: -- User 'sumeetgajjar' has created a pull request for this issue: https://github.com/apache/spark/pull/33771 > Avoid Block Manager registrations when StopExecutor msg is in-flight. > -- > > Key: SPARK-35011 > URL: https://issues.apache.org/jira/browse/SPARK-35011 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.1, 3.2.0 >Reporter: Sumeet >Assignee: Sumeet >Priority: Major > Labels: BlockManager, core > Fix For: 3.2.0 > > > *Note:* This is a follow-up on SPARK-34949: even after the heartbeat fix, the > driver reports dead executors as alive. > *Problem:* > I was testing Dynamic Allocation on K8s with about 300 executors. While doing > so, when the executors were torn down due to > "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor > pods being removed from K8s; however, under the "Executors" tab in SparkUI, I > could still see some executors listed as alive. > [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100] > also returned a value greater than 1. > > *Cause:* > * "CoarseGrainedSchedulerBackend" issues an async "StopExecutor" on the > executorEndpoint > * "CoarseGrainedSchedulerBackend" removes that executor from the Driver's > internal data structures and publishes "SparkListenerExecutorRemoved" on the > "listenerBus" > * The Executor has still not processed "StopExecutor" from the Driver > * The Driver receives a heartbeat from the Executor; since it cannot find the > "executorId" in its data structures, it responds with > "HeartbeatResponse(reregisterBlockManager = true)" > * The "BlockManager" on the Executor re-registers with the "BlockManagerMaster", > and "SparkListenerBlockManagerAdded" is published on the "listenerBus" > * The Executor starts processing the "StopExecutor" and exits > * "AppStatusListener" picks up the "SparkListenerBlockManagerAdded" event and > updates "AppStatusStore" > * "statusTracker.getExecutorInfos" consults "AppStatusStore" to get the list > of executors, which returns the dead executor as alive. > > *Proposed Solution:* > Maintain a cache of recently removed executors on the Driver. During registration in > BlockManagerMasterEndpoint, if the BlockManager belongs to a > recently removed executor, return None, indicating the registration is ignored, > since the executor will be shutting down soon. > On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed > executor, return true, indicating the driver knows about it, thereby > preventing re-registration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
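The proposed driver-side cache can be sketched as follows. This is a minimal Python illustration of the idea, not Spark's implementation; the class, method names, and TTL-based expiry are assumptions for the sketch (Spark would use its own cache and messaging types).

```python
import time

class RecentlyRemovedExecutors:
    """Sketch of the proposed fix: remember executors the driver removed
    for a bounded window, and ignore block-manager registrations from them."""
    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._removed = {}  # executor_id -> removal timestamp

    def mark_removed(self, executor_id):
        self._removed[executor_id] = self._clock()

    def was_recently_removed(self, executor_id):
        ts = self._removed.get(executor_id)
        if ts is None:
            return False
        if self._clock() - ts > self._ttl:
            del self._removed[executor_id]  # lazy expiry after the TTL
            return False
        return True

    def register_block_manager(self, executor_id):
        # None signals "registration ignored": the executor is shutting down.
        if self.was_recently_removed(executor_id):
            return None
        return f"bm-{executor_id}"
```

A heartbeat handler would apply the same check, answering "known" for recently removed executors so they do not re-register.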
[jira] [Commented] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down
[ https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400577#comment-17400577 ] Apache Spark commented on SPARK-34949: -- User 'sumeetgajjar' has created a pull request for this issue: https://github.com/apache/spark/pull/33770 > Executor.reportHeartBeat reregisters blockManager even when Executor is > shutting down > - > > Key: SPARK-34949 > URL: https://issues.apache.org/jira/browse/SPARK-34949 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.1, 3.2.0 > Environment: Resource Manager: K8s >Reporter: Sumeet >Assignee: Sumeet >Priority: Major > Labels: Executor, heartbeat > Fix For: 3.1.2, 3.2.0 > > > *Problem:* > I was testing Dynamic Allocation on K8s with about 300 executors. While doing > so, when the executors were torn down due to > "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor > pods being removed from K8s, however, under the "Executors" tab in SparkUI, I > could see some executors listed as alive. > [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100] > also returned a value greater than 1. > > *Cause:* > * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on a > "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the > "listenerBus" > * "CoarseGrainedExecutorBackend" starts the executor shutdown > * "HeartbeatReceiver" picks the "SparkListenerExecutorRemoved" event and > removes the executor from "executorLastSeen" > * In the meantime, the executor reports a Heartbeat. 
Now "HeartbeatReceiver" > cannot find the "executorId" in "executorLastSeen" and hence responds with > "HeartbeatResponse(reregisterBlockManager = true)" > * The Executor now calls "env.blockManager.reregister()" and reregisters > itself thus creating inconsistency > > *Proposed Solution:* > The "reportHeartBeat" method is not aware of the fact that Executor is > shutting down, it should check "executorShutdown" before reregistering. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
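The fix direction described above — have the heartbeat path consult a shutdown flag before honoring a re-registration request — can be shown with a tiny sketch. The field and method names are illustrative, not Spark's actual `Executor` members.

```python
class ExecutorSketch:
    """Illustration of the SPARK-34949 fix idea: reportHeartBeat should
    check a shutdown flag before re-registering the block manager."""
    def __init__(self):
        self.executor_shutdown = False  # set when shutdown begins
        self.reregistered = False

    def handle_heartbeat_response(self, reregister_block_manager):
        # Only re-register if we are NOT already shutting down; otherwise a
        # dying executor would reappear as alive in the driver's state.
        if reregister_block_manager and not self.executor_shutdown:
            self.reregistered = True  # stands in for env.blockManager.reregister()
        return self.reregistered
```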
[jira] [Assigned] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
[ https://issues.apache.org/jira/browse/SPARK-36536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36536: Assignee: Apache Spark (was: Max Gekk) > Split the JSON/CSV option of datetime format to in read and in write > > > Key: SPARK-36536 > URL: https://issues.apache.org/jira/browse/SPARK-36536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > This is a follow up of https://issues.apache.org/jira/browse/SPARK-36418. > Need to split JSON and CSV options *dateFormat* and *timestampFormat*. In > write, should be the same but in read the option shouldn't be set to a > default value. In this way, DateFormatter and TimestampFormatter will use the > CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
[ https://issues.apache.org/jira/browse/SPARK-36536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36536: Assignee: Max Gekk (was: Apache Spark) > Split the JSON/CSV option of datetime format to in read and in write > > > Key: SPARK-36536 > URL: https://issues.apache.org/jira/browse/SPARK-36536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > This is a follow up of https://issues.apache.org/jira/browse/SPARK-36418. > Need to split JSON and CSV options *dateFormat* and *timestampFormat*. In > write, should be the same but in read the option shouldn't be set to a > default value. In this way, DateFormatter and TimestampFormatter will use the > CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
[ https://issues.apache.org/jira/browse/SPARK-36536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400574#comment-17400574 ] Apache Spark commented on SPARK-36536: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33769 > Split the JSON/CSV option of datetime format to in read and in write > > > Key: SPARK-36536 > URL: https://issues.apache.org/jira/browse/SPARK-36536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > This is a follow up of https://issues.apache.org/jira/browse/SPARK-36418. > Need to split JSON and CSV options *dateFormat* and *timestampFormat*. In > write, should be the same but in read the option shouldn't be set to a > default value. In this way, DateFormatter and TimestampFormatter will use the > CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
Max Gekk created SPARK-36536: Summary: Split the JSON/CSV option of datetime format to in read and in write Key: SPARK-36536 URL: https://issues.apache.org/jira/browse/SPARK-36536 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk Assignee: Max Gekk This is a follow up of https://issues.apache.org/jira/browse/SPARK-36418. Need to split JSON and CSV options *dateFormat* and *timestampFormat*. In write, should be the same but in read the option shouldn't be set to a default value. In this way, DateFormatter and TimestampFormatter will use the CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
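The read/write asymmetry described in the ticket can be sketched outside Spark. This Python illustration uses `strftime`/`strptime` codes as stand-ins for Spark's pattern letters; the fallback list approximating CAST-style lenient parsing is an assumption for the sketch, not Spark's actual logic.

```python
from datetime import datetime, date

def write_date(d, fmt=None):
    """On write, a default pattern is always applied (yyyy-MM-dd analogue)."""
    return d.strftime(fmt if fmt is not None else "%Y-%m-%d")

def read_date(s, fmt=None):
    """On read, no default pattern: without an explicit format, fall back to
    lenient, CAST-like parsing instead of forcing a single fixed pattern."""
    if fmt is not None:
        return datetime.strptime(s, fmt).date()
    for candidate in ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y/%m/%d"):
        try:
            return datetime.strptime(s, candidate).date()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {s}")
```

Splitting the option this way keeps output stable while letting readers accept the wider range of inputs that CAST already handles.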
[jira] [Commented] (SPARK-36370) Avoid using SelectionMixin._builtin_table which is removed in pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400544#comment-17400544 ] Apache Spark commented on SPARK-36370: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/33768 > Avoid using SelectionMixin._builtin_table which is removed in pandas 1.3 > > > Key: SPARK-36370 > URL: https://issues.apache.org/jira/browse/SPARK-36370 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36387) Fix Series.astype from datetime to nullable string
[ https://issues.apache.org/jira/browse/SPARK-36387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36387. --- Fix Version/s: 3.3.0 Assignee: Haejoon Lee Resolution: Fixed Issue resolved by pull request 33735 https://github.com/apache/spark/pull/33735 > Fix Series.astype from datetime to nullable string > -- > > Key: SPARK-36387 > URL: https://issues.apache.org/jira/browse/SPARK-36387 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.3.0 > > Attachments: image-2021-08-12-14-24-31-321.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36535) refine the sql reference doc
[ https://issues.apache.org/jira/browse/SPARK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36535: Assignee: (was: Apache Spark) > refine the sql reference doc > > > Key: SPARK-36535 > URL: https://issues.apache.org/jira/browse/SPARK-36535 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36535) refine the sql reference doc
[ https://issues.apache.org/jira/browse/SPARK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400499#comment-17400499 ] Apache Spark commented on SPARK-36535: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/33767 > refine the sql reference doc > > > Key: SPARK-36535 > URL: https://issues.apache.org/jira/browse/SPARK-36535 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36535) refine the sql reference doc
[ https://issues.apache.org/jira/browse/SPARK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36535: Assignee: Apache Spark > refine the sql reference doc > > > Key: SPARK-36535 > URL: https://issues.apache.org/jira/browse/SPARK-36535 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36535) refine the sql reference doc
Wenchen Fan created SPARK-36535: --- Summary: refine the sql reference doc Key: SPARK-36535 URL: https://issues.apache.org/jira/browse/SPARK-36535 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27442) ParquetFileFormat fails to read column named with invalid characters
[ https://issues.apache.org/jira/browse/SPARK-27442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400478#comment-17400478 ] Dror Speiser commented on SPARK-27442: -- Hey, I'm going over the parquet format specification (github page and thrift file), and I don't see any mention of valid or invalid characters for field names in schema elements. Was this a restriction in earlier format specifications? > ParquetFileFormat fails to read column named with invalid characters > > > Key: SPARK-27442 > URL: https://issues.apache.org/jira/browse/SPARK-27442 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0, 2.4.1 >Reporter: Jan Vršovský >Priority: Minor > > When reading a parquet file which contains characters considered invalid, the > reader fails with exception: > Name: org.apache.spark.sql.AnalysisException > Message: Attribute name "..." contains invalid character(s) among " > ,;{}()\n\t=". Please use alias to rename it. > Spark should not be able to write such files, but it should be able to read > it (and allow the user to correct it). However, possible workarounds (such as > using alias to rename the column, or forcing another schema) do not work, > since the check is done on the input. > (Possible fix: remove superficial > {{ParquetWriteSupport.setSchema(requiredSchema, hadoopConf)}} from > {{buildReaderWithPartitionValues}} ?) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
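The character set in the error message quoted above (" ,;{}()\n\t=") is the whole restriction being discussed. A minimal pure-Python sketch of that validation — the function name and structure are illustrative, not Spark's actual implementation (which performs this check while setting up the Parquet write support):

```python
# Characters rejected by Spark's Parquet attribute-name check, taken verbatim
# from the AnalysisException message quoted in the issue above.
INVALID_CHARS = set(" ,;{}()\n\t=")

def column_name_is_valid(name: str) -> bool:
    """Return True if the column name contains none of the rejected characters."""
    return not any(ch in INVALID_CHARS for ch in name)

assert column_name_is_valid("price_usd")
assert not column_name_is_valid("price (usd)")  # space and parentheses are rejected
```

The bug report's point is that this check runs on the *read* path too, so a file written by another system with such a column name cannot even be opened to rename the offending column.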
[jira] [Commented] (SPARK-36352) Spark should check result plan's output schema name
[ https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400445#comment-17400445 ] Apache Spark commented on SPARK-36352: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33764 > Spark should check result plan's output schema name > --- > > Key: SPARK-36352 > URL: https://issues.apache.org/jira/browse/SPARK-36352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36052) Introduce pending pod limit for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-36052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36052: -- Labels: releasenotes (was: ) > Introduce pending pod limit for Spark on K8s > > > Key: SPARK-36052 > URL: https://issues.apache.org/jira/browse/SPARK-36052 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Labels: releasenotes > Fix For: 3.2.0, 3.3.0 > > > Introduce a new configuration to limit the number of pending PODs for Spark > on K8S as the K8S scheduler could be overloaded with requests which slows > down the resource allocations (especially in case of dynamic allocation). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
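The idea behind the new configuration can be sketched in a few lines: when deciding how many new executor pods to request, cap the request by the remaining headroom under the pending-pod limit instead of asking for everything at once. This is an illustrative model only (function and parameter names are hypothetical, not Spark's internals):

```python
def executors_to_request(target_total: int, running: int, pending: int,
                         max_pending: int) -> int:
    """Illustrative sketch: request only as many new executor pods as fit
    under the pending-pod cap, so the K8s scheduler is not flooded with
    requests during aggressive (e.g. dynamic-allocation) scale-up."""
    wanted = max(0, target_total - running - pending)
    headroom = max(0, max_pending - pending)
    return min(wanted, headroom)

# With 2 running, 3 already pending, and a cap of 5 pending pods,
# only 2 of the 5 still-wanted executors are requested this round.
assert executors_to_request(10, 2, 3, 5) == 2
```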
[jira] [Commented] (SPARK-23693) SQL function uuid()
[ https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400411#comment-17400411 ] Jean Georges Perrin commented on SPARK-23693: - [~rxin] - You could require a parameter to the function this should make it deterministic. > SQL function uuid() > --- > > Key: SPARK-23693 > URL: https://issues.apache.org/jira/browse/SPARK-23693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Arseniy Tashoyan >Priority: Minor > > Add function uuid() to org.apache.spark.sql.functions that returns > [Universally Unique > ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. > Sometimes it is necessary to uniquely identify each row in a DataFrame. > Currently the following ways are available: > * monotonically_increasing_id() function > * row_number() function over some window > * convert the DataFrame to RDD and zipWithIndex() > All these approaches do not work when appending this DataFrame to another > DataFrame (union). Collisions may occur - two rows in different DataFrames > may have the same ID. Re-generating IDs on the resulting DataFrame is not an > option, because some data in some other system may already refer to old IDs. > The proposed solution is to add new function: > {code:scala} > def uuid(): Column > {code} > that returns String representation of UUID. > UUID is represented as a 128-bit number (two long numbers). Such numbers are > not supported in Scala or Java. In addition, some storage systems do not > support 128-bit numbers (Parquet's largest numeric type is INT96). This is > the reason for the uuid() function to return String. > I already have a simple implementation based on > [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I > can share it as a PR. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
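The proposal's core claim — random UUID strings stay unique across a union of DataFrames, unlike monotonically_increasing_id() — can be illustrated outside Spark with a plain-Python sketch (the helper name is hypothetical):

```python
import uuid

def add_uuid_column(rows):
    """Attach a random (version 4) UUID string to each row, mirroring the
    proposed uuid() column function; UUIDs are returned as strings because
    128-bit integers are not representable in Scala/Java or in Parquet."""
    return [{**row, "id": str(uuid.uuid4())} for row in rows]

df1 = add_uuid_column([{"v": 1}, {"v": 2}])
df2 = add_uuid_column([{"v": 3}])
union = df1 + df2
# Unlike sequence-based IDs, independently generated UUIDs do not collide
# (up to negligible probability), so the union keeps one distinct ID per row.
assert len({r["id"] for r in union}) == len(union)
```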
[jira] [Commented] (SPARK-36533) Allow streaming queries with Trigger.Once run in multiple batches
[ https://issues.apache.org/jira/browse/SPARK-36533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400382#comment-17400382 ] Apache Spark commented on SPARK-36533: -- User 'bozhang2820' has created a pull request for this issue: https://github.com/apache/spark/pull/33763 > Allow streaming queries with Trigger.Once run in multiple batches > - > > Key: SPARK-36533 > URL: https://issues.apache.org/jira/browse/SPARK-36533 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Bo Zhang >Priority: Major > > Currently streaming queries with Trigger.Once will always load all of the > available data in a single batch. Because of this, the amount of data the > queries can process is limited, or Spark driver will be out of memory. > We should allow streaming queries with Trigger.Once run in multiple batches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36533) Allow streaming queries with Trigger.Once run in multiple batches
[ https://issues.apache.org/jira/browse/SPARK-36533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36533: Assignee: Apache Spark > Allow streaming queries with Trigger.Once run in multiple batches > - > > Key: SPARK-36533 > URL: https://issues.apache.org/jira/browse/SPARK-36533 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Bo Zhang >Assignee: Apache Spark >Priority: Major > > Currently streaming queries with Trigger.Once will always load all of the > available data in a single batch. Because of this, the amount of data the > queries can process is limited, or Spark driver will be out of memory. > We should allow streaming queries with Trigger.Once run in multiple batches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36533) Allow streaming queries with Trigger.Once run in multiple batches
[ https://issues.apache.org/jira/browse/SPARK-36533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36533: Assignee: (was: Apache Spark) > Allow streaming queries with Trigger.Once run in multiple batches > - > > Key: SPARK-36533 > URL: https://issues.apache.org/jira/browse/SPARK-36533 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Bo Zhang >Priority: Major > > Currently streaming queries with Trigger.Once will always load all of the > available data in a single batch. Because of this, the amount of data the > queries can process is limited, or Spark driver will be out of memory. > We should allow streaming queries with Trigger.Once run in multiple batches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
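The feature request above boils down to batch planning: rather than one batch covering all available data (which can exhaust driver memory), split the backlog into bounded batches. A language-agnostic sketch of that splitting, with an illustrative per-batch cap (Spark's actual mechanism would operate on source offsets, not materialized records):

```python
def plan_batches(available_records, max_records_per_batch):
    """Illustrative only: instead of one batch over everything (the current
    Trigger.Once behaviour), cap each batch so no single batch has to cover
    all available data at once."""
    batches = []
    for start in range(0, len(available_records), max_records_per_batch):
        batches.append(available_records[start:start + max_records_per_batch])
    return batches

# 10 backlogged records with a cap of 4 per batch yield three batches.
assert plan_batches(list(range(10)), 4) == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```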
[jira] [Updated] (SPARK-36493) Skip Retrieving keytab with SparkFiles.get if keytab found in the CWD of Yarn Container
[ https://issues.apache.org/jira/browse/SPARK-36493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zikun updated SPARK-36493: -- Summary: Skip Retrieving keytab with SparkFiles.get if keytab found in the CWD of Yarn Container (was: SparkFiles.get is not needed for the JDBC keytab provided by the "--files" option) > Skip Retrieving keytab with SparkFiles.get if keytab found in the CWD of Yarn > Container > --- > > Key: SPARK-36493 > URL: https://issues.apache.org/jira/browse/SPARK-36493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.2 >Reporter: Zikun >Priority: Major > Fix For: 3.1.3 > > > Currently we have the logic to deal with the JDBC keytab provided by the > "--files" option > {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} > \{{{}} > {{}}{{val result = SparkFiles.get(keytabParam)}} > {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on > executor: $result")}} > {{}}{{result}} > {{}}} {{else {}} > {{}}{{logDebug("Keytab path found, assuming manual upload")}} > {{}}{{keytabParam}} > {{}}} > Spark has already created the soft link for any file submitted by the > "--files" option. Here is an example. > testusera1.keytab -> > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab > > So there is no need to call the SparkFiles.get to absolute path of the keytab > file. We can directly use the variable `keytabParam` as the keytab file path. > > Moreover, SparkFiles.get will get a wrong path of keytab for the driver in > cluster mode. 
In cluster mode, the keytab is distributed to the following > location for both the driver and executors > {{/var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/container_1628584679772_0030_01_01/testusera1.keytab}} > but SparkFiles.get brings the following wrong location for the driver > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
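The branching logic quoted (in Jira markup) in the issue above is: if the keytab parameter has no directory component, assume it was shipped via --files and resolve it through SparkFiles.get; otherwise use it verbatim. A pure-Python sketch of that decision, with `spark_files_get` standing in for SparkFiles.get and `os.path.dirname` approximating FilenameUtils.getPath:

```python
import os

def resolve_keytab(keytab_param, spark_files_get):
    """Sketch of the current logic: a bare file name is assumed to come from
    --files and is resolved through SparkFiles.get; a path with a directory
    component is assumed to be a manual upload and is used as-is.
    The issue argues the first branch is unnecessary (and wrong for the driver
    in cluster mode) when the keytab is already linked into the container CWD."""
    if keytab_param is not None and os.path.dirname(keytab_param) == "":
        return spark_files_get(keytab_param)
    return keytab_param

# Bare name -> resolved; absolute path -> untouched.
assert resolve_keytab("user.keytab", lambda n: "/tmp/files/" + n) == "/tmp/files/user.keytab"
assert resolve_keytab("/etc/sec/user.keytab", lambda n: "/tmp/files/" + n) == "/etc/sec/user.keytab"
```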
[jira] [Updated] (SPARK-36493) SparkFiles.get is not needed for the JDBC keytab provided by the "--files" option
[ https://issues.apache.org/jira/browse/SPARK-36493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zikun updated SPARK-36493: -- Description: Currently we have the logic to deal with the JDBC keytab provided by the "--files" option {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} \{{{}} {{}}{{val result = SparkFiles.get(keytabParam)}} {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result")}} {{}}{{result}} {{}}} {{else {}} {{}}{{logDebug("Keytab path found, assuming manual upload")}} {{}}{{keytabParam}} {{}}} Spark has already created the soft link for any file submitted by the "--files" option. Here is an example. testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab So there is no need to call the SparkFiles.get to absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path. Moreover, SparkFiles.get will get a wrong path of keytab for the driver in cluster mode. 
In cluster mode, the keytab is distributed to the following location for both the driver and executors {{/var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/container_1628584679772_0030_01_01/testusera1.keytab}} but SparkFiles.get brings the following wrong location for the driver /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab was: Currently we have the logic to deal with the JDBC keytab provided by the "--files" option {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} {{{}} {{}}{{val result = SparkFiles.get(keytabParam)}} {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result")}} {{}}{{result}} {{}}} {{else {}} {{}}{{logDebug("Keytab path found, assuming manual upload")}} {{}}{{keytabParam}} {{}}} Spark has already created the soft link for any file submitted by the "--files" option. Here is an example. testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab So there is no need to call the SparkFiles.get to absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path. Moreover, SparkFiles.get will get a wrong path of keytab for the driver in cluster mode. 
In cluster mode, the keytab is distributed to the following location for both the driver and executors /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab but SparkFiles.get brings the following wrong location for the driver /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab > SparkFiles.get is not needed for the JDBC keytab provided by the "--files" > option > - > > Key: SPARK-36493 > URL: https://issues.apache.org/jira/browse/SPARK-36493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.2 >Reporter: Zikun >Priority: Major > Fix For: 3.1.3 > > > Currently we have the logic to deal with the JDBC keytab provided by the > "--files" option > {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} > \{{{}} > {{}}{{val result = SparkFiles.get(keytabParam)}} > {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on > executor: $result")}} > {{}}{{result}} > {{}}} {{else {}} > {{}}{{logDebug("Keytab path found, assuming manual upload")}} > {{}}{{keytabParam}} > {{}}} > Spark has already created the soft link for any file submitted by the > "--files" option. Here is an example. > testusera1.keytab -> > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab > > So there is no need to call the SparkFiles.get to absolute path of the keytab > file. We can directly use the variable `keytabParam` as the keytab file path. > > Moreover, SparkFiles.get will get a wrong path of keytab for the driver in > cluster mode. 
In cluster mode, the keytab is distributed to the following > location for both the driver and executors > {{/var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/container_1628584679772_0030_01_01/testusera1.keytab}} > but SparkFiles.get brings the following wrong loc
[jira] [Commented] (SPARK-35028) ANSI mode: disallow group by aliases
[ https://issues.apache.org/jira/browse/SPARK-35028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400357#comment-17400357 ] Gengliang Wang commented on SPARK-35028: This is reverted in https://github.com/apache/spark/pull/33758 > ANSI mode: disallow group by aliases > > > Key: SPARK-35028 > URL: https://issues.apache.org/jira/browse/SPARK-35028 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0 > > > As per the ANSI SQL standard secion 7.12 : > bq. Each shall unambiguously reference a column > of the table resulting from the . A column referenced in a > is a grouping column. > By forbidding it, we can avoid ambiguous SQL queries like: > SELECT col + 1 as col FROM t GROUP BY col -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36493) SparkFiles.get is not needed for the JDBC keytab provided by the "--files" option
[ https://issues.apache.org/jira/browse/SPARK-36493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zikun updated SPARK-36493: -- Description: Currently we have the logic to deal with the JDBC keytab provided by the "--files" option {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} {{{}} {{}}{{val result = SparkFiles.get(keytabParam)}} {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result")}} {{}}{{result}} {{}}} {{else {}} {{}}{{logDebug("Keytab path found, assuming manual upload")}} {{}}{{keytabParam}} {{}}} Spark has already created the soft link for any file submitted by the "--files" option. Here is an example. testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab So there is no need to call the SparkFiles.get to absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path. Moreover, SparkFiles.get will get a wrong path of keytab for the driver in cluster mode. In cluster mode, the keytab is distributed to the following location for both the driver and executors /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab but SparkFiles.get brings the following wrong location for the driver /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab was: Currently we have the logic to deal with the JDBC keytab provided by the "--files" option if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty) { val result = SparkFiles.get(keytabParam) logDebug(s"Keytab path not found, assuming --files, file name used on executor: $result") result } Spark has already created the soft link for any file submitted by the "--files" option. 
Here is an example. testusera1.keytab -> /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab So there is no need to call the SparkFiles.get to absolute path of the keytab file. We can directly use the variable `keytabParam` as the keytab file path. Moreover, SparkFiles.get will get a wrong path of keytab. In a running Spark cluster, the keytab is distributed to the following location /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab but SparkFiles.get brings the following wrong location /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab > SparkFiles.get is not needed for the JDBC keytab provided by the "--files" > option > - > > Key: SPARK-36493 > URL: https://issues.apache.org/jira/browse/SPARK-36493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.2 >Reporter: Zikun >Priority: Major > Fix For: 3.1.3 > > > Currently we have the logic to deal with the JDBC keytab provided by the > "--files" option > {{if (keytabParam != null && FilenameUtils.getPath(keytabParam).isEmpty)}} > {{{}} > {{}}{{val result = SparkFiles.get(keytabParam)}} > {{}}{{logDebug(s"Keytab path not found, assuming --files, file name used on > executor: $result")}} > {{}}{{result}} > {{}}} {{else {}} > {{}}{{logDebug("Keytab path found, assuming manual upload")}} > {{}}{{keytabParam}} > {{}}} > Spark has already created the soft link for any file submitted by the > "--files" option. Here is an example. > testusera1.keytab -> > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab > > So there is no need to call the SparkFiles.get to absolute path of the keytab > file. 
We can directly use the variable `keytabParam` as the keytab file path. > > Moreover, SparkFiles.get will get a wrong path of keytab for the driver in > cluster mode. In cluster mode, the keytab is distributed to the following > location for both the driver and executors > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/filecache/12/testusera1.keytab > but SparkFiles.get brings the following wrong location for the driver > /var/opt/hadoop/temp/nm-local-dir/usercache/testusera1/appcache/application_1628584679772_0003/spark-8fb0f437-c842-4a9f-9612-39de40082e40/userFiles-5075388b-0928-4bc3-a498-7f6c84b27808/testusera1.keytab > > -- This message was sen
[jira] [Updated] (SPARK-36534) No way to check If Spark Session is created successfully with No Exceptions and ready to execute Tasks
[ https://issues.apache.org/jira/browse/SPARK-36534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jahar updated SPARK-36534: -- Description: I am running Spark on Kubernetes in Client mode. Spark driver is spawned programmatically (No Spark-Submit). Below is the dummy code to set SparkSession with KubeApiServer as Master. {code:java} // code placeholder private static SparkSession getSparkSession() { mySparkSessionBuilder = SparkSession.builder() .master("k8s://http://:6443") .appName("spark-K8sDemo") .config("spark.kubernetes.container.image","spark:3.0") .appName("spark-K8sDemo") .config("spark.jars", "/tmp/jt/database-0.0.1-SNAPSHOT-jar-with-dependencies.jar") .config("spark.kubernetes.executor.podTemplateFile","/tmp/jt/sparkExecutorPodTemplate.yaml") .config("spark.kubernetes.container.image.pullPolicy","Always") .config("spark.kubernetes.namespace","my_namespace") .config("spark.driver.host", "spark-driver-example") .config("spark.driver.port", "29413") .config("spark.kubernetes.authenticate.driver.serviceAccountName","spark") .config("spark.extraListeners","K8sPoc.MyHealthCheckListener"); setAditionalConfig(); mySession= mySparkSessionBuilder.getOrCreate(); return mySession; } {code} Now the problem is that, in certain scenarios like if K8s master is not reachable or master URL is incorrect or spark.kubernetes.container.image config is missing then it throws below exceptions (*Exception 1* and *Exception 2* given below). These exceptions are never propagated to Spark Driver program which in turn makes Spark Application in stuck state forever. There should be a way to know via SparkSession or SparkContext object if Session was created successful without any such exceptions and can run SparkTasks?? 
I have looked at SparkSession, SparkContext API documentation and SparkListeners but didn't find any such way to check if SparkSession is ready to run the Tasks or if not then dont keep the Spark Application in hanging state rather return a proper error/warn message to calling API. *Exception 1: (If _spark.kubernetes.container.image_ config is missing:* {code:java} 21/08/16 16:27:07 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 21/08/16 16:27:07 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes. 21/08/16 16:27:07 ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 org.apache.spark.SparkException: Must specify the executor container image at org.apache.spark.deploy.k8s.features.BasicExecutorFeatureStep.$anonfun$executorContainerImage$1(BasicExecutorFeatureStep.scala:41) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.deploy.k8s.features.BasicExecutorFeatureStep.(BasicExecutorFeatureStep.scala:41) at org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:43) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$16(ExecutorPodsAllocator.scala:216) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:208) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1(ExecutorPodsAllocator.scala:82) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1$adapted(ExecutorPodsAllocator.scala:82) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$callSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:110) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber(ExecutorPodsSnapshotsStoreImpl.scala:107) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:71) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} {noformat} {noformat} *Exception 2: (If _K8s master_ is not reachable or w
[jira] [Created] (SPARK-36534) No way to check If Spark Session is created successfully with No Exceptions and ready to execute Tasks
Jahar created SPARK-36534: - Summary: No way to check If Spark Session is created successfully with No Exceptions and ready to execute Tasks Key: SPARK-36534 URL: https://issues.apache.org/jira/browse/SPARK-36534 Project: Spark Issue Type: Improvement Components: Java API, Kubernetes, Scheduler Affects Versions: 3.0.1 Environment: *Spark 3.0.1* Reporter: Jahar I am running Spark on Kubernetes in Client mode. Spark driver is spawned programmatically (No Spark-Submit). Below is the dummy code to set SparkSession with KubeApiServer as Master. {code:java} // code placeholder private static SparkSession getSparkSession() { mySparkSessionBuilder = SparkSession.builder() .master("k8s://http://:6443") .appName("spark-K8sDemo") .config("spark.kubernetes.container.image","spark:3.0") .appName("spark-K8sDemo") .config("spark.jars", "/tmp/jt/database-0.0.1-SNAPSHOT-jar-with-dependencies.jar") .config("spark.kubernetes.executor.podTemplateFile","/tmp/jt/sparkExecutorPodTemplate.yaml") .config("spark.kubernetes.container.image.pullPolicy","Always") .config("spark.kubernetes.namespace","my_namespace") .config("spark.driver.host", "spark-driver-example") .config("spark.driver.port", "29413") .config("spark.kubernetes.authenticate.driver.serviceAccountName","spark") .config("spark.extraListeners","K8sPoc.MyHealthCheckListener"); setAditionalConfig(); mySession= mySparkSessionBuilder.getOrCreate(); return mySession; } {code} Now the problem is that, in certain scenarios like if K8s master is not reachable or master URL is incorrect or spark.kubernetes.container.image config is missing then it throws below exceptions (*Exception 1* and *Exception 2* given below). These exceptions are never propagated to Spark Driver program which in turn makes Spark Application in stuck state forever. There should be a way to know via SparkSession or SparkContext object if Session was created successful without any such exceptions and can run SparkTasks?? 
I have looked at SparkSession, SparkContext API documentation and SparkListeners but didn't find any such way to check if SparkSession is ready to run the Tasks or if not then dont keep the Spark Application in hanging state rather return a proper error/warn message to calling API. *Exception 1: (If _spark.kubernetes.container.image_ config is missing:* {noformat} {noformat} {noformat} 21/08/16 16:27:07 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 21/08/16 16:27:07 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes. 21/08/16 16:27:07 ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 org.apache.spark.SparkException: Must specify the executor container image at org.apache.spark.deploy.k8s.features.BasicExecutorFeatureStep.$anonfun$executorContainerImage$1(BasicExecutorFeatureStep.scala:41) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.deploy.k8s.features.BasicExecutorFeatureStep.(BasicExecutorFeatureStep.scala:41) at org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:43) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$16(ExecutorPodsAllocator.scala:216) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:208) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1(ExecutorPodsAllocator.scala:82) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1$adapted(ExecutorPodsAllocator.scala:82) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$callSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:110) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber(ExecutorPodsSnapshotsStoreImpl.scala:107)
 at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:71)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Schedul
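A readiness check along the lines the reporter asks for can be approximated today by polling from the driver until executors register, failing fast instead of hanging. The sketch below is illustrative and self-contained, not an existing Spark API: `awaitExecutors` and its parameters are assumed names, and in practice the count source would be something like `session.sparkContext().statusTracker().getExecutorInfos().length`.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.IntSupplier;

public class ExecutorWait {
    // Hypothetical helper (not a Spark API): poll an executor-count source,
    // e.g. () -> sc.statusTracker().getExecutorInfos().length, and fail fast
    // with a TimeoutException instead of letting the application hang forever.
    public static void awaitExecutors(IntSupplier executorCount, int minExecutors,
                                      long timeoutMs, long pollMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (executorCount.getAsInt() < minExecutors) {
            if (System.currentTimeMillis() > deadline) {
                throw new TimeoutException(
                    minExecutors + " executors did not register within " + timeoutMs + " ms");
            }
            Thread.sleep(pollMs);
        }
    }
}
```

Calling this right after `getOrCreate()` would surface the "no executors ever came up" scenarios described above as an ordinary exception in the driver program.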
[jira] [Commented] (SPARK-36379) Null at root level of a JSON array causes the parsing failure (w/ permissive mode)
[ https://issues.apache.org/jira/browse/SPARK-36379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400314#comment-17400314 ] Apache Spark commented on SPARK-36379: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33762 > Null at root level of a JSON array causes the parsing failure (w/ permissive > mode) > -- > > Key: SPARK-36379 > URL: https://issues.apache.org/jira/browse/SPARK-36379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.2.0, 3.3.0 > > > {code} > scala> spark.read.json(Seq("""[{"a": "str"}, null, {"a": > "str"}]""").toDS).collect() > {code} > {code} > ... > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1) (172.30.3.20 executor driver): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > {code} > Since the mode (by default) is permissive, we shouldn't just fail like above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
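The permissive mode the report refers to is meant to tolerate malformed records rather than crash. A minimal sketch of that contract (illustrative only, not Spark's actual JSON parser): a null element at the root of the array should become an empty/null row, never a NullPointerException.

```java
import java.util.*;

public class PermissiveRows {
    // Illustrative model of permissive parsing: map each raw record to a row,
    // turning null roots into empty rows instead of dereferencing them.
    public static List<Map<String, String>> toRows(List<Map<String, String>> rawRecords) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (Map<String, String> record : rawRecords) {
            rows.add(record == null ? Collections.<String, String>emptyMap() : record);
        }
        return rows;
    }
}
```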
[jira] [Updated] (SPARK-35426) When addMergerLocation exceeds maxRetainedMergerLocations, we should remove the merger based on merged shuffle data size.
[ https://issues.apache.org/jira/browse/SPARK-35426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated SPARK-35426: --- Description: Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just remove the oldest merger; we should instead remove the merger based on merged shuffle data size. We should remove the mergers with the largest amount of merged shuffle data, so that the remaining mergers have potentially more disk space to store new merged shuffle data. was: Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just remove the oldest merger; we should instead remove the merger based on merged shuffle data size. > When addMergerLocation exceeds maxRetainedMergerLocations, we should > remove the merger based on merged shuffle data size. > - > > Key: SPARK-35426 > URL: https://issues.apache.org/jira/browse/SPARK-35426 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Qi Zhu >Priority: Major > > Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just > remove the oldest merger; we should instead remove the merger based on merged > shuffle data size. > We should remove the mergers with the largest amount of merged shuffle data, so > that the remaining mergers have potentially more disk space to store new > merged shuffle data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35426) When addMergerLocation exceeds maxRetainedMergerLocations, we should remove the merger based on merged shuffle data size.
[ https://issues.apache.org/jira/browse/SPARK-35426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated SPARK-35426: --- Description: Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just remove the oldest merger; we should instead remove the merger based on merged shuffle data size. was: Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just remove the oldest merger; we should instead remove the merger based on merged shuffle data size. The oldest merger may hold a large amount of merged shuffle data, so evicting it is not necessarily a good choice. > When addMergerLocation exceeds maxRetainedMergerLocations, we should > remove the merger based on merged shuffle data size. > - > > Key: SPARK-35426 > URL: https://issues.apache.org/jira/browse/SPARK-35426 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Qi Zhu >Priority: Major > > Now, when addMergerLocation exceeds maxRetainedMergerLocations, we just > remove the oldest merger; we should instead remove the merger based on merged > shuffle data size. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
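The proposed size-based eviction policy can be sketched as a small, self-contained model. This is illustrative only, not Spark's push-based shuffle code; the class and method names are assumptions. On overflow it evicts the merger currently holding the most merged shuffle data, so the survivors have the most free disk for new merged data.

```java
import java.util.*;

public class MergerTracker {
    private final int maxRetained;
    private final Map<String, Long> mergedBytesByHost = new LinkedHashMap<>();

    public MergerTracker(int maxRetained) { this.maxRetained = maxRetained; }

    // Record newly merged shuffle bytes for a known merger location.
    public void recordMergedBytes(String host, long bytes) {
        mergedBytesByHost.merge(host, bytes, Long::sum);
    }

    // Add a merger; if the retained set overflows maxRetained, evict the
    // merger holding the most merged data instead of simply the oldest one.
    public void addMergerLocation(String host) {
        mergedBytesByHost.putIfAbsent(host, 0L);
        if (mergedBytesByHost.size() > maxRetained) {
            String victim = Collections.max(
                mergedBytesByHost.entrySet(), Map.Entry.comparingByValue()).getKey();
            mergedBytesByHost.remove(victim);
        }
    }

    public Set<String> mergers() { return mergedBytesByHost.keySet(); }
}
```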
[jira] [Created] (SPARK-36533) Allow streaming queries with Trigger.Once to run in multiple batches
Bo Zhang created SPARK-36533: Summary: Allow streaming queries with Trigger.Once to run in multiple batches Key: SPARK-36533 URL: https://issues.apache.org/jira/browse/SPARK-36533 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 3.2.0 Reporter: Bo Zhang Currently, streaming queries with Trigger.Once always load all of the available data in a single batch. Because of this, the amount of data a query can process is limited, or the Spark driver may run out of memory. We should allow streaming queries with Trigger.Once to run in multiple batches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
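The requested behavior amounts to planning several bounded batches over the available data instead of one unbounded batch. The sketch below is an illustrative model of that planning step only, not Spark's micro-batch engine; the class name and the flat offset representation are assumptions.

```java
import java.util.*;

public class BatchPlanner {
    // Split the available offset range [startOffset, endOffset) into batches
    // of at most maxPerBatch records, as a run-once query could execute them
    // one after another before terminating, bounding per-batch memory use.
    public static List<long[]> planBatches(long startOffset, long endOffset, long maxPerBatch) {
        List<long[]> batches = new ArrayList<>();
        for (long lo = startOffset; lo < endOffset; lo += maxPerBatch) {
            batches.add(new long[]{lo, Math.min(lo + maxPerBatch, endOffset)});
        }
        return batches;
    }
}
```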
[jira] [Resolved] (SPARK-36524) Add common class/trait for ANSI interval types
[ https://issues.apache.org/jira/browse/SPARK-36524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-36524. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33753 [https://github.com/apache/spark/pull/33753] > Add common class/trait for ANSI interval types > -- > > Key: SPARK-36524 > URL: https://issues.apache.org/jira/browse/SPARK-36524 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.2.0 > > > Currently, there are many places where we check both YearMonthIntervalType > and DayTimeIntervalType in the same match case, like > {code:scala} > case _: YearMonthIntervalType | _: DayTimeIntervalType => false > {code} > We need to add a new trait or abstract class that is extended by both > YearMonthIntervalType and DayTimeIntervalType, so that we can transform the code > above to: > {code:scala} > case _: AnsiIntervalType => false > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
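The refactor described above is the standard marker-supertype pattern. A minimal Java analogue (illustrative only; Spark's actual types are Scala classes in org.apache.spark.sql.types):

```java
public class IntervalTypes {
    // Common supertype, analogous to the proposed AnsiIntervalType trait.
    abstract static class DataType {}
    abstract static class AnsiIntervalType extends DataType {}
    static final class YearMonthIntervalType extends AnsiIntervalType {}
    static final class DayTimeIntervalType extends AnsiIntervalType {}
    static final class StringType extends DataType {}

    // One supertype check replaces matching both interval types explicitly.
    static boolean isAnsiInterval(DataType t) { return t instanceof AnsiIntervalType; }
}
```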
[jira] [Assigned] (SPARK-36518) Spark should support distributing directories to the cluster
[ https://issues.apache.org/jira/browse/SPARK-36518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36518: Assignee: Apache Spark > Spark should support distributing directories to the cluster > > > Key: SPARK-36518 > URL: https://issues.apache.org/jira/browse/SPARK-36518 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.1.0, 3.1.1, 3.1.2 >Reporter: YuanGuanhu >Assignee: Apache Spark >Priority: Major > > Spark currently only supports distributing files to the cluster, but in some > scenarios we need to upload a directory to the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36518) Spark should support distributing directories to the cluster
[ https://issues.apache.org/jira/browse/SPARK-36518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36518: Assignee: (was: Apache Spark) > Spark should support distributing directories to the cluster > > > Key: SPARK-36518 > URL: https://issues.apache.org/jira/browse/SPARK-36518 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.1.0, 3.1.1, 3.1.2 >Reporter: YuanGuanhu >Priority: Major > > Spark currently only supports distributing files to the cluster, but in some > scenarios we need to upload a directory to the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36518) Spark should support distributing directories to the cluster
[ https://issues.apache.org/jira/browse/SPARK-36518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400231#comment-17400231 ] Apache Spark commented on SPARK-36518: -- User 'fhygh' has created a pull request for this issue: https://github.com/apache/spark/pull/33760 > Spark should support distributing directories to the cluster > > > Key: SPARK-36518 > URL: https://issues.apache.org/jira/browse/SPARK-36518 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.1.0, 3.1.1, 3.1.2 >Reporter: YuanGuanhu >Priority: Major > > Spark currently only supports distributing files to the cluster, but in some > scenarios we need to upload a directory to the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
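Until directories are supported directly, a common workaround is to archive the directory and ship the single archive instead of the tree. A self-contained sketch of the archiving step; the helper name is illustrative, and how the archive is then distributed and unpacked is left to the deployment:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class DirZipper {
    // Zip a directory tree into one file that file-only distribution
    // mechanisms can ship; entries keep their paths relative to the root.
    public static void zipDirectory(Path dir, Path zipFile) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile));
             Stream<Path> paths = Files.walk(dir)) {
            for (Path p : (Iterable<Path>) paths.filter(Files::isRegularFile)::iterator) {
                zos.putNextEntry(new ZipEntry(dir.relativize(p).toString().replace('\\', '/')));
                Files.copy(p, zos);
                zos.closeEntry();
            }
        }
    }
}
```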
[jira] [Assigned] (SPARK-36532) Deadlock in CoarseGrainedExecutorBackend.onDisconnected
[ https://issues.apache.org/jira/browse/SPARK-36532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36532: Assignee: (was: Apache Spark) > Deadlock in CoarseGrainedExecutorBackend.onDisconnected > --- > > Key: SPARK-36532 > URL: https://issues.apache.org/jira/browse/SPARK-36532 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > The deadlock has exactly the same root cause as SPARK-14180 but happens > in a different code path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36532) Deadlock in CoarseGrainedExecutorBackend.onDisconnected
[ https://issues.apache.org/jira/browse/SPARK-36532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36532: Assignee: Apache Spark > Deadlock in CoarseGrainedExecutorBackend.onDisconnected > --- > > Key: SPARK-36532 > URL: https://issues.apache.org/jira/browse/SPARK-36532 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > The deadlock has exactly the same root cause as SPARK-14180 but happens > in a different code path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36532) Deadlock in CoarseGrainedExecutorBackend.onDisconnected
[ https://issues.apache.org/jira/browse/SPARK-36532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400188#comment-17400188 ] Apache Spark commented on SPARK-36532: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/33759 > Deadlock in CoarseGrainedExecutorBackend.onDisconnected > --- > > Key: SPARK-36532 > URL: https://issues.apache.org/jira/browse/SPARK-36532 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > The deadlock has exactly the same root cause as SPARK-14180 but happens > in a different code path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
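Without reproducing Spark's actual code, the general shape of this class of bug can be shown with a single-threaded dispatcher: a handler running on the loop blocks on work that itself needs the loop thread, so neither can proceed. The sketch below is illustrative only; the names are assumptions, and `deadlockProne` must never actually be invoked (it would hang).

```java
import java.util.concurrent.*;

public class DispatcherDeadlockDemo {
    private final ExecutorService loop = Executors.newSingleThreadExecutor();

    // BROKEN shape (would hang if called): the task running on the single
    // dispatcher thread blocks on another task queued to the same thread,
    // which can never start -- analogous to a handler like onDisconnected
    // synchronously waiting on work that needs the dispatcher itself.
    public int deadlockProne() throws Exception {
        return loop.submit(() -> loop.submit(() -> 42).get()).get();
    }

    // FIXED shape: hand the blocking work to a different thread, so the
    // dispatcher thread never ends up waiting on itself.
    public int fixed() throws Exception {
        Future<Integer> f = loop.submit(() ->
            CompletableFuture.supplyAsync(() -> 42).join());
        return f.get();
    }

    public void shutdown() { loop.shutdownNow(); }
}
```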