[jira] [Commented] (SPARK-36872) Decommissioning executors get killed before transferring their data because of the hardcoded timeout of 60 secs

2022-01-27 Thread Abhishek Rao (Jira)


https://issues.apache.org/jira/browse/SPARK-36872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483537#comment-17483537

Abhishek Rao commented on SPARK-36872:

Thanks. We'll have a look at this.

> Decommissioning executors get killed before transferring their data because
> of the hardcoded timeout of 60 secs
> ---------------------------------------------------------------------------
>
> Key: SPARK-36872
> URL: https://issues.apache.org/jira/browse/SPARK-36872
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.1.1, 3.1.2, 3.2.0
> Reporter: Shekhar Gupta
> Priority: Trivial
>
> During the graceful decommissioning phase, executors need to transfer all of
> their shuffle and cache data to peer executors. However, they are killed
> before the transfer completes because of the hardcoded 60-second timeout in
> the decommissioning script. When executors die prematurely, the Spark tasks
> running on other executors fail, which causes application failures that are
> hard to debug. To work around this, we wrote a custom script with a longer
> timeout and rebuilt the Spark image, but we would prefer a solution that
> does not require rebuilding the image.
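
For context, the behavior described above comes from the container's
decommission hook. Below is a minimal sketch of that kind of script; it is
illustrative, not the verbatim apache/spark decom.sh. The SIGPWR signal and
the process lookup are assumptions; the hardcoded 60-second wait is the
behavior reported in this ticket.

    #!/usr/bin/env bash
    # Illustrative decommission hook (assumptions: the executor is the
    # container's only java process, and SIGPWR triggers Spark's graceful
    # decommissioning in the JVM).

    WORKER_PID=$(pgrep -o java)       # oldest java process = executor JVM
    echo "Asking worker pid ${WORKER_PID} to decommission"
    kill -s SIGPWR "${WORKER_PID}"

    # The issue reported here: this wait is hardcoded. If shuffle/cache
    # migration takes longer than 60 seconds, the pod is torn down
    # mid-transfer and tasks on other executors fail.
    timeout 60 tail --pid="${WORKER_PID}" -f /dev/null
    echo "Done waiting for worker to exit"

Making that 60 configurable (for example, through an environment variable set
from the pod spec) is the kind of change being requested.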






[jira] [Commented] (SPARK-36872) Decommissioning executors get killed before transferring their data because of the hardcoded timeout of 60 secs

2022-01-27 Thread Shekhar Gupta (Jira)


https://issues.apache.org/jira/browse/SPARK-36872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483281#comment-17483281

Shekhar Gupta commented on SPARK-36872:

[~abhisrao] I am referring to the following script:

https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/decom.sh
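
One way to avoid rebuilding the image is to patch the script at deploy time
and mount the patched copy over the baked-in one. This is an untested sketch:
the image name, the in-container path /opt/decom.sh, and the 300-second value
are assumptions.

    # Extract the script from the stock image and raise the timeout.
    docker run --rm my-spark-image cat /opt/decom.sh > decom.sh
    sed -i 's/timeout 60/timeout 300/' decom.sh

    # Publish the patched copy as a ConfigMap.
    kubectl create configmap spark-decom --from-file=decom.sh

    # Mount it over /opt/decom.sh through an executor pod template
    # (spark.kubernetes.executor.podTemplateFile), using a subPath mount
    # and an executable defaultMode so it shadows the baked-in script.

Whether the mount cleanly shadows the original depends on the image and the
pod template details, so treat this as a starting point rather than a
drop-in fix.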

 

 




[jira] [Commented] (SPARK-36872) Decommissioning executors get killed before transferring their data because of the hardcoded timeout of 60 secs

2022-01-26 Thread Abhishek Rao (Jira)


https://issues.apache.org/jira/browse/SPARK-36872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482861#comment-17482861

Abhishek Rao commented on SPARK-36872:

[~shkhrgpt], could you please share more details on which script you are 
referring to? We're facing similar issues and we're looking for options to fix 
this.




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org