[jira] [Commented] (SPARK-36872) Decommissioning executors get killed before transferring their data because of the hardcoded timeout of 60 secs
[ https://issues.apache.org/jira/browse/SPARK-36872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483537#comment-17483537 ]

Abhishek Rao commented on SPARK-36872:
--------------------------------------

Thanks. We'll have a look at this.

> Decommissioning executors get killed before transferring their data because
> of the hardcoded timeout of 60 secs
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-36872
>                 URL: https://issues.apache.org/jira/browse/SPARK-36872
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.1.1, 3.1.2, 3.2.0
>            Reporter: Shekhar Gupta
>            Priority: Trivial
>
> During the graceful decommissioning phase, executors need to transfer all of
> their shuffle and cache data to peer executors. However, they are killed
> before transferring all the data because of the hardcoded timeout value of
> 60 secs in the decommissioning script. As a result of executors dying
> prematurely, Spark tasks on other executors fail, causing application
> failures that are hard to debug. To work around the issue, we ended up
> writing a custom script with a different timeout and rebuilt the Spark
> image, but we would prefer an easier solution that does not require
> rebuilding the image.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
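[Editorial note: since the workaround described above is rebuilding the image with a custom script, one less invasive direction would be for the decommissioning script to read its timeout from the pod environment instead of hardcoding 60 seconds. The sketch below is hypothetical: `SPARK_DECOM_TIMEOUT` is an assumed variable name, not an existing Spark setting, and the commented-out wait line only illustrates where the configurable value would be used.]

```shell
#!/usr/bin/env sh
# Hypothetical sketch: make the decommissioning timeout configurable via an
# environment variable. SPARK_DECOM_TIMEOUT is an assumed name, not a real
# Spark setting; the current script hardcodes 60 seconds.

# Default to the existing 60-second behavior when the variable is unset.
DECOM_TIMEOUT="${SPARK_DECOM_TIMEOUT:-60}"
echo "Waiting up to ${DECOM_TIMEOUT}s for the executor to finish migrating data"

# The wait on the worker pid would then use the configurable value, e.g.:
#   timeout "${DECOM_TIMEOUT}" tail --pid="${WORKER_PID}" -f /dev/null
```

Such a variable could be set per-pod through the executor pod template, avoiding an image rebuild when the timeout needs tuning.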
[jira] [Commented] (SPARK-36872) Decommissioning executors get killed before transferring their data because of the hardcoded timeout of 60 secs
[ https://issues.apache.org/jira/browse/SPARK-36872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483281#comment-17483281 ]

Shekhar Gupta commented on SPARK-36872:
---------------------------------------

[~abhisrao] I am referring to the following script:
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/decom.sh
[jira] [Commented] (SPARK-36872) Decommissioning executors get killed before transferring their data because of the hardcoded timeout of 60 secs
[ https://issues.apache.org/jira/browse/SPARK-36872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482861#comment-17482861 ]

Abhishek Rao commented on SPARK-36872:
--------------------------------------

[~shkhrgpt], could you please share more details on which script you are referring to? We're facing similar issues and are looking for options to fix this.