[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-04 Thread Matt Cheah (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464467#comment-16464467 ] Matt Cheah commented on SPARK-24135: Put up the PR, see above - created a separate setting for this

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-04 Thread Apache Spark (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464455#comment-16464455 ] Apache Spark commented on SPARK-24135: User 'mccheah' has created a pull request for this issue:

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-03 Thread Imran Rashid (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462540#comment-16462540 ] Imran Rashid commented on SPARK-24135: Honestly I don't understand the failure mode described here

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-03 Thread Anirudh Ramanathan (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462145#comment-16462145 ] Anirudh Ramanathan commented on SPARK-24135: cc/ [~mridulm80] [~irashid] for thoughts on

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-03 Thread Anirudh Ramanathan (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462139#comment-16462139 ] Anirudh Ramanathan commented on SPARK-24135: It is increasingly common for people to write

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-03 Thread Matt Cheah (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462106#comment-16462106 ] Matt Cheah commented on SPARK-24135: Not necessarily - if the pods fail to start up, we should retry

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-02 Thread Erik Erlandson (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461324#comment-16461324 ] Erik Erlandson commented on SPARK-24135: > In the case of the executor failing to start at all,

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-02 Thread Matt Cheah (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461188#comment-16461188 ] Matt Cheah commented on SPARK-24135: > Restarting seems like it would eventually be limited by the

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-02 Thread Erik Erlandson (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461154#comment-16461154 ] Erik Erlandson commented on SPARK-24135: IIRC the dynamic allocation heuristic was to avoid

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-02 Thread Matt Cheah (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461024#comment-16461024 ] Matt Cheah commented on SPARK-24135: I think we should not count these towards job failures, and that

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-02 Thread Anirudh Ramanathan (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460559#comment-16460559 ] Anirudh Ramanathan commented on SPARK-24135: +1 to detecting all pod error states and doing

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-01 Thread Yinan Li (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460066#comment-16460066 ] Yinan Li commented on SPARK-24135: I agree that we should add detection for initialization errors. But

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-01 Thread Matt Cheah (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460047#comment-16460047 ] Matt Cheah commented on SPARK-24135: > But I'm not sure how much this buys us because very likely

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-01 Thread Erik Erlandson (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459954#comment-16459954 ] Erik Erlandson commented on SPARK-24135: I think it makes sense to detect these failure states. 

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-01 Thread Yinan Li (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459892#comment-16459892 ] Yinan Li commented on SPARK-24135: I think it's fine detecting and deleting the executor pods that

[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-01 Thread Matt Cheah (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459754#comment-16459754 ] Matt Cheah commented on SPARK-24135: [~foxish] [~eje] [~liyinan926] wanted to get feedback on this -