[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464467#comment-16464467
]
Matt Cheah commented on SPARK-24135:
Put up the PR, see above - created a separate setting for this
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464455#comment-16464455
]
Apache Spark commented on SPARK-24135:
User 'mccheah' has created a pull request for this issue:
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462540#comment-16462540
]
Imran Rashid commented on SPARK-24135:
Honestly I don't understand the failure mode described here
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462145#comment-16462145
]
Anirudh Ramanathan commented on SPARK-24135:
cc/ [~mridulm80] [~irashid] for thoughts on
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462139#comment-16462139
]
Anirudh Ramanathan commented on SPARK-24135:
It is increasingly common for people to write
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462106#comment-16462106
]
Matt Cheah commented on SPARK-24135:
Not necessarily - if the pods fail to start up, we should retry
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461324#comment-16461324
]
Erik Erlandson commented on SPARK-24135:
> In the case of the executor failing to start at all,
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461188#comment-16461188
]
Matt Cheah commented on SPARK-24135:
> Restarting seems like it would eventually be limited by the
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461154#comment-16461154
]
Erik Erlandson commented on SPARK-24135:
IIRC the dynamic allocation heuristic was to avoid
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461024#comment-16461024
]
Matt Cheah commented on SPARK-24135:
I think we should not count these towards job failures, and that
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460559#comment-16460559
]
Anirudh Ramanathan commented on SPARK-24135:
+1 to detecting all pod error states and doing
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460066#comment-16460066
]
Yinan Li commented on SPARK-24135:
I agree that we should add detection for initialization errors. But
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460047#comment-16460047
]
Matt Cheah commented on SPARK-24135:
> But I'm not sure how much this buys us because very likely
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459954#comment-16459954
]
Erik Erlandson commented on SPARK-24135:
I think it makes sense to detect these failure states.
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459892#comment-16459892
]
Yinan Li commented on SPARK-24135:
I think it's fine detecting and deleting the executor pods that
[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459754#comment-16459754
]
Matt Cheah commented on SPARK-24135:
[~foxish] [~eje] [~liyinan926] wanted to get feedback on this -