[jira] [Commented] (SPARK-17519) [MESOS] Enhance robustness when ExternalShuffleService is broken

Igor Berman (JIRA) Thu, 12 Apr 2018 03:26:38 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435305#comment-16435305
 ]


Igor Berman commented on SPARK-17519:
-------------------------------------

Additional use case: when the driver can't connect to one of the external 
shuffle services, it aborts on Mesos but continues to be alive, so from the 
Mesos perspective the framework enters inactive mode (which requires a manual 
restart).

> [MESOS] Enhance robustness when ExternalShuffleService is broken
> ----------------------------------------------------------------
>
>                 Key: SPARK-17519
>                 URL: https://issues.apache.org/jira/browse/SPARK-17519
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos
>    Affects Versions: 2.0.0
>            Reporter: Sun Rui
>            Priority: Major
>
> This is intended to complement SPARK-17370, which addressed Standalone 
> mode only.
> For Mesos, it seems we could enhance MesosExternalShuffleClient to detect, 
> when sending heartbeats, whether any of the external shuffle services is 
> lost. In such a case, the MesosCoarseGrainedSchedulerBackend can notify 
> ExecutorLost with workerlost=true. It can also add the slave where the 
> external shuffle service runs to the blacklist, preventing further tasks 
> from being launched on it.
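
The detection and blacklisting steps proposed in the description can be 
sketched roughly as below. This is only an illustration under stated 
assumptions, not Spark code: ShuffleServiceMonitor, recordAck, detectLost and 
isBlacklisted are hypothetical names, the timeout value is arbitrary, and the 
clock is passed in explicitly to keep the example deterministic. In a real 
change, the hosts returned by detectLost would be the ones the scheduler 
backend reports as lost.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical helper (not part of Spark): tracks the last acknowledged
 * heartbeat per external shuffle service host and blacklists hosts whose
 * service has stopped responding.
 */
public class ShuffleServiceMonitor {
    private final long timeoutMs;
    private final Map<String, Long> lastAckMs = new HashMap<>();
    private final Set<String> blacklist = new HashSet<>();

    public ShuffleServiceMonitor(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Record that the shuffle service on host acknowledged a heartbeat. */
    public void recordAck(String host, long nowMs) {
        // Acks from an already blacklisted host are ignored in this sketch.
        if (!blacklist.contains(host)) {
            lastAckMs.put(host, nowMs);
        }
    }

    /**
     * Return the hosts whose last ack is older than the timeout, and move
     * them to the blacklist so no further tasks are placed there.
     */
    public List<String> detectLost(long nowMs) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastAckMs.entrySet()) {
            if (nowMs - entry.getValue() > timeoutMs) {
                lost.add(entry.getKey());
            }
        }
        for (String host : lost) {
            lastAckMs.remove(host);
            blacklist.add(host);
        }
        Collections.sort(lost); // stable order for callers and logs
        return lost;
    }

    public boolean isBlacklisted(String host) {
        return blacklist.contains(host);
    }
}
```

With a 1000 ms timeout, a host that last acknowledged at t=0 is reported by 
detectLost(1500) and stays blacklisted afterwards, while a host that 
acknowledged at t=900 is left alone.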



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
