[ https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432379#comment-16432379 ]

Apache Spark commented on SPARK-23943:
--------------------------------------

User 'pmackles' has created a pull request for this issue:
https://github.com/apache/spark/pull/21027

> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---------------------------------------------------------------
>
>                 Key: SPARK-23943
>                 URL: https://issues.apache.org/jira/browse/SPARK-23943
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy, Mesos
>    Affects Versions: 2.2.1, 2.3.0
>         Environment:  
>  
>            Reporter: paul mackles
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> Two changes in this PR:
>  * A /health endpoint that gives a quick binary indication of the health of 
> MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a 
> Marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. 
> Returns a 503 status if the server is unhealthy and a 200 if it is healthy.
>  * A /status endpoint that gives a more detailed view of the current state of 
> a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool.
> For both endpoints, regardless of status code, the following body is returned:
>  
> {code:java}
> {
>   "action" : "ServerStatusResponse",
>   "launchedDrivers" : 0,
>   "message" : "iamok",
>   "queuedDrivers" : 0,
>   "schedulerDriverStopped" : false,
>   "serverSparkVersion" : "2.3.1-SNAPSHOT",
>   "success" : true,
>   "pendingRetryDrivers" : 0
> }{code}
> Aside from surfacing all of the scheduler metrics, the response also includes 
> the status of the Mesos SchedulerDriver. On numerous occasions, we have 
> observed the Mesos SchedulerDriver quietly exit due to some other failure. 
> When this happens, jobs queue up and the only way to recover is to restart 
> the service.
> With the above health check, Marathon can be configured to automatically 
> restart the MesosClusterDispatcher service when the health check fails, 
> lessening the need for manual intervention.
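As a rough illustration of that setup, a Marathon app definition could point its health check at the new /health endpoint along these lines (the app id, port index, and timings below are hypothetical; the field names follow the Marathon health-check docs linked above):

```json
{
  "id": "/spark/mesos-cluster-dispatcher",
  "healthChecks": [
    {
      "protocol": "HTTP",
      "path": "/health",
      "portIndex": 0,
      "gracePeriodSeconds": 60,
      "intervalSeconds": 30,
      "maxConsecutiveFailures": 3
    }
  ]
}
```

With this in place, Marathon restarts the task after three consecutive failed probes, matching the recovery behavior described in the issue.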
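The response body shown above can also be consumed by other monitoring clients. The sketch below parses the sample body from the issue description; the specific health rule (require a 200 status, a successful response, and a running SchedulerDriver) is an assumption for illustration, not necessarily the PR's server-side logic:

```python
import json

# Sample /status (and /health) response body from the issue description.
SAMPLE_BODY = """{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "iamok",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true,
  "pendingRetryDrivers" : 0
}"""

def is_healthy(status_code: int, body: str) -> bool:
    """Mirror the endpoint contract: 200 means healthy, 503 unhealthy.

    As a cross-check, also require that the request succeeded and that
    the Mesos SchedulerDriver has not stopped -- the silent-exit failure
    mode described in the issue.
    """
    if status_code != 200:
        return False
    resp = json.loads(body)
    return resp.get("success", False) and not resp.get("schedulerDriverStopped", True)

print(is_healthy(200, SAMPLE_BODY))  # True: body reports success and a running driver
print(is_healthy(503, SAMPLE_BODY))  # False: the unhealthy status code wins
```

Checking `schedulerDriverStopped` in addition to the status code lets a client distinguish "server up but driver dead" from an ordinary outage.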



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
