[ https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432379#comment-16432379 ]
Apache Spark commented on SPARK-23943:
--------------------------------------

User 'pmackles' has created a pull request for this issue:
https://github.com/apache/spark/pull/21027

> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---------------------------------------------------------------
>
>                 Key: SPARK-23943
>                 URL: https://issues.apache.org/jira/browse/SPARK-23943
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy, Mesos
>    Affects Versions: 2.2.1, 2.3.0
>         Environment:
>            Reporter: paul mackles
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> Two changes in this PR:
>  * A /health endpoint for a quick binary indication of the health of
> MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a
> Marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html].
> Returns a 503 status if the server is unhealthy and a 200 if the server is
> healthy.
>  * A /status endpoint for a more detailed examination of the current state of
> a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool.
> For both endpoints, regardless of status code, the following body is returned:
>
> {code:java}
> {
>   "action" : "ServerStatusResponse",
>   "launchedDrivers" : 0,
>   "message" : "iamok",
>   "queuedDrivers" : 0,
>   "schedulerDriverStopped" : false,
>   "serverSparkVersion" : "2.3.1-SNAPSHOT",
>   "success" : true,
>   "pendingRetryDrivers" : 0
> }{code}
> Aside from surfacing all of the scheduler metrics, the response also includes
> the status of the Mesos SchedulerDriver. On numerous occasions now, we have
> observed scenarios where the Mesos SchedulerDriver quietly exits due to some
> other failure. When this happens, jobs queue up and the only way to clean
> things up is to restart the service.
> With the above health check, Marathon can be configured to automatically
> restart the MesosClusterDispatcher service when the health check fails,
> lessening the need for manual intervention.
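For readers who want to probe the endpoints by hand, the following is a minimal client-side sketch in Python, not code from the pull request. It assumes only what the description states: /health answers 200 when healthy and 503 when unhealthy, and the JSON body shown above is returned in both cases. The function names and the base URL argument are hypothetical.

```python
import json
import urllib.error
import urllib.request


def parse_status(status_code, body):
    """Turn a /health or /status reply into (healthy, details).

    Per the issue description, 200 means healthy and 503 means unhealthy;
    the JSON body is returned in both cases.
    """
    details = json.loads(body)
    return status_code == 200, details


def check_dispatcher(base_url, timeout=5.0):
    """Query <base_url>/health. base_url is a hypothetical dispatcher address."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_status(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # urllib raises on a 503, but the response body can still be read.
        return parse_status(err.code, err.read().decode("utf-8"))
    except urllib.error.URLError:
        # The server could not be reached at all: treat as unhealthy.
        return False, {}
```

A caller could log `details["schedulerDriverStopped"]` and `details["queuedDrivers"]` alongside the boolean result to spot the stalled-scheduler case the description mentions. The host and port to pass as `base_url` depend on the deployment and are not specified in the issue.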