Some frameworks, such as Aurora, use custom executors to distribute the health checks with the tasks, so the checks run next to the task rather than from the scheduler. This allows a task to survive a network partition without the scheduler setting it to TASK_LOST.
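For illustration only, an executor-side TCP check might look roughly like the sketch below. This is not Aurora's or Mesos's actual code; `tcp_check`, `HealthState`, and the failure threshold are hypothetical names and values made up for the example. The point is just that the check runs on the same host as the task, so a partition between the scheduler and the agent does not make the check fail.

```python
import socket
from dataclasses import dataclass


@dataclass
class HealthState:
    """Hypothetical tracker for consecutive failures of one task's check."""
    max_consecutive_failures: int = 3  # assumed threshold, not a Mesos default
    failures: int = 0

    def record(self, ok: bool) -> bool:
        """Record one check result; return True while the task counts as healthy."""
        self.failures = 0 if ok else self.failures + 1
        return self.failures < self.max_consecutive_failures


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

An executor built this way would call `tcp_check` against the task's local port on a timer, feed each result into `HealthState.record`, and only report the task as unhealthy once the threshold is reached.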
Marathon uses mesos-health-check for command-based health checks, but runs TCP and HTTP health checks from the elected scheduler (Marathon issue #3728). On a partition event, it sets those tasks to TASK_LOST, causing the master to kill them when the partition heals. It also means the scheduler gets bogged down when you have many tasks with many health checks defined. Can this feature get a shepherd? It would be useful for making Mesos tasks more resilient in general. There is an open review from Haosdent for fixing it. Thanks! -- Text by Jeff, typos by iPhone

