Some frameworks, such as Aurora, use custom executors to run health
checks alongside the tasks. This allows a task to survive a network
partition without the scheduler setting it to TASK_LOST.

Marathon uses mesos-health-check for command-based health checks, but runs
TCP and HTTP health checks from the elected scheduler (Marathon issue
#3728). On a partition event, it sets those tasks to TASK_LOST, causing the
master to kill them when the partition heals. It also means the scheduler
gets bogged down when you have many tasks with many health checks defined.

Could this feature get a shepherd? It would be useful for making Mesos
tasks more resilient in general. There is an open review from Haosdent to
fix it.

Thanks!


-- 
Text by Jeff, typos by iPhone
