Tasks with failed health-checks intermittently not restarted

Jay Taylor Thu, 08 Oct 2015 12:47:05 -0700

Using the following health-check parameters:

cmd="exit 1"
delay=5.0
grace-period=10.0
interval=10.0
timeout=10.0
consecutiveFailures=3
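
As a rough sketch of what these parameters imply for a check that always
fails (as "exit 1" does), the model below estimates when the kill would be
expected. It is a simplification, not the Mesos implementation: it assumes
the first check runs after the delay, each check returns instantly, checks
repeat every interval, and failures inside the grace period are ignored
rather than counted. The function name expected_kill_time is hypothetical.

```python
def expected_kill_time(delay, grace_period, interval, consecutive_failures):
    """Seconds after the health checker starts at which an always-failing
    check reaches the consecutive-failure limit (simplified model)."""
    t = delay          # assumption: first check runs after the initial delay
    failures = 0
    while True:
        # Assumption: a failure inside the grace period is ignored, not counted.
        if t >= grace_period:
            failures += 1
            if failures >= consecutive_failures:
                return t
        t += interval  # assumption: checks complete instantly


# Parameters from this report.
print(expected_kill_time(5.0, 10.0, 10.0, 3))  # 35.0
```

Under these assumptions the task should be killed roughly 35 seconds after
the health checker starts.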


Sometimes the tasks are correctly identified as failing and restarted.
Other times, however, the health-check command exits with a failure, yet
the task is left in a running state and the failure is ignored.

Sample of failed Mesos task log:

STDOUT:

--container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
> --stop_timeout="0ns"
> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
> --stop_timeout="0ns"
> Registered docker executor on mesos-worker2a
> Starting task hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
> Launching health check process: /usr/libexec/mesos/mesos-health-check
> --executor=(1)@192.168.225.59:38776
> --health_check_json={"command":{"shell":true,"value":"docker exec
> mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662
> sh -c \" exit 1
> \""},"consecutive_failures":3,"delay_seconds":5.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":10.0}
> --task_id=hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>
> *Health check process launched at pid: 7525*
> *Received task health update, healthy: false*
> *Received task health update, healthy: false*



STDERR:

I1008 19:30:02.569856  7408 exec.cpp:134] Version: 0.26.0
> I1008 19:30:02.571815  7411 exec.cpp:208] Executor registered on slave
> 61373c0e-7349-4173-ab8d-9d7b260e8a30-S1
> WARNING: Your kernel does not support swap limit capabilities, memory
> limited without swap.
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I1008 19:30:08.527354  7533 main.cpp:100] Ignoring failure as health check
> still in grace period
> *W1008 19:30:38.912325  7525 main.cpp:375] Health check failed Health
> command check exited with status 1*
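
To put the STDERR timestamps in context, here is a small sketch that
computes the time between the executor's first log line and the final
"Health check failed" warning. It assumes the standard glog line prefix
(severity letter, MMDD, then HH:MM:SS.microseconds) and that both lines are
from the same day. The helper names glog_time and elapsed_seconds are
hypothetical, and the quoted lines are passed in without the "> " and "*"
mail markers.

```python
import re
from datetime import datetime

# glog prefix: severity letter, MMDD, then HH:MM:SS.microseconds
GLOG_PREFIX = re.compile(r"^[IWEF]\d{4} (\d{2}:\d{2}:\d{2}\.\d{6})")


def glog_time(line):
    """Extract the time of day from a glog-formatted log line."""
    match = GLOG_PREFIX.match(line)
    if match is None:
        raise ValueError("not a glog-formatted line: %r" % line)
    return datetime.strptime(match.group(1), "%H:%M:%S.%f")


def elapsed_seconds(first_line, last_line):
    """Seconds between two log lines (assumes both are from the same day)."""
    return (glog_time(last_line) - glog_time(first_line)).total_seconds()


start = "I1008 19:30:02.569856  7408 exec.cpp:134] Version: 0.26.0"
failed = "W1008 19:30:38.912325  7525 main.cpp:375] Health check failed"
print(round(elapsed_seconds(start, failed), 1))  # 36.3
```

The final failure is logged about 36 seconds after the executor starts.
With delay=5.0 and interval=10.0 that is roughly when a third counted
failure would be expected, which suggests (but does not prove) that the
health checker itself ran as configured and the problem is in acting on
its result.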


Screenshot of the task still running even though the health check exited
with status code 1:

http://i.imgur.com/zx9GQuo.png

The expected behavior when the health-check binary exits with a non-zero
status is for the task to be killed and restarted (rather than continuing
to run as outlined above).

-----
Additional note: After hard-coding the "path" string of the health-check
binary's parent directory into b/src/docker/executor.cpp, I am able to at
least test the functionality.  The other issue, health checks for Docker
tasks failing to start, is still unresolved because MESOS_LAUNCH_DIR is
not propagated.
