I could not reproduce your problem on my side, but I guess it may be related to this ticket: MESOS-1613 <https://issues.apache.org/jira/browse/MESOS-1613> HealthCheckTest.ConsecutiveFailures is flaky
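
A rough sketch of the timing implied by the health-check parameters quoted below (delay=5.0, grace-period=10.0, interval=10.0, consecutiveFailures=3). This is not the mesos-health-check implementation; the function name `first_kill_time` is hypothetical, and it assumes that the grace period is measured from launch and that failures inside it are ignored rather than counted.

```python
def first_kill_time(results, delay, grace_period, interval, consecutive_failures):
    """Return the time (seconds after launch) of the check that reaches the
    consecutive-failure threshold, or None if the threshold is never reached.

    `results` holds one boolean per check, True meaning healthy.
    Assumption: check i runs at delay + i * interval.
    """
    failures = 0
    for i, healthy in enumerate(results):
        t = delay + i * interval
        if healthy:
            failures = 0  # a healthy result resets the streak
            continue
        if t < grace_period:
            continue  # assumed: failures in the grace period are ignored
        failures += 1
        if failures >= consecutive_failures:
            return t
    return None


# A command that always fails, such as "exit 1", with the quoted parameters.
print(first_kill_time([False] * 10, 5.0, 10.0, 10.0, 3))  # 35.0
```

Under these assumptions the third counted failure lands about 35 seconds after launch, which is roughly where the "Health check failed" warning appears in the quoted STDERR. Whether the executor then acts on that failure is the open question in this thread.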
On Fri, Oct 9, 2015 at 12:13 PM, haosdent <[email protected]> wrote:

> I think it may be because the health check exits before the executor receives
> the TaskHealthStatus. I will try "exit 1" and give you feedback later.
>
> On Fri, Oct 9, 2015 at 11:30 AM, Jay Taylor <[email protected]> wrote:
>
>> Following up on this:
>>
>> This problem is reproducible when the command is "exit 1".
>>
>> Once I set it to a real curl cmd the intermittent failures stopped and
>> health checks worked as advertised.
>>
>>
>> On Oct 8, 2015, at 12:45 PM, Jay Taylor <[email protected]> wrote:
>>
>> Using the following health-check parameters:
>>
>> cmd="exit 1"
>> delay=5.0
>> grace-period=10.0
>> interval=10.0
>> timeout=10.0
>> consecutiveFailures=3
>>
>> Sometimes the tasks are successfully identified as failing and restarted,
>> however other times the health-check command exits yet the task is left in
>> a running state and the failure is ignored.
>>
>> Sample of failed Mesos task log:
>>
>> STDOUT:
>>
>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --stop_timeout="0ns"
>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>> --stop_timeout="0ns"
>>> Registered docker executor on mesos-worker2a
>>> Starting task hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>> Launching health check process: /usr/libexec/mesos/mesos-health-check
>>> --executor=(1)@192.168.225.59:38776
>>> --health_check_json={"command":{"shell":true,"value":"docker exec
>>> mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662
>>> sh -c \" exit 1
>>> \""},"consecutive_failures":3,"delay_seconds":5.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":10.0}
>>> --task_id=hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>>
>>> *Health check process launched at pid: 7525*
>>> *Received task health update, healthy: false*
>>> *Received task health update, healthy: false*
>>
>>
>>
>> STDERR:
>>
>> I1008 19:30:02.569856 7408 exec.cpp:134] Version: 0.26.0
>>> I1008 19:30:02.571815 7411 exec.cpp:208] Executor registered on slave
>>> 61373c0e-7349-4173-ab8d-9d7b260e8a30-S1
>>> WARNING: Your kernel does not support swap limit capabilities, memory
>>> limited without swap.
>>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>>> I1008 19:30:08.527354 7533 main.cpp:100] Ignoring failure as health
>>> check still in grace period
>>> *W1008 19:30:38.912325 7525 main.cpp:375] Health check failed Health
>>> command check exited with status 1*
>>
>>
>> Screenshot of the task still running despite the health check having
>> exited with status code 1:
>>
>> http://i.imgur.com/zx9GQuo.png
>>
>> The expected behavior when the health-check binary has exited w/ non-zero
>> status is that the task would be killed and restarted (rather than
>> continuing to run as outlined above).
>>
>> -----
>> Additional note: After hard-coding the "path" string of the health-check
>> binary parent dir into b/src/docker/executor.cpp, I am able to at least
>> test the functionality. The other issue of health-checks for docker tasks
>> failing to start is still unresolved due to the unpropagated
>> MESOS_LAUNCH_DIR issue.
>>
>>
>
>
> --
> Best Regards,
> Haosdent Huang
>

--
Best Regards,
Haosdent Huang

