I was testing on a set of VMs on a lan with ideal very low latency.
> On Oct 11, 2015, at 3:30 AM, haosdent <[email protected]> wrote: > > My stdout looks like: > > ``` > Launching health check process: > /home/ld-sgdev/huangh/mesos/build/src/.libs/mesos-health-check > --executor=(1)@xxxx > --health_check_json={"command":{"shell":true,"value":"docker exec > mesos-90f9f9ac-e2e4-4d7d-8d07-50e37f727b91-S0.d8c73ee2-deca-46b6-ab51-d7815bded2d4 > sh -c \" exit 1 > \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":20.0} > --task_id=test-health-check.901b938b-7002-11e5-a62b-0a0027000000 > Health check process launched at pid: 24720 > Received task health update, healthy: false > Received task health update, healthy: false > Received task health update, healthy: false > Killing docker task > Shutting down > ``` > > Does your network latency affect this result? > > >> On Sun, Oct 11, 2015 at 6:18 PM, haosdent <[email protected]> wrote: >> Could not reproduce your problem in my side. But I guess it maybe related to >> this ticket. MESOS-1613 HealthCheckTest.ConsecutiveFailures is flaky >> >>> On Fri, Oct 9, 2015 at 12:13 PM, haosdent <[email protected]> wrote: >>> I think it maybe because health check exit before executor receive the >>> TaskHealthStatus. I would try "exit 1" and give your feedback later. >>> >>>> On Fri, Oct 9, 2015 at 11:30 AM, Jay Taylor <[email protected]> wrote: >>>> Following up on this: >>>> >>>> This problem is reproducible when the command is "exit 1". >>>> >>>> Once I set it to a real curl cmd the intermittent failures stopped and >>>> health checks worked as advertised. >>>> >>>> >>>>> On Oct 8, 2015, at 12:45 PM, Jay Taylor <[email protected]> wrote: >>>>> >>>>> Using the health-check following parameters: >>>>> >>>>> cmd="exit 1" >>>>> delay=5.0 >>>>> grace-period=10.0 >>>>> interval=10.0 >>>>> timeout=10.0 >>>>> consecutiveFailures=3 >>>>> >>>>> Sometimes the tasks are successfully identified as failing and restarted, >>>>> however other times the health-check command exits yet the task is left >>>>> in a running state and the failure is ignored. >>>>> >>>>> Sample of failed Mesos task log: >>>>> >>>>> STDOUT: >>>>> >>>>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" >>>>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" >>>>>> --initialize_driver_logging="true" --logbufsecs="0" >>>>>> --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" >>>>>> --quiet="false" >>>>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" >>>>>> --stop_timeout="0ns" >>>>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" >>>>>> --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" >>>>>> --initialize_driver_logging="true" --logbufsecs="0" >>>>>> --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" >>>>>> --quiet="false" >>>>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" >>>>>> --stop_timeout="0ns" >>>>>> Registered docker executor on mesos-worker2a >>>>>> Starting task hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631 >>>>>> Launching health check process: /usr/libexec/mesos/mesos-health-check >>>>>> --executor=(1)@192.168.225.59:38776 >>>>>> --health_check_json={"command":{"shell":true,"value":"docker exec >>>>>> mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662 >>>>>> sh -c \" exit 1 >>>>>> \""},"consecutive_failures":3,"delay_seconds":5.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":10.0} >>>>>> --task_id=hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631 >>>>>> Health check process launched at pid: 7525 >>>>>> Received task health update, healthy: false >>>>>> Received task health update, healthy: false >>>>> >>>>> >>>>> STDERR: >>>>> >>>>>> I1008 19:30:02.569856 7408 exec.cpp:134] Version: 0.26.0 >>>>>> I1008 19:30:02.571815 7411 exec.cpp:208] Executor registered on slave >>>>>> 61373c0e-7349-4173-ab8d-9d7b260e8a30-S1 >>>>>> WARNING: Your kernel does not support swap limit capabilities, memory >>>>>> limited without swap. >>>>>> WARNING: Logging before InitGoogleLogging() is written to STDERR >>>>>> I1008 19:30:08.527354 7533 main.cpp:100] Ignoring failure as health >>>>>> check still in grace period >>>>>> W1008 19:30:38.912325 7525 main.cpp:375] Health check failed Health >>>>>> command check exited with status 1 >>>>> >>>>> >>>>> Screenshot of the task still running despite health-check exited with >>>>> status code 1: >>>>> >>>>> http://i.imgur.com/zx9GQuo.png >>>>> >>>>> The expected behavior when the health-check binary has exited w/ non-zero >>>>> status is that the task would be killed and restarted (rather than >>>>> continuing to run as outlined above). >>>>> >>>>> ----- >>>>> Additional note: After hard-coding the "path" string of the health-check >>>>> binary parent dir into b/src/docker/executor.cpp, I am able to at least >>>>> test the functionality. The other issue of health-checks for docker >>>>> tasks failing to start is still unresolved due to the unpropagated >>>>> MESOS_LAUNCH_DIR issue. >>> >>> >>> >>> -- >>> Best Regards, >>> Haosdent Huang >> >> >> >> -- >> Best Regards, >> Haosdent Huang > > > > -- > Best Regards, > Haosdent Huang

