Interesting, I just got the new cluster up. I'll re-run both these scenarios and let you know what I find.
What you're saying makes a lot of sense- many different versions of Mesos were installed on the old cluster! Hopefully that is the cause of the strangeness. Thanks again for all your help Haosdent!!! > On Oct 11, 2015, at 11:17 AM, haosdent <[email protected]> wrote: > > I guess your environment have some problems, just like your met strange > behaviour in launcher_dir. Last time I remember also met strange things in my > laptop, it is because I make install a older version Mesos while I launch a > new version Mesos. If you could `make uninstall` and fresh build through > following http://mesos.apache.org/documentation/latest/getting-started/ and > still met those problems, we could more sure it is a bug. But so far I could > not reproduce your problems after repeat tests, I still think it maybe cause > by your build environment is not clear or have error configurations. > >> On Sun, Oct 11, 2015 at 11:02 PM, Jay Taylor <[email protected]> wrote: >> I was testing on a set of VMs on a lan with ideal very low latency. >> >> >> >>> On Oct 11, 2015, at 3:30 AM, haosdent <[email protected]> wrote: >>> >>> My stdout looks like: >>> >>> ``` >>> Launching health check process: >>> /home/ld-sgdev/huangh/mesos/build/src/.libs/mesos-health-check >>> --executor=(1)@xxxx >>> --health_check_json={"command":{"shell":true,"value":"docker exec >>> mesos-90f9f9ac-e2e4-4d7d-8d07-50e37f727b91-S0.d8c73ee2-deca-46b6-ab51-d7815bded2d4 >>> sh -c \" exit 1 >>> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":20.0} >>> --task_id=test-health-check.901b938b-7002-11e5-a62b-0a0027000000 >>> Health check process launched at pid: 24720 >>> Received task health update, healthy: false >>> Received task health update, healthy: false >>> Received task health update, healthy: false >>> Killing docker task >>> Shutting down >>> ``` >>> >>> Does your network latency affect this result? >>> >>> >>>> On Sun, Oct 11, 2015 at 6:18 PM, haosdent <[email protected]> wrote: >>>> Could not reproduce your problem in my side. But I guess it maybe related >>>> to this ticket. MESOS-1613 HealthCheckTest.ConsecutiveFailures is flaky >>>> >>>>> On Fri, Oct 9, 2015 at 12:13 PM, haosdent <[email protected]> wrote: >>>>> I think it maybe because health check exit before executor receive the >>>>> TaskHealthStatus. I would try "exit 1" and give your feedback later. >>>>> >>>>>> On Fri, Oct 9, 2015 at 11:30 AM, Jay Taylor <[email protected]> wrote: >>>>>> Following up on this: >>>>>> >>>>>> This problem is reproducible when the command is "exit 1". >>>>>> >>>>>> Once I set it to a real curl cmd the intermittent failures stopped and >>>>>> health checks worked as advertised. >>>>>> >>>>>> >>>>>>> On Oct 8, 2015, at 12:45 PM, Jay Taylor <[email protected]> wrote: >>>>>>> >>>>>>> Using the health-check following parameters: >>>>>>> >>>>>>> cmd="exit 1" >>>>>>> delay=5.0 >>>>>>> grace-period=10.0 >>>>>>> interval=10.0 >>>>>>> timeout=10.0 >>>>>>> consecutiveFailures=3 >>>>>>> >>>>>>> Sometimes the tasks are successfully identified as failing and >>>>>>> restarted, however other times the health-check command exits yet the >>>>>>> task is left in a running state and the failure is ignored. >>>>>>> >>>>>>> Sample of failed Mesos task log: >>>>>>> >>>>>>> STDOUT: >>>>>>> >>>>>>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" >>>>>>>> --docker="docker" --docker_socket="/var/run/docker.sock" >>>>>>>> --help="false" --initialize_driver_logging="true" --logbufsecs="0" >>>>>>>> --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" >>>>>>>> --quiet="false" >>>>>>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" >>>>>>>> --stop_timeout="0ns" >>>>>>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" >>>>>>>> --docker="docker" --docker_socket="/var/run/docker.sock" >>>>>>>> --help="false" --initialize_driver_logging="true" --logbufsecs="0" >>>>>>>> --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" >>>>>>>> --quiet="false" >>>>>>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" >>>>>>>> --stop_timeout="0ns" >>>>>>>> Registered docker executor on mesos-worker2a >>>>>>>> Starting task hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631 >>>>>>>> Launching health check process: /usr/libexec/mesos/mesos-health-check >>>>>>>> --executor=(1)@192.168.225.59:38776 >>>>>>>> --health_check_json={"command":{"shell":true,"value":"docker exec >>>>>>>> mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662 >>>>>>>> sh -c \" exit 1 >>>>>>>> \""},"consecutive_failures":3,"delay_seconds":5.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":10.0} >>>>>>>> --task_id=hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631 >>>>>>>> Health check process launched at pid: 7525 >>>>>>>> Received task health update, healthy: false >>>>>>>> Received task health update, healthy: false >>>>>>> >>>>>>> >>>>>>> STDERR: >>>>>>> >>>>>>>> I1008 19:30:02.569856 7408 exec.cpp:134] Version: 0.26.0 >>>>>>>> I1008 19:30:02.571815 7411 exec.cpp:208] Executor registered on slave >>>>>>>> 61373c0e-7349-4173-ab8d-9d7b260e8a30-S1 >>>>>>>> WARNING: Your kernel does not support swap limit capabilities, memory >>>>>>>> limited without swap. >>>>>>>> WARNING: Logging before InitGoogleLogging() is written to STDERR >>>>>>>> I1008 19:30:08.527354 7533 main.cpp:100] Ignoring failure as health >>>>>>>> check still in grace period >>>>>>>> W1008 19:30:38.912325 7525 main.cpp:375] Health check failed Health >>>>>>>> command check exited with status 1 >>>>>>> >>>>>>> >>>>>>> Screenshot of the task still running despite health-check exited with >>>>>>> status code 1: >>>>>>> >>>>>>> http://i.imgur.com/zx9GQuo.png >>>>>>> >>>>>>> The expected behavior when the health-check binary has exited w/ >>>>>>> non-zero status is that the task would be killed and restarted (rather >>>>>>> than continuing to run as outlined above). >>>>>>> >>>>>>> ----- >>>>>>> Additional note: After hard-coding the "path" string of the >>>>>>> health-check binary parent dir into b/src/docker/executor.cpp, I am >>>>>>> able to at least test the functionality. The other issue of >>>>>>> health-checks for docker tasks failing to start is still unresolved due >>>>>>> to the unpropagated MESOS_LAUNCH_DIR issue. >>>>> >>>>> >>>>> >>>>> -- >>>>> Best Regards, >>>>> Haosdent Huang >>>> >>>> >>>> >>>> -- >>>> Best Regards, >>>> Haosdent Huang >>> >>> >>> >>> -- >>> Best Regards, >>> Haosdent Huang > > > > -- > Best Regards, > Haosdent Huang

