We had a few Mesos agents stuck in an unrecoverable state after a transient ZK init error. Is this a known problem? I wasn't able to find an existing JIRA item for it. We are on 0.24.1 at this time.
Most agents were fine, but a handful ended up with their mesos-slave process restarting in a loop; the restarts were continuous because our existing service keep-alive strategy relaunches the process whenever it exits. The .INFO log file below shows what was logged before the process exited, with no error messages. To fix it, we manually stopped the service, removed the data in the working dir, and then restarted it, after which the mesos-slave process came back up (a rough sketch of those commands is included after the log excerpts below). The manual intervention needed to resolve this is the problematic part.

Here are the contents of the various log files on the agent.

The .INFO log file for one of the restarts, before the mesos-slave process exited with no other error messages:

Log file created at: 2016/02/09 02:12:48
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>
I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master (UPID= [email protected]:7103) is detected

The .FATAL log file from when the original transient ZK error occurred:

Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]

The .ERROR log file:

Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]

The .WARNING file had the same content.
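
For reference, the manual workaround on an affected agent was roughly the following. This is only a sketch: the service-manager command depends on how mesos-slave is supervised at a given site (systemd is assumed here), and the work_dir path /mnt/data/mesos is assumed from the "Recovering state from" line in the .INFO log above.

    # stop the supervised mesos-slave process so the keep-alive strategy stops relaunching it
    sudo systemctl stop mesos-slave
    # remove the data in the agent work_dir (in practice the checkpointed state under meta/ is what blocks recovery)
    sudo rm -rf /mnt/data/mesos/*
    # restart the service; the agent then comes up cleanly
    sudo systemctl start mesos-slave

Note that wiping the work_dir discards the agent's checkpointed state, so the agent registers as a new slave and anything that was running on it is lost, which is why needing this intervention at all is the real problem.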

