We had a few Mesos agents stuck in an unrecoverable state after a transient ZK init error. Is this a known problem? I wasn't able to find an existing JIRA item for it. We are on 0.24.1 at this time.
Most agents were fine, but a handful ended up with their mesos-slave process restarting in a loop; the restarts were continuous because our existing service keep-alive strategy relaunches the process whenever it exits. The .INFO log file below shows what was logged before the process exited, with no error messages. To fix it, we manually stopped the service, removed the data in the working dir, and then restarted it, after which the mesos-slave process came back up (a rough sketch of those commands is included after the log excerpts below). The manual intervention needed to resolve this is the problematic part.

Here are the contents of the various log files on the agent.

The .INFO log file for one of the restarts, before the mesos-slave process exited with no other error messages:

Log file created at: 2016/02/09 02:12:48
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>
I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master (UPID= [email protected]:7103) is detected

The .FATAL log file from when the original transient ZK error occurred:

Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]

The .ERROR log file:

Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]

The .WARNING file had the same content.
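
For reference, the manual workaround on an affected agent was roughly the following. This is only a sketch: the service-manager command depends on how mesos-slave is supervised at a given site (systemd is assumed here), and the work_dir path /mnt/data/mesos is assumed from the "Recovering state from" line in the .INFO log above.

    # stop the supervised mesos-slave process so the keep-alive strategy stops relaunching it
    sudo systemctl stop mesos-slave
    # remove the data in the agent work_dir (in practice the checkpointed state under meta/ is what blocks recovery)
    sudo rm -rf /mnt/data/mesos/*
    # restart the service; the agent then comes up cleanly
    sudo systemctl start mesos-slave

Note that wiping the work_dir discards the agent's checkpointed state, so the agent registers as a new slave and anything that was running on it is lost, which is why needing this intervention at all is the real problem.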

