On 9 February 2016 at 11:04, Sharma Podila <[email protected]> wrote:
> We had a few mesos agents stuck in an unrecoverable state after a
> transient ZK init error. Is this a known problem? I wasn't able to find
> an existing JIRA item for this. We are on 0.24.1 at this time.
>
> Most agents were fine, except a handful. This handful of agents had
> their mesos-slave process constantly restarting. The .INFO logfile had
> the contents below before the process exited, with no error messages.
> The restarts were happening constantly due to an existing service
> keep-alive strategy.
>
> To fix it, we manually stopped the service, removed the data in the
> working dir, and then restarted it. The mesos-slave process was then
> able to restart. The manual intervention needed to resolve this is
> problematic.
>
> Here are the contents of the various log files on the agent.
>
> The .INFO logfile for one of the restarts before the mesos-slave
> process exited with no other error messages:
>
> Log file created at: 2016/02/09 02:12:48
> Running on machine: titusagent-main-i-7697a9c5
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>"
> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master (UPID=[email protected]:7103) is detected
>
> The .FATAL log file when the original transient ZK error occurred:
>
> Log file created at: 2016/02/05 17:21:37
> Running on machine: titusagent-main-i-7697a9c5
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>
> The .ERROR log file:
>
> Log file created at: 2016/02/05 17:21:37
> Running on machine: titusagent-main-i-7697a9c5
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>
> The .WARNING file had the same content.

Maybe related: https://issues.apache.org/jira/browse/MESOS-1326

-rgs
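For reference, the manual recovery described in the thread (stop the service, clear the agent work dir, restart) can be scripted so the keep-alive loop does not just thrash. Below is a minimal sketch, not a tested tool: the service name, the glog .FATAL symlink path, and the work dir /mnt/data/mesos (inferred from the "Recovering state from '/mnt/data/mesos/meta'" log line) are all assumptions, and wiping the work dir discards the agent's checkpointed task state.

    #!/usr/bin/env python
    # Hypothetical recovery sketch for the crash loop described above.
    # SERVICE, FATAL_LOG, and WORK_DIR are assumptions, not values taken
    # from the thread; adjust for your deployment.
    import os
    import shutil
    import subprocess

    SERVICE = "mesos-slave"                         # assumed service name
    WORK_DIR = "/mnt/data/mesos"                    # assumed --work_dir
    FATAL_LOG = "/var/log/mesos/mesos-slave.FATAL"  # assumed glog symlink

    def zk_init_failed():
        # True if the most recent fatal crash was the zookeeper_init
        # failure shown in the .FATAL log above.
        if not os.path.exists(FATAL_LOG):
            return False
        with open(FATAL_LOG) as f:
            return "Failed to create ZooKeeper, zookeeper_init" in f.read()

    def recover():
        # Mirrors the manual steps from the thread: stop the service,
        # clear the work dir (discards checkpointed task state!), restart.
        subprocess.check_call(["service", SERVICE, "stop"])
        for name in os.listdir(WORK_DIR):
            path = os.path.join(WORK_DIR, name)
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
        subprocess.check_call(["service", SERVICE, "start"])

    if __name__ == "__main__":
        if zk_init_failed():
            recover()

On an agent already crash-looping on zookeeper_init the checkpointed state is effectively lost anyway, which is the trade-off the original poster accepted; a more conservative variant could move the work dir aside (e.g. rename it with a timestamp) instead of deleting it.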

