We had this issue happen again and were able to debug it further. The reason the agent cannot restart is that one of its resources (disk) changed its total size since the last restart. However, this error does not show up in the INFO/WARN/ERROR log files; we only saw it on stdout when manually restarting the agent. It would be good to have all messages going to stdout/stderr also show up in the logs. Is there a config setting for this that I missed?
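To make the drift concrete, here is a minimal sketch of sampling the total size with the same `statvfs()` call Andrew refers to below, plus the "whole GBs" truncation we are considering as a workaround. This is only an illustration, not the agent's actual code: the path is a placeholder for our agent work dir, and the GB granularity is arbitrary.

```cpp
// Minimal sketch: sample the filesystem's total size the way a
// statvfs()-based check would, then truncate to whole GBs. Illustrative
// only; the path and granularity are placeholders, not Mesos code.
#include <sys/statvfs.h>

#include <cstdint>
#include <cstdio>
#include <iostream>

int main() {
  struct statvfs buf;

  // Placeholder path: the filesystem backing the agent's work dir.
  if (statvfs("/mnt/data/mesos", &buf) != 0) {
    std::perror("statvfs");
    return 1;
  }

  // Total size in bytes as derived from statvfs(); on ZFS this is the
  // figure we see drifting by a few bytes between agent restarts.
  const uint64_t totalBytes =
      static_cast<uint64_t>(buf.f_blocks) * buf.f_frsize;

  // Workaround idea: truncate to whole gigabytes so a ~10 byte drift out
  // of ~600 GB never changes the reported value.
  const uint64_t GB = 1024ULL * 1024 * 1024;

  std::cout << "total bytes: " << totalBytes << std::endl;
  std::cout << "whole GBs:   " << (totalBytes / GB) << std::endl;

  return 0;
}
```

If we go that route, we would presumably feed the rounded figure back to the agent via an explicit --resources value rather than patching the sampling itself.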
The total disk size sometimes changes on our agents. It is off by a few bytes (we are seeing a difference of ~10 bytes out of, say, 600 GB). We use ZFS on our agents to manage the disk partition. From my colleague, Andrew (copied here):

> The current Mesos approach (i.e., `statvfs()` for total blocks and assume
> that never changes) won’t work reliably on ZFS

Has anyone else experienced this? We can likely hack a workaround by reporting the "whole GBs" of the disk so we are insensitive to small changes in the total size. But I am not sure whether the changes can be larger, given Andrew's point above.

On Mon, Mar 7, 2016 at 6:00 PM, Sharma Podila <[email protected]> wrote:

> Sure, will do.
>
> On Mon, Mar 7, 2016 at 5:54 PM, Benjamin Mahler <[email protected]> wrote:
>
>> Very surprising. I don't have any ideas other than trying to replicate
>> the scenario in a test.
>>
>> Please do keep us posted if you encounter it again and gain more
>> information.
>>
>> On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <[email protected]> wrote:
>>
>>> MESOS-4795 created.
>>>
>>> I don't have the exit status. We haven't seen a repeat yet; we will catch
>>> the exit status next time it happens.
>>>
>>> Yes, removing the metadata directory was the only way it was resolved.
>>> This happened on multiple hosts requiring the same resolution.
>>>
>>> On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <[email protected]> wrote:
>>>
>>>> Feel free to create one. I don't have enough information to know what
>>>> the issue is without doing some further investigation, but if the situation
>>>> you described is accurate, it seems like there are two strange bugs:
>>>>
>>>> - the silent exit (do you not have the exit status?), and
>>>> - the flapping from ZK errors that needed the meta data directory to be
>>>>   removed to resolve (are you convinced the removal of the meta directory is
>>>>   what solved it?)
>>>>
>>>> It would be good to track these issues in case they crop up again.
>>>>
>>>> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <[email protected]> wrote:
>>>>
>>>>> Hi Ben,
>>>>>
>>>>> Let me know if there is a new issue created for this; I would like to
>>>>> add myself to watch it.
>>>>> Thanks.
>>>>>
>>>>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <[email protected]> wrote:
>>>>>
>>>>>> Hi Ben,
>>>>>>
>>>>>> That is accurate, with one additional line:
>>>>>>
>>>>>> - Agent running fine with 0.24.1
>>>>>> - Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>> - ZK issue resolved
>>>>>> - Most agents stop flapping and function correctly
>>>>>> - Some agents continue flapping, but silently exit after printing the
>>>>>>   detector.cpp:481 log line.
>>>>>> - The agents that continued to flap were repaired by manually removing
>>>>>>   the contents of mesos-slave's working dir
>>>>>>
>>>>>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <[email protected]> wrote:
>>>>>>
>>>>>>> Hey Sharma,
>>>>>>>
>>>>>>> I didn't quite follow the timeline of events here or how the agent
>>>>>>> logs you posted fit into the timeline of events. Here's how I
>>>>>>> interpreted it:
>>>>>>>
>>>>>>> - Agent running fine with 0.24.1
>>>>>>> - Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>>> - ZK issue resolved
>>>>>>> - Most agents stop flapping and function correctly
>>>>>>> - Some agents continue flapping, but silently exit after printing the
>>>>>>>   detector.cpp:481 log line.
>>>>>>>
>>>>>>> Is this accurate? What is the exit code from the silent exit?
>>>>>>>
>>>>>>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <[email protected]> wrote:
>>>>>>>
>>>>>>>> Maybe related, but maybe different, since a new process seems to
>>>>>>>> find the master leader and still aborts, never recovering across restarts
>>>>>>>> until the work dir data is removed.
>>>>>>>> It is happening in 0.24.1.
>>>>>>>>
>>>>>>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I
>>>>>>>>> guess you are saying it is somehow related but not exactly the same
>>>>>>>>> issue?
>>>>>>>>>
>>>>>>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> On 9 February 2016 at 11:04, Sharma Podila <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> We had a few mesos agents stuck in an unrecoverable state after
>>>>>>>>>>> a transient ZK init error. Is this a known problem? I wasn't able to find
>>>>>>>>>>> an existing jira item for this. We are on 0.24.1 at this time.
>>>>>>>>>>>
>>>>>>>>>>> Most agents were fine, except a handful. Those agents had their
>>>>>>>>>>> mesos-slave process constantly restarting. The .INFO logfile had the
>>>>>>>>>>> contents below before the process exited with no error messages. The
>>>>>>>>>>> restarts were happening constantly due to an existing service keep-alive
>>>>>>>>>>> strategy.
>>>>>>>>>>>
>>>>>>>>>>> To fix it, we manually stopped the service, removed the data in
>>>>>>>>>>> the working dir, and then restarted it. The mesos-slave process was then
>>>>>>>>>>> able to restart. The manual intervention needed to resolve this is
>>>>>>>>>>> problematic.
>>>>>>>>>>>
>>>>>>>>>>> Here are the contents of the various log files on the agent.
>>>>>>>>>>>
>>>>>>>>>>> The .INFO logfile for one of the restarts before the mesos-slave
>>>>>>>>>>> process exited with no other error messages:
>>>>>>>>>>>
>>>>>>>>>>> Log file created at: 2016/02/09 02:12:48
>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
>>>>>>>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
>>>>>>>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>>>>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
>>>>>>>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>>>>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
>>>>>>>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>
>>>>>>>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>>>>>>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
>>>>>>>>>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
>>>>>>>>>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
>>>>>>>>>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
>>>>>>>>>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
>>>>>>>>>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
>>>>>>>>>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
>>>>>>>>>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
>>>>>>>>>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
>>>>>>>>>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master ([email protected]:7103) is detected
>>>>>>>>>>>
>>>>>>>>>>> The .FATAL log file when the original transient ZK error occurred:
>>>>>>>>>>>
>>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>>
>>>>>>>>>>> The .ERROR log file:
>>>>>>>>>>>
>>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>>
>>>>>>>>>>> The .WARNING file had the same content.
>>>>>>>>>>
>>>>>>>>>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>>>>>>>>>
>>>>>>>>>> -rgs

