Re: mesos agent not recovering after ZK init failure

Jie Yu Fri, 15 Jul 2016 11:42:12 -0700

Can you hard code your disk size using --resources flag?
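
As a rough sketch of what that could look like, combined with the "whole GBs"
idea quoted below: the Python snippet here reads the filesystem total via
statvfs(), rounds it down to a whole number of GB, and prints a --resources
value with the disk size in MB. The helper names and the "/" path are
hypothetical placeholders, not anything taken from Mesos itself, and the
printed number is only a starting point for a value you would then pin in your
agent config.

```python
import os

MB = 1024 * 1024
GB = 1024 * MB


def total_disk_bytes(path):
    """Total filesystem size in bytes, as reported by statvfs()."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks


def whole_gb_in_mb(total_bytes):
    """Round down to a whole number of GB and express the result in MB."""
    return (total_bytes // GB) * 1024


def resources_flag(disk_mb):
    """Format a disk-only value for the agent's --resources flag."""
    return "--resources=disk:{}".format(disk_mb)


if __name__ == "__main__":
    # "/" is a placeholder; use the filesystem that holds the agent work dir.
    print(resources_flag(whole_gb_in_mb(total_disk_bytes("/"))))
```

Note that rounding alone is not fully robust: a total sitting within a few
bytes of a GB boundary can still round to a different value after a small
change, which is one reason to pin a fixed number rather than recompute it on
every start.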


On Fri, Jul 15, 2016 at 11:31 AM, Sharma Podila <spod...@netflix.com> wrote:

> We had this issue happen again and were able to debug further. The cause
> for agent not being able to restart is that one of the resources (disk)
> changed its total size since the last restart. However, this error does not
> show up in INFO/WARN/ERROR files. We saw it in stdout only when manually
> restarting the agent. It would be good to have all messages going to
> stdout/stderr show up in the logs. Is there a config setting for it that I
> missed?
>
> The disk size total is changing sometimes on our agents. It is off by a
> few bytes (seeing ~10 bytes difference out of, say, 600 GB). We use ZFS on
> our agents to manage the disk partition. From my colleague, Andrew (copied
> here):
>
>> The current Mesos approach (i.e., `statvfs()` for total blocks and assume
>> that never changes) won’t work reliably on ZFS
>>
>
> Anyone else experience this? We can likely hack a workaround for this by
> reporting the "whole GBs" of the disk so we are insensitive to small
> changes in the total size. But, not sure if the changes can be larger due
> to Andrew's point above.
>
>
> On Mon, Mar 7, 2016 at 6:00 PM, Sharma Podila <spod...@netflix.com> wrote:
>
>> Sure, will do.
>>
>>
>> On Mon, Mar 7, 2016 at 5:54 PM, Benjamin Mahler <bmah...@apache.org>
>> wrote:
>>
>>> Very surprising.. I don't have any ideas other than trying to replicate
>>> the scenario in a test.
>>>
>>> Please do keep us posted if you encounter it again and gain more
>>> information.
>>>
>>> On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <spod...@netflix.com>
>>> wrote:
>>>
>>>> MESOS-4795 created.
>>>>
>>>> I don't have the exit status. We haven't seen a repeat yet, will catch
>>>> the exit status next time it happens.
>>>>
>>>> Yes, removing the metadata directory was the only way it was resolved.
>>>> This happened on multiple hosts requiring the same resolution.
>>>>
>>>>
>>>> On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <bmah...@apache.org>
>>>> wrote:
>>>>
>>>>> Feel free to create one. I don't have enough information to know what
>>>>> the issue is without doing some further investigation, but if the
>>>>> situation you described is accurate, it seems like there are two
>>>>> strange bugs:
>>>>>
>>>>> -the silent exit (do you not have the exit status?), and
>>>>> -the flapping from ZK errors that needed the meta data directory to be
>>>>> removed to resolve (are you convinced the removal of the meta directory is
>>>>> what solved it?)
>>>>>
>>>>> It would be good to track these issues in case they crop up again.
>>>>>
>>>>> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <spod...@netflix.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ben,
>>>>>>
>>>>>> Let me know if there is a new issue created for this, I would like to
>>>>>> add myself to watch it.
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <spod...@netflix.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ben,
>>>>>>>
>>>>>>> That is accurate, with one additional line:
>>>>>>>
>>>>>>> -Agent running fine with 0.24.1
>>>>>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>>> -ZK issue resolved
>>>>>>> -Most agents stop flapping and function correctly
>>>>>>> -Some agents continue flapping, but silent exit after printing the
>>>>>>> detector.cpp:481 log line.
>>>>>>> -The agents that continue to flap repaired with manual removal of
>>>>>>> contents in mesos-slave's working dir
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <bmah...@apache.org
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Hey Sharma,
>>>>>>>>
>>>>>>>> I didn't quite follow the timeline of events here or how the agent
>>>>>>>> logs you posted fit into the timeline of events. Here's how I
>>>>>>>> interpreted it:
>>>>>>>>
>>>>>>>> -Agent running fine with 0.24.1
>>>>>>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>>>> -ZK issue resolved
>>>>>>>> -Most agents stop flapping and function correctly
>>>>>>>> -Some agents continue flapping, but silent exit after printing the
>>>>>>>> detector.cpp:481 log line.
>>>>>>>>
>>>>>>>> Is this accurate? What is the exit code from the silent exit?
>>>>>>>>
>>>>>>>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <spod...@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Maybe related, but maybe different, since a new process seems to
>>>>>>>>> find the master leader and still aborts, never recovering with
>>>>>>>>> restarts until work dir data is removed.
>>>>>>>>> It is happening in 0.24.1.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <vinodk...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I
>>>>>>>>>> guess you are saying it is somehow related but not exactly the same 
>>>>>>>>>> issue?
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
>>>>>>>>>> r...@itevenworks.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> On 9 February 2016 at 11:04, Sharma Podila <spod...@netflix.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We had a few mesos agents stuck in an unrecoverable state after
>>>>>>>>>>>> a transient ZK init error. Is this a known problem? I wasn't able
>>>>>>>>>>>> to find an existing jira item for this. We are on 0.24.1 at this
>>>>>>>>>>>> time.
>>>>>>>>>>>>
>>>>>>>>>>>> Most agents were fine, except a handful. This handful of agents
>>>>>>>>>>>> had their mesos-slave process constantly restarting. The .INFO
>>>>>>>>>>>> logfile had the contents shown below before the process exited,
>>>>>>>>>>>> with no error messages. The restarts were happening constantly
>>>>>>>>>>>> due to an existing service keep-alive strategy.
>>>>>>>>>>>>
>>>>>>>>>>>> To fix it, we manually stopped the service, removed the data in
>>>>>>>>>>>> the working dir, and then restarted it. The mesos-slave process
>>>>>>>>>>>> was able to restart then. The manual intervention needed to
>>>>>>>>>>>> resolve it is problematic.
>>>>>>>>>>>>
>>>>>>>>>>>> Here are the contents of the various log files on the agent:
>>>>>>>>>>>>
>>>>>>>>>>>> The .INFO logfile for one of the restarts before mesos-slave
>>>>>>>>>>>> process exited with no other error messages:
>>>>>>>>>>>>
>>>>>>>>>>>> Log file created at: 2016/02/09 02:12:48
>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line]
>>>>>>>>>>>> msg
>>>>>>>>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging
>>>>>>>>>>>> started!
>>>>>>>>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30
>>>>>>>>>>>> 16:12:07 by builds
>>>>>>>>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>>>>>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using
>>>>>>>>>>>> isolation: posix/cpu,posix/mem,filesystem/posix
>>>>>>>>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>>>>>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@
>>>>>>>>>>>> 10.138.146.230:7101
>>>>>>>>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup:
>>>>>>>>>>>> --appc_store_dir="/tmp/mesos/store/appc"
>>>>>>>>>>>> --attributes="region:us-east-1;<snip>" --authenticatee="<snip>"
>>>>>>>>>>>> --cgroups_cpu_enable_pids_and_tids_count="false"
>>>>>>>>>>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>>>>>>>>>>>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>>>>>>>>>>>> --container_disk_watch_interval="15secs" --containerizers="mesos" 
>>>>>>>>>>>> <snip>"
>>>>>>>>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources:
>>>>>>>>>>>> ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>>>>>>>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname:
>>>>>>>>>>>> <snip>
>>>>>>>>>>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint:
>>>>>>>>>>>> true
>>>>>>>>>>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process
>>>>>>>>>>>> (group(1)@10.138.146.230:7101) connected to ZooKeeper
>>>>>>>>>>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group
>>>>>>>>>>>> operations: queue size (joins, cancels, datas) = (0, 0, 0)
>>>>>>>>>>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create
>>>>>>>>>>>> path '/titus/main/mesos' in ZooKeeper
>>>>>>>>>>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new
>>>>>>>>>>>> leader: (id='209')
>>>>>>>>>>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get
>>>>>>>>>>>> '/titus/main/mesos/json.info_0000000209' in ZooKeeper
>>>>>>>>>>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from
>>>>>>>>>>>> '/mnt/data/mesos/meta'
>>>>>>>>>>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find
>>>>>>>>>>>> resources file '/mnt/data/mesos/meta/resources/resources.info'
>>>>>>>>>>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading
>>>>>>>>>>>> master (UPID=master@10.230.95.110:7103) is detected
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The .FATAL log file when the original transient ZK error
>>>>>>>>>>>> occurred:
>>>>>>>>>>>>
>>>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line]
>>>>>>>>>>>> msg
>>>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create
>>>>>>>>>>>> ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The .ERROR log file:
>>>>>>>>>>>>
>>>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line]
>>>>>>>>>>>> msg
>>>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create
>>>>>>>>>>>> ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>>>
>>>>>>>>>>>> The .WARNING file had the same content.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -rgs
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
