Vinod, MESOS-5854 <https://issues.apache.org/jira/browse/MESOS-5854> created. Feel free to change the priority appropriately.
Yes, the workaround I mentioned for disk size is based on resource
specification, so that works for now.

On Fri, Jul 15, 2016 at 11:48 AM, Andrew Leung <[email protected]> wrote:

> Hi Jie,
>
> Yes, that is how we are working around this issue. However, we wanted to
> see if others were hitting this issue as well. If others have a similar
> Mesos slave on ZFS setup, it might be worth considering a disk space
> calculation approach that works more reliably with ZFS, or at least
> calling out the need to specify the disk resource explicitly.
>
> Thanks for the help.
> Andrew
>
> On Jul 15, 2016, at 11:41 AM, Jie Yu <[email protected]> wrote:
>
>> Can you hard-code your disk size using the --resources flag?
>
> On Fri, Jul 15, 2016 at 11:31 AM, Sharma Podila <[email protected]> wrote:
>
>> We had this issue happen again and were able to debug further. The cause
>> of the agent not being able to restart is that one of its resources
>> (disk) changed total size since the last restart. However, this error
>> does not show up in the INFO/WARN/ERROR files; we saw it on stdout only
>> when manually restarting the agent. It would be good to have all
>> messages going to stdout/stderr show up in the logs. Is there a config
>> setting for that which I missed?
>>
>> The total disk size is changing sometimes on our agents. It is off by a
>> few bytes (we see a ~10-byte difference out of, say, 600 GB). We use ZFS
>> on our agents to manage the disk partition. From my colleague, Andrew
>> (copied here):
>>
>>> The current Mesos approach (i.e., `statvfs()` for total blocks and
>>> assume that never changes) won't work reliably on ZFS.
>>
>> Has anyone else experienced this? We can likely hack a workaround by
>> reporting the "whole GBs" of the disk so we are insensitive to small
>> changes in the total size. But we are not sure the changes can't be
>> larger, per Andrew's point above.
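For reference, here is a minimal sketch of the kind of `statvfs()`-based
total-size calculation being discussed, together with the "whole GBs"
rounding idea. This is illustrative only, not the actual Mesos source; the
path is the agent work dir that appears in the logs further down the thread.

    // Illustrative sketch only -- not the actual Mesos source.
    #include <sys/statvfs.h>

    #include <cstdint>
    #include <cstdio>

    // Total size in bytes of the filesystem containing 'path', or 0 on error.
    static uint64_t totalBytes(const char* path)
    {
      struct statvfs s;
      if (statvfs(path, &s) != 0) {
        return 0;
      }

      // f_blocks is the total data block count in units of f_frsize. On ZFS
      // this product can drift by a few bytes between calls, which is the
      // mismatch that breaks agent recovery.
      return static_cast<uint64_t>(s.f_blocks) * s.f_frsize;
    }

    int main()
    {
      const uint64_t GB = 1024ULL * 1024 * 1024;

      // '/mnt/data/mesos' is the agent work dir shown in the logs below.
      const uint64_t total = totalBytes("/mnt/data/mesos");

      // Rounding down to whole GBs absorbs a ~10-byte drift in a ~600 GB
      // total, i.e. the "whole GBs" workaround mentioned above.
      std::printf("disk total: %llu GB\n",
                  static_cast<unsigned long long>(total / GB));

      return 0;
    }

For the explicit-specification workaround Jie suggests, the size can instead
be pinned at agent startup, e.g. --resources="disk:586104" (the value is in
MB, matching the "Slave resources" log line further down).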
>> On Mon, Mar 7, 2016 at 6:00 PM, Sharma Podila <[email protected]> wrote:
>>
>>> Sure, will do.
>>>
>>> On Mon, Mar 7, 2016 at 5:54 PM, Benjamin Mahler <[email protected]> wrote:
>>>
>>>> Very surprising... I don't have any ideas other than trying to
>>>> replicate the scenario in a test.
>>>>
>>>> Please do keep us posted if you encounter it again and gain more
>>>> information.
>>>>
>>>> On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <[email protected]> wrote:
>>>>
>>>>> MESOS-4795 created.
>>>>>
>>>>> I don't have the exit status. We haven't seen a repeat yet; we will
>>>>> catch the exit status the next time it happens.
>>>>>
>>>>> Yes, removing the metadata directory was the only way it was
>>>>> resolved. This happened on multiple hosts, requiring the same
>>>>> resolution.
>>>>>
>>>>> On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <[email protected]> wrote:
>>>>>
>>>>>> Feel free to create one. I don't have enough information to know
>>>>>> what the issue is without doing some further investigation, but if
>>>>>> the situation you described is accurate, it seems like there are two
>>>>>> strange bugs:
>>>>>>
>>>>>> - the silent exit (do you not have the exit status?), and
>>>>>> - the flapping from ZK errors that needed the metadata directory to
>>>>>>   be removed to resolve (are you convinced the removal of the meta
>>>>>>   directory is what solved it?)
>>>>>>
>>>>>> It would be good to track these issues in case they crop up again.
>>>>>>
>>>>>> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Ben,
>>>>>>>
>>>>>>> Let me know if there is a new issue created for this; I would like
>>>>>>> to add myself to watch it.
>>>>>>> Thanks.
>>>>>>>
>>>>>>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Ben,
>>>>>>>>
>>>>>>>> That is accurate, with one additional line:
>>>>>>>>
>>>>>>>> - Agent running fine with 0.24.1
>>>>>>>> - Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>>>> - ZK issue resolved
>>>>>>>> - Most agents stop flapping and function correctly
>>>>>>>> - Some agents continue flapping, but silently exit after printing
>>>>>>>>   the detector.cpp:481 log line
>>>>>>>> - The agents that continue to flap are repaired by manual removal
>>>>>>>>   of the contents of mesos-slave's working dir
>>>>>>>>
>>>>>>>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey Sharma,
>>>>>>>>>
>>>>>>>>> I didn't quite follow the timeline of events here or how the
>>>>>>>>> agent logs you posted fit into it. Here's how I interpreted it:
>>>>>>>>>
>>>>>>>>> - Agent running fine with 0.24.1
>>>>>>>>> - Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>>>>> - ZK issue resolved
>>>>>>>>> - Most agents stop flapping and function correctly
>>>>>>>>> - Some agents continue flapping, but silently exit after printing
>>>>>>>>>   the detector.cpp:481 log line
>>>>>>>>>
>>>>>>>>> Is this accurate? What is the exit code from the silent exit?
>>>>>>>>>
>>>>>>>>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Maybe related, but maybe different, since a new process seems to
>>>>>>>>>> find the master leader and still aborts, never recovering across
>>>>>>>>>> restarts until the work dir data is removed.
>>>>>>>>>> It is happening in 0.24.1.
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I
>>>>>>>>>>> guess you are saying it is somehow related but not exactly the
>>>>>>>>>>> same issue?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 9 February 2016 at 11:04, Sharma Podila <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We had a few Mesos agents stuck in an unrecoverable state
>>>>>>>>>>>>> after a transient ZK init error. Is this a known problem? I
>>>>>>>>>>>>> wasn't able to find an existing JIRA item for this. We are on
>>>>>>>>>>>>> 0.24.1 at this time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most agents were fine, except a handful. These agents had
>>>>>>>>>>>>> their mesos-slave process constantly restarting. The .INFO
>>>>>>>>>>>>> logfile had the contents shown below before the process
>>>>>>>>>>>>> exited, with no error messages. The restarts were happening
>>>>>>>>>>>>> constantly due to an existing service keep-alive strategy.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To fix it, we manually stopped the service, removed the data
>>>>>>>>>>>>> in the working dir, and then restarted it. The mesos-slave
>>>>>>>>>>>>> process was then able to restart.
>>>>>>>>>>>>> The manual intervention needed to resolve it is problematic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are the contents of the various log files on the agent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .INFO logfile for one of the restarts before the
>>>>>>>>>>>>> mesos-slave process exited with no other error messages:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Log file created at: 2016/02/09 02:12:48
>>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
>>>>>>>>>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
>>>>>>>>>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>>>>>>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
>>>>>>>>>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>>>>>>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
>>>>>>>>>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>"
>>>>>>>>>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>>>>>>>>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
>>>>>>>>>>>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
>>>>>>>>>>>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
>>>>>>>>>>>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
>>>>>>>>>>>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
>>>>>>>>>>>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
>>>>>>>>>>>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
>>>>>>>>>>>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
>>>>>>>>>>>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
>>>>>>>>>>>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master ([email protected]:7103) is detected
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .FATAL log file when the original transient ZK error occurred:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .ERROR log file:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .WARNING file had the same content.
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>>>>>>>>>>>
>>>>>>>>>>>> -rgs
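For completeness, the manual recovery described earlier in the thread looks
roughly like the following. This is a sketch only: the service name and the
work dir path ('/mnt/data/mesos', per the logs above) are specific to this
setup. Removing the meta directory discards the agent's checkpointed state,
so the agent re-registers with a new agent ID.

    # Sketch of the manual recovery; service name and paths are illustrative.
    sudo service mesos-slave stop

    # Discard the agent's checkpointed state. The agent will come back with
    # a new agent ID, so tasks that were recoverable on this host are lost.
    sudo rm -rf /mnt/data/mesos/meta

    sudo service mesos-slave start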

