Feel free to create one. I don't have enough information to know what the
issue is without doing some further investigation, but if the situation you
described is accurate, it seems like there are two strange bugs:

-the silent exit (do you not have the exit status? see the wrapper sketch
below), and
-the flapping from ZK errors that required the metadata directory to be
removed to resolve (are you convinced that removing the meta directory is
what solved it?)

It would be good to track these issues in case they crop up again.
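
On the first bug: if the keep-alive wrapper forks the agent itself, it can
recover the exit status via waitpid(2), which would tell us whether the
"silent exit" is a clean exit, a nonzero status, or a fatal signal. A
minimal sketch of such a wrapper (hypothetical, not Mesos code; the
mesos-slave command line would be passed as its arguments):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
    return 1;
  }
  for (;;) {
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    if (pid == 0) {
      /* Child: exec the supervised command, e.g. mesos-slave ... */
      execvp(argv[1], &argv[1]);
      perror("execvp");
      _exit(127);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) { perror("waitpid"); return 1; }
    if (WIFEXITED(status))
      fprintf(stderr, "child exited with status %d\n", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
      fprintf(stderr, "child killed by signal %d\n", WTERMSIG(status));
    sleep(1); /* brief backoff before the keep-alive restart */
  }
}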

On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <spod...@netflix.com> wrote:

> Hi Ben,
>
> Let me know if there is a new issue created for this; I would like to add
> myself to watch it.
> Thanks.
>
>
>
> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <spod...@netflix.com>
> wrote:
>
>> Hi Ben,
>>
>> That is accurate, with one additional line:
>>
>> -Agent running fine with 0.24.1
>> -Transient ZK issues, slave flapping with zookeeper_init failure
>> -ZK issue resolved
>> -Most agents stop flapping and function correctly
>> -Some agents continue flapping, but exit silently after printing the
>> detector.cpp:481 log line.
>> -The agents that continue to flap were repaired by manually removing the
>> contents of the mesos-slave's working dir
>>
>>
>>
>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <bmah...@apache.org>
>> wrote:
>>
>>> Hey Sharma,
>>>
>>> I didn't quite follow the timeline of events here or how the agent logs
>>> you posted fit into it. Here's how I interpreted it:
>>>
>>> -Agent running fine with 0.24.1
>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>> -ZK issue resolved
>>> -Most agents stop flapping and function correctly
>>> -Some agents continue flapping, but exit silently after printing the
>>> detector.cpp:481 log line.
>>>
>>> Is this accurate? What is the exit code from the silent exit?
>>>
>>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <spod...@netflix.com>
>>> wrote:
>>>
>>>> Maybe related, but maybe different, since a new process seems to find
>>>> the master leader and still aborts, never recovering across restarts
>>>> until the work dir data is removed.
>>>> It is happening in 0.24.1.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <vinodk...@apache.org>
>>>> wrote:
>>>>
>>>>> MESOS-1326 was fixed in 0.19.0 (I've set the fix version now). But I
>>>>> guess you are saying it is somehow related but not exactly the same
>>>>> issue?
>>>>>
>>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
>>>>> r...@itevenworks.net> wrote:
>>>>>
>>>>>> On 9 February 2016 at 11:04, Sharma Podila <spod...@netflix.com>
>>>>>> wrote:
>>>>>>
>>>>>>> We had a few mesos agents stuck in an unrecoverable state after a
>>>>>>> transient ZK init error. Is this a known problem? I wasn't able to find
>>>>>>> an existing JIRA item for this. We are on 0.24.1 at this time.
>>>>>>>
>>>>>>> Most agents were fine, except a handful. These agents had their
>>>>>>> mesos-slave process constantly restarting. The .INFO logfile had the
>>>>>>> contents below before the process exited, with no error messages. The
>>>>>>> restarts were happening constantly due to an existing service
>>>>>>> keep-alive strategy.
>>>>>>>
>>>>>>> To fix it, we manually stopped the service, removed the data in the
>>>>>>> working dir, and then restarted it. The mesos-slave process was then
>>>>>>> able to start and stay up. The manual intervention needed to resolve
>>>>>>> this is problematic.
>>>>>>>
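>>>>>>> In case it helps, here is the removal step expressed as a small
>>>>>>> program (the path is the meta directory from our logs; whether
>>>>>>> removing just that, versus the whole work dir, was strictly necessary
>>>>>>> is an open question):
>>>>>>>
>>>>>>> /* Hedged sketch of the manual cleanup; run only with the agent
>>>>>>>  * stopped. The hard-coded path is from our environment. */
>>>>>>> #define _XOPEN_SOURCE 500
>>>>>>> #include <ftw.h>
>>>>>>> #include <stdio.h>
>>>>>>> #include <unistd.h>
>>>>>>>
>>>>>>> static int rm_entry(const char *path, const struct stat *sb,
>>>>>>>                     int type, struct FTW *ftw) {
>>>>>>>   /* FTW_DP marks a directory visited after its children. */
>>>>>>>   int rc = (type == FTW_DP) ? rmdir(path) : unlink(path);
>>>>>>>   if (rc != 0) perror(path);
>>>>>>>   return rc;
>>>>>>> }
>>>>>>>
>>>>>>> int main(void) {
>>>>>>>   /* FTW_DEPTH gives post-order traversal so rmdir sees empty dirs. */
>>>>>>>   return nftw("/mnt/data/mesos/meta", rm_entry, 64,
>>>>>>>               FTW_DEPTH | FTW_PHYS) == 0 ? 0 : 1;
>>>>>>> }
>>>>>>>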
>>>>>>> Here's the contents of the various log files on the agent:
>>>>>>>
>>>>>>> The .INFO logfile for one of the restarts, before the mesos-slave
>>>>>>> process exited with no other error messages:
>>>>>>>
>>>>>>> Log file created at: 2016/02/09 02:12:48
>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging
>>>>>>> started!
>>>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07
>>>>>>> by builds
>>>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation:
>>>>>>> posix/cpu,posix/mem,filesystem/posix
>>>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@
>>>>>>> 10.138.146.230:7101
>>>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup:
>>>>>>> --appc_store_dir="/tmp/mesos/store/appc"
>>>>>>> --attributes="region:us-east-1;<snip>" --authenticatee="<snip>"
>>>>>>> --cgroups_cpu_enable_pids_and_tids_count="false"
>>>>>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>>>>>>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>>>>>>> --container_disk_watch_interval="15secs" --containerizers="mesos" 
>>>>>>> <snip>"
>>>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources:
>>>>>>> ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
>>>>>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
>>>>>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@
>>>>>>> 10.138.146.230:7101) connected to ZooKeeper
>>>>>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations:
>>>>>>> queue size (joins, cancels, datas) = (0, 0, 0)
>>>>>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path
>>>>>>> '/titus/main/mesos' in ZooKeeper
>>>>>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader:
>>>>>>> (id='209')
>>>>>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get
>>>>>>> '/titus/main/mesos/json.info_0000000209' in ZooKeeper
>>>>>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from
>>>>>>> '/mnt/data/mesos/meta'
>>>>>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources
>>>>>>> file '/mnt/data/mesos/meta/resources/resources.info'
>>>>>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master
>>>>>>> (UPID=master@10.230.95.110:7103) is detected
>>>>>>>
>>>>>>>
>>>>>>> The .FATAL log file when the original transient ZK error occurred:
>>>>>>>
>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create
>>>>>>> ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>
>>>>>>>
>>>>>>> The .ERROR log file:
>>>>>>>
>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create
>>>>>>> ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>
>>>>>>> The .WARNING file had the same content.
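>>>>>>>
>>>>>>> For what it's worth, that FATAL line matches the ZooKeeper C client's
>>>>>>> failure mode: zookeeper_init() returns NULL on error and leaves the
>>>>>>> reason in errno (here 2, ENOENT). A minimal sketch of that check,
>>>>>>> with a placeholder host string (not the actual zookeeper.cpp code):
>>>>>>>
>>>>>>> #include <errno.h>
>>>>>>> #include <stdio.h>
>>>>>>> #include <string.h>
>>>>>>> #include <zookeeper/zookeeper.h>
>>>>>>>
>>>>>>> static void watcher(zhandle_t *zh, int type, int state,
>>>>>>>                     const char *path, void *ctx) {}
>>>>>>>
>>>>>>> int main(void) {
>>>>>>>   /* Placeholder ensemble string; ours comes from the zk URL in the
>>>>>>>    * --master flag. */
>>>>>>>   zhandle_t *zh = zookeeper_init("zk1:2181,zk2:2181", watcher,
>>>>>>>                                  10000 /* session timeout, ms */,
>>>>>>>                                  NULL, NULL, 0);
>>>>>>>   if (zh == NULL) {
>>>>>>>     /* The condition zookeeper.cpp:110 reports as fatal. */
>>>>>>>     fprintf(stderr, "zookeeper_init: %s [%d]\n",
>>>>>>>             strerror(errno), errno);
>>>>>>>     return 1;
>>>>>>>   }
>>>>>>>   zookeeper_close(zh);
>>>>>>>   return 0;
>>>>>>> }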
>>>>>>>
>>>>>>
>>>>>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>>>>>
>>>>>>
>>>>>> -rgs
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
