Sure, will do.
On Mon, Mar 7, 2016 at 5:54 PM, Benjamin Mahler <[email protected]> wrote:

> Very surprising... I don't have any ideas other than trying to
> replicate the scenario in a test.
>
> Please do keep us posted if you encounter it again and gain more
> information.
>
> On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <[email protected]>
> wrote:
>
>> MESOS-4795 created.
>>
>> I don't have the exit status. We haven't seen a repeat yet; we will
>> catch the exit status next time it happens.
>>
>> Yes, removing the metadata directory was the only way it was
>> resolved. This happened on multiple hosts, requiring the same
>> resolution.
>>
>> On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <[email protected]>
>> wrote:
>>
>>> Feel free to create one. I don't have enough information to know
>>> what the issue is without doing some further investigation, but if
>>> the situation you described is accurate, it seems like there are
>>> two strange bugs:
>>>
>>> - the silent exit (do you not have the exit status?), and
>>> - the flapping from ZK errors that needed the metadata directory to
>>>   be removed to resolve (are you convinced the removal of the meta
>>>   directory is what solved it?)
>>>
>>> It would be good to track these issues in case they crop up again.
>>>
>>> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <[email protected]>
>>> wrote:
>>>
>>>> Hi Ben,
>>>>
>>>> Let me know if there is a new issue created for this; I would like
>>>> to add myself to watch it.
>>>> Thanks.
>>>>
>>>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Ben,
>>>>>
>>>>> That is accurate, with one additional line:
>>>>>
>>>>> - Agent running fine with 0.24.1
>>>>> - Transient ZK issues, slave flapping with zookeeper_init failure
>>>>> - ZK issue resolved
>>>>> - Most agents stop flapping and function correctly
>>>>> - Some agents continue flapping, but exit silently after printing
>>>>>   the detector.cpp:481 log line
>>>>> - The agents that continue to flap are repaired by manual removal
>>>>>   of the contents of the mesos-slave's working dir
>>>>>
>>>>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hey Sharma,
>>>>>>
>>>>>> I didn't quite follow the timeline of events here or how the
>>>>>> agent logs you posted fit into it. Here's how I interpreted it:
>>>>>>
>>>>>> - Agent running fine with 0.24.1
>>>>>> - Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>> - ZK issue resolved
>>>>>> - Most agents stop flapping and function correctly
>>>>>> - Some agents continue flapping, but exit silently after printing
>>>>>>   the detector.cpp:481 log line
>>>>>>
>>>>>> Is this accurate? What is the exit code from the silent exit?
>>>>>>
>>>>>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Maybe related, but maybe different, since a new process seems to
>>>>>>> find the master leader and still aborts, never recovering across
>>>>>>> restarts until the work dir data is removed.
>>>>>>> It is happening in 0.24.1.
>>>>>>>
>>>>>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I
>>>>>>>> guess you are saying it is somehow related but not exactly the
>>>>>>>> same issue?
>>>>>>>>
>>>>>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> On 9 February 2016 at 11:04, Sharma Podila <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> We had a few Mesos agents stuck in an unrecoverable state
>>>>>>>>>> after a transient ZK init error. Is this a known problem? I
>>>>>>>>>> wasn't able to find an existing JIRA item for it. We are on
>>>>>>>>>> 0.24.1 at this time.
>>>>>>>>>>
>>>>>>>>>> Most agents were fine, except a handful whose mesos-slave
>>>>>>>>>> process was constantly restarting. The .INFO logfile had the
>>>>>>>>>> contents below before the process exited, with no error
>>>>>>>>>> messages. The restarts were happening constantly due to an
>>>>>>>>>> existing service keep-alive strategy.
>>>>>>>>>>
>>>>>>>>>> To fix it, we manually stopped the service, removed the data
>>>>>>>>>> in the working dir, and then restarted it. The mesos-slave
>>>>>>>>>> process was then able to restart. The manual intervention
>>>>>>>>>> needed to resolve this is problematic.
>>>>>>>>>>
>>>>>>>>>> Here are the contents of the various log files on the agent.
>>>>>>>>>>
>>>>>>>>>> The .INFO logfile for one of the restarts, before the
>>>>>>>>>> mesos-slave process exited with no other error messages:
>>>>>>>>>>
>>>>>>>>>> Log file created at: 2016/02/09 02:12:48
>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
>>>>>>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
>>>>>>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>>>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
>>>>>>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>>>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
>>>>>>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>"
>>>>>>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>>>>>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
>>>>>>>>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
>>>>>>>>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
>>>>>>>>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
>>>>>>>>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
>>>>>>>>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
>>>>>>>>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
>>>>>>>>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
>>>>>>>>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
>>>>>>>>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master ([email protected]:7103) is detected
>>>>>>>>>>
>>>>>>>>>> The .FATAL log file when the original transient ZK error
>>>>>>>>>> occurred:
>>>>>>>>>>
>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>
>>>>>>>>>> The .ERROR log file:
>>>>>>>>>>
>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>
>>>>>>>>>> The .WARNING file had the same content.
>>>>>>>>>
>>>>>>>>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>>>>>>>>
>>>>>>>>> -rgs
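
A note on catching the exit status asked about in the thread: a
keep-alive wrapper can log how the mesos-slave process exits before
restarting it. A minimal sketch in Python follows; the binary path,
flags, and backoff interval are illustrative assumptions, not the
actual service configuration from the thread.

    #!/usr/bin/env python3
    # Minimal keep-alive wrapper that records how mesos-slave exits.
    import subprocess
    import sys
    import time

    # Illustrative command line; substitute the real agent binary and flags.
    CMD = ["/usr/sbin/mesos-slave", "--work_dir=/mnt/data/mesos"]

    while True:
        rc = subprocess.call(CMD)
        if rc < 0:
            # subprocess reports death-by-signal as a negative return code.
            print("mesos-slave killed by signal", -rc, file=sys.stderr)
        else:
            print("mesos-slave exited with status", rc, file=sys.stderr)
        time.sleep(5)  # brief backoff before the keep-alive restart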
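
Similarly, a minimal sketch of the manual remediation described in the
thread (stop the service, remove the work dir data, restart). It
assumes a systemd-managed service named mesos-slave and the
--work_dir=/mnt/data/mesos seen in the agent logs; both names are
assumptions about the environment.

    #!/usr/bin/env python3
    # Sketch of the manual remediation: stop the agent, wipe its work
    # dir, restart it. Run as root; service and path names are assumed.
    import shutil
    import subprocess

    SERVICE = "mesos-slave"       # assumed systemd unit name
    WORK_DIR = "/mnt/data/mesos"  # --work_dir seen in the agent logs

    subprocess.check_call(["systemctl", "stop", SERVICE])
    # Removing the work dir discards checkpointed state (including the
    # agent ID), so the agent re-registers as a new agent and any
    # recoverable tasks on the host are lost.
    shutil.rmtree(WORK_DIR)
    subprocess.check_call(["systemctl", "start", SERVICE])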

