Hi Jie,

Yes, that is how we are working around the issue. However, we wanted to see whether others were hitting it as well. If others run Mesos agents on a similar ZFS setup, it might be worth considering a disk-space calculation approach that works more reliably with ZFS, or at least calling out the need to specify the disk resource explicitly.
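For illustration, here is a rough sketch of the kind of calculation we have in mind: a plain statvfs(), rounded down to whole gigabytes so that small fluctuations in the reported total disappear. The mount point and the megabyte output are assumptions for the example, not Mesos source:

    // Sketch of the whole-GB workaround: query capacity the way Mesos does
    // (statvfs) but round down so byte-level drift in the ZFS-reported total
    // does not change the advertised resource. Illustrative only.
    #include <sys/statvfs.h>
    #include <cstdint>
    #include <cstdio>
    #include <iostream>

    int main() {
      struct statvfs vfs;
      // Query the filesystem backing the agent work directory (example path;
      // substitute the actual mount point).
      if (statvfs("/mnt/data/mesos", &vfs) != 0) {
        std::perror("statvfs");
        return 1;
      }

      // Total capacity in bytes: fragment size times total fragments.
      const uint64_t totalBytes =
          static_cast<uint64_t>(vfs.f_frsize) * vfs.f_blocks;

      // Round down to whole gigabytes so a drift of a few bytes in the
      // reported total is invisible.
      const uint64_t GB = 1024ULL * 1024 * 1024;
      const uint64_t wholeGB = totalBytes / GB;

      // Mesos expresses disk in megabytes (e.g. disk(*):586104), so print MB.
      std::cout << "disk(*):" << wholeGB * 1024 << std::endl;
      return 0;
    }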
Thanks for the help.

Andrew


On Jul 15, 2016, at 11:41 AM, Jie Yu <[email protected]> wrote:

Can you hard-code your disk size using the --resources flag?


On Fri, Jul 15, 2016 at 11:31 AM, Sharma Podila <[email protected]> wrote:

We had this issue happen again and were able to debug further. The agent cannot restart because one of its resources (disk) changed its total size since the last restart. However, this error does not show up in the INFO/WARN/ERROR files; we saw it on stdout only when manually restarting the agent. It would be good to have all messages going to stdout/stderr show up in the logs. Is there a config setting for that which I missed?

The disk size total sometimes changes on our agents. It is off by a few bytes (we see a difference of roughly 10 bytes out of, say, 600 GB). We use ZFS on our agents to manage the disk partition. From my colleague, Andrew (copied here):

    The current Mesos approach (i.e., `statvfs()` for total blocks and assume that never changes) won't work reliably on ZFS.

Has anyone else experienced this? We can likely hack a workaround by reporting the "whole GBs" of the disk so we are insensitive to small changes in the total size. But we are not sure whether the changes can be larger, given Andrew's point above.
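To make the failure mode concrete, here is a simplified sketch of the kind of exact-equality comparison that appears to reject the restart. The struct, field names, and numbers are hypothetical stand-ins, not the actual Mesos recovery code:

    // Hypothetical stand-in for the resources an agent checkpoints and the
    // resources it detects at startup. Illustrative only.
    #include <cstdint>
    #include <iostream>

    struct Resources {
      uint64_t diskMB;
      uint64_t memMB;
      uint32_t cpus;
    };

    bool operator==(const Resources& a, const Resources& b) {
      return a.diskMB == b.diskMB && a.memMB == b.memMB && a.cpus == b.cpus;
    }

    int main() {
      Resources checkpointed{586104, 240135, 32};  // from the previous run's meta dir
      Resources detected{586103, 240135, 32};      // ZFS reported a slightly smaller total

      // An exact-equality check is what byte-level drift trips: the restarted
      // agent considers its checkpoint incompatible and exits.
      if (!(checkpointed == detected)) {
        std::cerr << "Incompatible slave info detected: total disk changed "
                  << "since last restart" << std::endl;
        return 1;
      }

      std::cout << "Checkpoint compatible; recovery can proceed" << std::endl;
      return 0;
    }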
On Mon, Mar 7, 2016 at 6:00 PM, Sharma Podila <[email protected]> wrote:

Sure, will do.


On Mon, Mar 7, 2016 at 5:54 PM, Benjamin Mahler <[email protected]> wrote:

Very surprising. I don't have any ideas other than trying to replicate the scenario in a test.

Please do keep us posted if you encounter it again and gain more information.


On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <[email protected]> wrote:

MESOS-4795 created.

I don't have the exit status. We haven't seen a repeat yet; we will catch the exit status next time it happens.

Yes, removing the metadata directory was the only way it was resolved. This happened on multiple hosts, all requiring the same resolution.


On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <[email protected]> wrote:

Feel free to create one. I don't have enough information to know what the issue is without doing some further investigation, but if the situation you described is accurate, there seem to be two strange bugs:

- the silent exit (do you not have the exit status?), and
- the flapping from ZK errors that needed the meta data directory to be removed to resolve (are you convinced the removal of the meta directory is what solved it?)

It would be good to track these issues in case they crop up again.


On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <[email protected]> wrote:

Hi Ben,

Let me know if there is a new issue created for this; I would like to add myself to watch it. Thanks.


On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <[email protected]> wrote:

Hi Ben,

That is accurate, with one additional line:

- Agent running fine with 0.24.1
- Transient ZK issues, slave flapping with zookeeper_init failure
- ZK issue resolved
- Most agents stop flapping and function correctly
- Some agents continue flapping, but exit silently after printing the detector.cpp:481 log line
- The agents that continue to flap are repaired by manual removal of the contents of the mesos-slave working dir


On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <[email protected]> wrote:

Hey Sharma,

I didn't quite follow the timeline of events here, or how the agent logs you posted fit into it. Here's how I interpreted it:

- Agent running fine with 0.24.1
- Transient ZK issues, slave flapping with zookeeper_init failure
- ZK issue resolved
- Most agents stop flapping and function correctly
- Some agents continue flapping, but exit silently after printing the detector.cpp:481 log line

Is this accurate? What is the exit code from the silent exit?


On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <[email protected]> wrote:

Maybe related, but maybe different, since a new process seems to find the master leader and still aborts, never recovering across restarts until the work dir data is removed. It is happening in 0.24.1.


On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <[email protected]> wrote:

MESOS-1326 was fixed in 0.19.0 (I set the fix version now). But I guess you are saying it is somehow related but not exactly the same issue?


On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <[email protected]> wrote:

On 9 February 2016 at 11:04, Sharma Podila <[email protected]> wrote:

We had a few Mesos agents stuck in an unrecoverable state after a transient ZK init error. Is this a known problem? I wasn't able to find an existing JIRA item for it. We are on 0.24.1 at this time.

Most agents were fine, except a handful whose mesos-slave process was constantly restarting. The .INFO logfile had the contents below before the process exited, with no error messages. The restarts happened constantly due to an existing service keep-alive strategy.

To fix it, we manually stopped the service, removed the data in the working dir, and then restarted it. The mesos-slave process was then able to restart. The manual intervention needed to resolve this is problematic.

Here are the contents of the various log files on the agent.

The .INFO logfile for one of the restarts before the mesos-slave process exited with no other error messages:

    Log file created at: 2016/02/09 02:12:48
    Running on machine: titusagent-main-i-7697a9c5
    Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
    I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
    I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
    I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
    I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
    I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
    I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
    I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>
    I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
    I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
    I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
    I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
    I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
    I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
    I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
    I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
    I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
    I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
    I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master ([email protected]:7103) is detected

The .FATAL log file when the original transient ZK error occurred:

    Log file created at: 2016/02/05 17:21:37
    Running on machine: titusagent-main-i-7697a9c5
    Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
    F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]

The .ERROR log file:

    Log file created at: 2016/02/05 17:21:37
    Running on machine: titusagent-main-i-7697a9c5
    Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
    F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]

The .WARNING file had the same content.

Maybe related: https://issues.apache.org/jira/browse/MESOS-1326


-rgs

