Hi Jeremy,

Can you read the description of these <https://github.com/apache/mesos/blob/249bc26306574d9db0527c04b7a83a1f1e75f71b/src/master/flags.cpp#L393-L422> parameters on the master, and possibly share your values for these flags?
It seems from the re-registration attempt on the agent that the master has already treated the agent as "failed", and so will tell it to shut down on any re-registration attempt. I'm curious whether there is a conflict (or too narrow a time gap) between the timeouts in your environment, leaving the agent too little time to re-register after it notices it needs to re-establish the connection.

--
*Joris Van Remoortere*
Mesosphere

On Tue, Nov 10, 2015 at 5:02 AM, Jeremy Olexa <[email protected]> wrote:

> Hi Tommy, Erik, all,
>
>
> You are correct in your assumption that I'm trying to solve for a one hour
> session expire time on a firewall. For some more background info, our
> master cluster is in datacenter X, the slaves in X will stay "up" for days
> and days. The slaves in a different datacenter, Y, connected to that master
> cluster will stay "up" for about a few days and restart. The master cluster
> is healthy, with a stable leader for months (no flapping), same for the ZK
> "leader". There are about 35 slaves in datacenter Y. Maybe the firewall
> session timer is a red herring because the slave restart is seemingly
> random (the slave with the highest uptime is 6 days, but a handful only
> have uptime of a day)
>
>
> I've started debugging this awhile ago, and the gist of the logs is here:
> https://gist.github.com/jolexa/1a80e26a4b017846d083 I've posted this back
> in October seeking help and Benjamin suggested network issues in both
> directions, so I thought firewall.
>
>
> Thanks for any hints,
>
> Jeremy
>
> ------------------------------
> *From:* tommy xiao <[email protected]>
> *Sent:* Tuesday, November 10, 2015 3:07 AM
> *To:* [email protected]
> *Subject:* Re: Mesos and Zookeeper TCP keepalive
>
> same here, same question with Erik. could you please input more
> background info, thanks
>
> 2015-11-10 15:56 GMT+08:00 Erik Weathers <[email protected]>:
>
>> It would really help if you (Jeremy) explained the *actual* problem you
>> are facing.
>> I'm *guessing* that it's a firewall timing out the sessions
>> because there isn't activity on them for whatever the timeout of the
>> firewall is? It seems likely to be unreasonably short, given that mesos
>> has constant activity between master and
>> slave/agent/whatever-it-is-being-called-nowadays-but-not-really-yet-maybe-someday-for-reals.
>>
>> - Erik
>>
>> On Mon, Nov 9, 2015 at 10:00 PM, Jojy Varghese <[email protected]>
>> wrote:
>>
>>> Hi Jeremy
>>> It's great that you are making progress but I doubt if this is what you
>>> intend to achieve since network failures are a valid state in distributed
>>> systems. If you think there is a special case you are trying to solve, I
>>> suggest proposing a design document for review.
>>> For ZK client code, I would suggest asking the zookeeper mailing list.
>>>
>>> thanks
>>> -Jojy
>>>
>>> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <[email protected]> wrote:
>>>
>>> Alright, great, I'm making some progress.
>>>
>>> I did a simple copy/paste modification and recompiled mesos. The
>>> keepalive timer is set from slave to master so this is an improvement for
>>> me. I didn't test the other direction yet -
>>> https://gist.github.com/jolexa/ee9e152aa7045c558e02 - I'd like to file
>>> an enhancement request for this since it seems like an improvement for
>>> other people as well, after some real world testing.
>>>
>>> I'm having a harder time figuring out the zk client code. I started
>>> by modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c but either a)
>>> my change wasn't correct or b) I'm modifying the wrong file, since I
>>> just assumed it was using the C client. Is this the correct place?
>>>
>>> Thanks much,
>>> Jeremy
>>>
>>>
>>> ------------------------------
>>> *From:* Jojy Varghese <[email protected]>
>>> *Sent:* Monday, November 9, 2015 2:09 PM
>>> *To:* [email protected]
>>> *Subject:* Re: Mesos and Zookeeper TCP keepalive
>>>
>>> Hi Jeremy
>>> The "network" code is at
>>> "3rdparty/libprocess/include/process/network.hpp",
>>> "3rdparty/libprocess/src/poll_socket.hpp/cpp".
>>>
>>> thanks
>>> jojy
>>>
>>>
>>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> Jojy, that is correct, but more specifically a keepalive timer from
>>> slave to master and slave to zookeeper. Can you send a link to the portion
>>> of the code that builds the socket/connection? Is there any reason to not
>>> set the SO_KEEPALIVE option in your opinion?
>>>
>>> haosdent, I'm not looking for keepalive between zk quorum members, like
>>> the ZOOKEEPER JIRA is referencing.
>>>
>>> Thanks,
>>> Jeremy
>>>
>>>
>>> ------------------------------
>>> *From:* Jojy Varghese <[email protected]>
>>> *Sent:* Sunday, November 8, 2015 8:37 PM
>>> *To:* [email protected]
>>> *Subject:* Re: Mesos and Zookeeper TCP keepalive
>>>
>>> Hi Jeremy
>>> Are you trying to establish a keepalive timer between mesos master and
>>> mesos slave? If so, I don't believe it's possible today as the SO_KEEPALIVE
>>> option is not set on an accepting socket.
>>>
>>> -Jojy
>>>
>>> On Nov 8, 2015, at 8:43 AM, haosdent <[email protected]> wrote:
>>>
>>> I think the keepalive option should be set in Zookeeper, not in Mesos. See
>>> this related issue in Zookeeper.
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>>>
>>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <[email protected]>
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> We have been fighting some network/session disconnection issues between
>>>> datacenters and I'm curious if there is any way to enable tcp keepalive on
>>>> the zookeeper/mesos sockets? If there was a way, then the sysctl tcp
>>>> kernel settings would be used. I believe keepalive has to be enabled by the
>>>> software which is opening the connection. (That is my understanding anyway)
>>>>
>>>> Here is what I see via netstat --timers -tn:
>>>> tcp 0 0 172.18.1.1:55842 10.10.1.1:2181 ESTABLISHED off (0.00/0/0)
>>>> tcp 0 0 172.18.1.1:49702 10.10.1.1:5050 ESTABLISHED off (0.00/0/0)
>>>>
>>>>
>>>> Where 172 is the mesos-slave network and 10 is the mesos-master
>>>> network. The "off" keyword means that keepalives are not being sent.
>>>>
>>>> I've trolled through JIRA, git, etc and cannot easily determine if this
>>>> is expected behavior or should be an enhancement request. Any ideas?
>>>>
>>>> Thanks much!
>>>> -Jeremy
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>>
>>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>
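
For readers following the thread, here is a minimal sketch of what "enabling keepalive in the software that opens the connection" can look like at the socket level. It is written in Python for brevity and is not the Mesos/libprocess or ZooKeeper client code; the function names and the timer values are hypothetical, and only the one-hour firewall session timeout comes from the thread. Note that Linux's default `tcp_keepalive_time` is 7200 seconds, so with SO_KEEPALIVE alone the first probe would likely be sent after a one-hour firewall session had already expired; the sketch therefore also sets per-socket timers where the platform exposes them.

```python
import socket

# Only the one-hour firewall timeout comes from the thread; the other
# values are illustrative assumptions, not recommendations.
FIREWALL_SESSION_TIMEOUT_S = 3600
KEEPALIVE_IDLE_S = 600      # idle seconds before the first probe is sent
KEEPALIVE_INTERVAL_S = 60   # seconds between unanswered probes
KEEPALIVE_PROBES = 5        # unanswered probes before the connection is dropped


def enable_keepalive(sock, idle=KEEPALIVE_IDLE_S,
                     interval=KEEPALIVE_INTERVAL_S, probes=KEEPALIVE_PROBES):
    """Turn on TCP keepalive for one socket (hypothetical helper)."""
    # Without this option the kernel keepalive settings are never used,
    # which matches the "off" timer shown by netstat in the thread.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket timer overrides are platform-specific, so set each one
    # only if this Python build exposes the constant.
    overrides = (("TCP_KEEPIDLE", idle),
                 ("TCP_KEEPINTVL", interval),
                 ("TCP_KEEPCNT", probes))
    for name, value in overrides:
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)


def keepalive_enabled(sock):
    """Report whether SO_KEEPALIVE is currently set on the socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
before = keepalive_enabled(sock)   # state of a freshly created socket
enable_keepalive(sock)
after = keepalive_enabled(sock)    # state after opting in
sock.close()
```

The idle time is deliberately kept well below the firewall timeout so that a probe crosses the firewall before the session entry ages out. Whether this addresses the agent restarts described above is a separate question, since the thread itself suggests the firewall timer may be a red herring.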

