Hi Jeremy,

It's great that you are making progress, but I doubt this is what you intend to achieve, since network failures are a valid state in distributed systems. If you think there is a special case you are trying to solve, I suggest proposing a design document for review. For the ZK client code, I would suggest asking the ZooKeeper mailing list.

thanks
-Jojy

> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>
> Alright, great, I'm making some progress,
>
> I did a simple copy/paste modification and recompiled mesos. The keepalive
> timer is set from slave to master so this is an improvement for me. I didn't
> test the other direction yet -
> https://gist.github.com/jolexa/ee9e152aa7045c558e02 - I'd like to file an
> enhancement request for this since it seems like an improvement for other
> people as well, after some real world testing
>
> I'm having a harder time figuring out the zk client code. I started by
> modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c but either a) my
> change wasn't correct or b) I'm modifying the wrong file, since I just assumed
> using the c client. Is this the correct place?
>
> Thanks much,
> Jeremy
>
>
> From: Jojy Varghese <j...@mesosphere.io>
> Sent: Monday, November 9, 2015 2:09 PM
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>
> Hi Jeremy
> The "network" code is at "3rdparty/libprocess/include/process/network.hpp",
> "3rdparty/libprocess/src/poll_socket.hpp/cpp".
>
> thanks
> jojy
>
>
>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>>
>> Hi all,
>>
>> Jojy, That is correct, but more specifically a keepalive timer from slave to
>> master and slave to zookeeper. Can you send a link to the portion of the
>> code that builds the socket/connection? Is there any reason to not set the
>> SO_KEEPALIVE option in your opinion?
>>
>> haosdent, I'm not looking for keepalive between zk quorum members, like the
>> ZOOKEEPER JIRA is referencing.
>>
>> Thanks,
>> Jeremy
>>
>>
>> From: Jojy Varghese <j...@mesosphere.io>
>> Sent: Sunday, November 8, 2015 8:37 PM
>> To: user@mesos.apache.org
>> Subject: Re: Mesos and Zookeeper TCP keepalive
>>
>> Hi Jeremy
>> Are you trying to establish a keepalive timer between mesos master and
>> mesos slave? If so, I don't believe it's possible today as the SO_KEEPALIVE
>> option is not set on an accepting socket.
>>
>> -Jojy
>>
>>> On Nov 8, 2015, at 8:43 AM, haosdent <haosd...@gmail.com> wrote:
>>>
>>> I think keepalive option should be set in Zookeeper, not in Mesos. See this
>>> related issue in Zookeeper.
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>>>
>>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>>> Hello all,
>>>
>>> We have been fighting some network/session disconnection issues between
>>> datacenters and I'm curious if there is any way to enable tcp keepalive on
>>> the zookeeper/mesos sockets? If there was a way, then the sysctl tcp kernel
>>> settings would be used. I believe keepalive has to be enabled by the
>>> software which is opening the connection. (That is my understanding anyway)
>>>
>>> Here is what I see via netstat --timers -tn:
>>> tcp        0      0 172.18.1.1:55842    10.10.1.1:2181    ESTABLISHED off (0.00/0/0)
>>> tcp        0      0 172.18.1.1:49702    10.10.1.1:5050    ESTABLISHED off (0.00/0/0)
>>>
>>>
>>> Where 172 is the mesos-slave network and 10 is the mesos-master network.
>>> The "off" keyword means that keepalives are not being sent.
>>>
>>> I've trolled through JIRA, git, etc and cannot easily determine if this is
>>> expected behavior or should be an enhancement request. Any ideas?
>>>
>>> Thanks much!
>>> -Jeremy
>>>
>>>
>>> --
>>> Best Regards,
>>> Haosdent Huang
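
For readers following the thread, here is a minimal sketch of what "enabling keepalive in the software that opens the connection" means at the socket level. It is written in Python for brevity and is not the Mesos, libprocess, or ZooKeeper client code; the helper names `enable_keepalive` and `keepalive_enabled` and the timer values are made up for illustration. SO_KEEPALIVE is a standard per-socket option that is off by default. The TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT overrides are platform-specific, so the sketch only sets them where the platform exposes them; without them the kernel sysctl defaults apply.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive for one socket (hypothetical helper, not Mesos code)."""
    # SO_KEEPALIVE is per-socket and off by default; the application must opt in.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Optional per-socket overrides of the kernel defaults. These constants
    # are not available on every platform, so each one is guarded.
    # The values passed here are arbitrary example numbers, in seconds/counts.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


def keepalive_enabled(sock):
    """Return True if SO_KEEPALIVE is currently set on the socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("before:", keepalive_enabled(s))
    enable_keepalive(s)
    print("after:", keepalive_enabled(s))
    s.close()
```

The equivalent change in C or C++ code would be a `setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, ...)` call on the connecting socket. Once the option is set on an established connection, the timer column of `netstat --timers -tn` should report a keepalive timer rather than `off`.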