Hi Jeremy,
It's great that you are making progress, but I doubt this is what you intend 
to achieve, since network failures are a valid state in distributed systems. If 
you think there is a special case you are trying to solve, I suggest proposing 
a design document for review.
For the ZK client code, I would suggest asking the ZooKeeper mailing list.

thanks
-Jojy

> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> 
> Alright, great, I'm making some progress,
> 
> I did a simple copy/paste modification and recompiled Mesos. The keepalive 
> timer is now set from slave to master, so this is an improvement for me. I 
> haven't tested the other direction yet: 
> https://gist.github.com/jolexa/ee9e152aa7045c558e02 
> I'd like to file an enhancement request for this after some real-world 
> testing, since it seems like an improvement for other people as well.
> 
> I'm having a harder time figuring out the zk client code. I started by 
> modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c, but either a) my 
> change wasn't correct or b) I'm modifying the wrong file, since I just assumed 
> the C client is used. Is this the correct place?
> 
> Thanks much,
> Jeremy
> 
> 
> From: Jojy Varghese <j...@mesosphere.io>
> Sent: Monday, November 9, 2015 2:09 PM
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>  
> Hi Jeremy
>  The "network" code is at "3rdparty/libprocess/include/process/network.hpp" and 
> "3rdparty/libprocess/src/poll_socket.hpp/cpp". 
> 
> thanks
> jojy
> 
> 
>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> 
>> Hi all,
>> 
>> Jojy, that is correct, but more specifically a keepalive timer from slave to 
>> master and slave to zookeeper. Can you send a link to the portion of the 
>> code that builds the socket/connection? Is there any reason to not set the 
>> SO_KEEPALIVE option in your opinion?
>> 
>> haosdent, I'm not looking for keepalive between zk quorum members, which is 
>> what the ZOOKEEPER JIRA references.
>> 
>> Thanks,
>> Jeremy
>> 
>> 
>> From: Jojy Varghese <j...@mesosphere.io>
>> Sent: Sunday, November 8, 2015 8:37 PM
>> To: user@mesos.apache.org
>> Subject: Re: Mesos and Zookeeper TCP keepalive
>>  
>> Hi Jeremy
>>   Are you trying to establish a keepalive timer between mesos master and 
>> mesos slave? If so, I don’t believe it’s possible today, as the SO_KEEPALIVE 
>> option is not set on an accepting socket. 
>> 
>> -Jojy
>> 
>>> On Nov 8, 2015, at 8:43 AM, haosdent <haosd...@gmail.com> wrote:
>>> 
>>> I think the keepalive option should be set in ZooKeeper, not in Mesos. See 
>>> this related issue in ZooKeeper: 
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>>> 
>>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>>> Hello all,
>>> 
>>> We have been fighting some network/session disconnection issues between 
>>> datacenters, and I'm curious whether there is any way to enable TCP 
>>> keepalive on the zookeeper/mesos sockets. If there were a way, the sysctl 
>>> TCP kernel settings would then be used. I believe keepalive has to be 
>>> enabled by the software that opens the connection. (That is my 
>>> understanding, anyway.)
>>> 
>>> Here is what I see via netstat --timers -tn:
>>> tcp        0      0 172.18.1.1:55842      10.10.1.1:2181      ESTABLISHED off (0.00/0/0)
>>> tcp        0      0 172.18.1.1:49702      10.10.1.1:5050      ESTABLISHED off (0.00/0/0)
>>> 
>>> 
>>> Where 172 is the mesos-slave network and 10 is the mesos-master network. 
>>> The "off" keyword means that keepalives are not being sent.
>>> 
>>> I've trolled through JIRA, git, etc and cannot easily determine if this is 
>>> expected behavior or should be an enhancement request. Any ideas?
>>> 
>>> Thanks much!
>>> -Jeremy
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards,
>>> Haosdent Huang
