Hi Jeremy,

Can you read the description of these
<https://github.com/apache/mesos/blob/249bc26306574d9db0527c04b7a83a1f1e75f71b/src/master/flags.cpp#L393-L422>
parameters on the master, and possibly share your values for these flags?

It seems from the re-registration attempt on the agent, that the master has
already treated the agent as "failed", and so will tell it to shut down on
any re-registration attempt.

I'm curious if there is a conflict (or too narrow of a time gap) of
timeouts in your environment to allow re-registration by the agent after
the agent notices it needs to re-establish the connection.
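
To make the "time gap" concrete, here is a rough sketch of the arithmetic I have in mind. It assumes the linked flags are `--slave_ping_timeout` and `--max_slave_ping_timeouts`, and that their defaults are 15secs and 5; please check both against your version. The function names and the agent reconnect times are made up for illustration.

```python
def master_failure_window_s(ping_timeout_s, max_ping_timeouts):
    """Seconds of unanswered pings after which the master gives up on an agent."""
    return ping_timeout_s * max_ping_timeouts


def can_reregister_in_time(agent_reconnect_s, ping_timeout_s, max_ping_timeouts):
    """True if the agent is back before the master has treated it as failed."""
    window = master_failure_window_s(ping_timeout_s, max_ping_timeouts)
    return agent_reconnect_s < window


# Assumed defaults (15 s ping timeout, 5 missed pings); verify for your setup.
PING_TIMEOUT_S = 15
MAX_PING_TIMEOUTS = 5

# Hypothetical times the agent needs to notice the broken link and reconnect.
for reconnect_s in (60, 90):
    ok = can_reregister_in_time(reconnect_s, PING_TIMEOUT_S, MAX_PING_TIMEOUTS)
    print(reconnect_s, "s ->", "re-registers" if ok else "told to shut down")
```

If the agent needs longer than that window to notice the dead connection, the master will already have given up on it by the time it tries to re-register.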

—
*Joris Van Remoortere*
Mesosphere

On Tue, Nov 10, 2015 at 5:02 AM, Jeremy Olexa <[email protected]>
wrote:

> Hi Tommy, Erik, all,
>
>
> You are correct in your assumption that I'm trying to solve for a one hour
> session expire time on a firewall. For some more background info, our
> master cluster is in datacenter X, the slaves in X will stay "up" for days
> and days. The slaves in a different datacenter, Y, connected to that master
> cluster will stay "up" for a few days and then restart. The master cluster
> is healthy, with a stable leader for months (no flapping), same for the ZK
> "leader". There are about 35 slaves in datacenter Y. Maybe the firewall
> session timer is a red herring because the slave restart is seemingly
> random (the slave with the highest uptime is 6 days, but a handful only
> have uptime of a day)
>
>
> I started debugging this a while ago, and the gist of the logs is here:
> https://gist.github.com/jolexa/1a80e26a4b017846d083 I posted this back
> in October seeking help and Benjamin suggested network issues in both
> directions, so I suspected the firewall.
>
>
> Thanks for any hints,
>
> Jeremy
>
> ------------------------------
> *From:* tommy xiao <[email protected]>
> *Sent:* Tuesday, November 10, 2015 3:07 AM
>
> *To:* [email protected]
> *Subject:* Re: Mesos and Zookeeper TCP keepalive
>
> Same here, same question as Erik. Could you please provide more
> background info? Thanks.
>
> 2015-11-10 15:56 GMT+08:00 Erik Weathers <[email protected]>:
>
>> It would really help if you (Jeremy) explained the *actual* problem you
>> are facing.  I'm *guessing* that it's a firewall timing out the sessions
>> because there isn't activity on them for whatever the timeout of the
>> firewall is?   It seems likely to be unreasonably short, given that mesos
>> has constant activity between master and
>> slave/agent/whatever-it-is-being-called-nowadays-but-not-really-yet-maybe-someday-for-reals.
>>
>> - Erik
>>
>> On Mon, Nov 9, 2015 at 10:00 PM, Jojy Varghese <[email protected]>
>> wrote:
>>
>>> Hi Jeremy
>>> It's great that you are making progress, but I doubt this is what you
>>> intend to achieve, since network failures are a valid state in distributed
>>> systems. If you think there is a special case you are trying to solve, I
>>> suggest proposing a design document for review.
>>> For the ZK client code, I would suggest asking the zookeeper mailing list.
>>>
>>> thanks
>>> -Jojy
>>>
>>> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <[email protected]> wrote:
>>>
>>> Alright, great, I'm making some progress,
>>>
>>> I did a simple copy/paste modification and recompiled mesos. The
>>> keepalive timer is set from slave to master so this is an improvement for
>>> me. I didn't test the other direction yet -
>>> https://gist.github.com/jolexa/ee9e152aa7045c558e02 - I'd like to file
>>> an enhancement request for this since it seems like an improvement for
>>> other people as well, after some real world testing
>>>
>>> I'm having a harder time figuring out the zk client code. I started
>>> by modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c but either a)
>>> my change wasn't correct or b) I'm modifying the wrong file, since I
>>> just assumed the C client is what gets used. Is this the correct place?
>>>
>>> Thanks much,
>>> Jeremy
>>>
>>>
>>> ------------------------------
>>> *From:* Jojy Varghese <[email protected]>
>>> *Sent:* Monday, November 9, 2015 2:09 PM
>>> *To:* [email protected]
>>> *Subject:* Re: Mesos and Zookeeper TCP keepalive
>>>
>>> Hi Jeremy
>>> The "network" code is at
>>> "3rdparty/libprocess/include/process/network.hpp",
>>> "3rdparty/libprocess/src/poll_socket.hpp/cpp".
>>>
>>> thanks
>>> jojy
>>>
>>>
>>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> Jojy, That is correct, but more specifically a keepalive timer from
>>> slave to master and slave to zookeeper. Can you send a link to the portion
>>> of the code that builds the socket/connection? Is there any reason to not
>>> set the SO_KEEPALIVE option in your opinion?
>>>
>>> haosdent, I'm not looking for keepalive between zk quorum members, like
>>> the ZOOKEEPER JIRA is referencing.
>>>
>>> Thanks,
>>> Jeremy
>>>
>>>
>>> ------------------------------
>>> *From:* Jojy Varghese <[email protected]>
>>> *Sent:* Sunday, November 8, 2015 8:37 PM
>>> *To:* [email protected]
>>> *Subject:* Re: Mesos and Zookeeper TCP keepalive
>>>
>>> Hi Jeremy
>>> Are you trying to establish a keepalive timer between mesos master and
>>> mesos slave? If so, I don't believe it's possible today, as the
>>> SO_KEEPALIVE option is not set on an accepting socket.
>>>
>>> -Jojy
>>>
>>> On Nov 8, 2015, at 8:43 AM, haosdent <[email protected]> wrote:
>>>
>>> I think the keepalive option should be set in Zookeeper, not in Mesos.
>>> See this related issue in Zookeeper.
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>>>
>>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <[email protected]>
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> We have been fighting some network/session disconnection issues between
>>>> datacenters and I'm curious if there is any way to enable TCP keepalive on
>>>> the zookeeper/mesos sockets? If there was a way, then the sysctl tcp
>>>> kernel settings would be used. I believe keepalive has to be enabled by the
>>>> software which is opening the connection. (That is my understanding anyway)
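
As a rough illustration of what "enabled by the software" means: the program has to call setsockopt with SO_KEEPALIVE on its own socket, and the kernel then applies the net.ipv4.tcp_keepalive_* sysctls unless per-socket values override them. The sketch below is Python for brevity, not the Mesos or ZooKeeper code; `enable_keepalive` and the 600/60/5 values are made up. Note that the Linux default for tcp_keepalive_time is 7200 seconds, so with a one-hour firewall session timer the idle time would also have to be lowered, either via sysctl or per socket as shown here.

```python
import socket


def enable_keepalive(sock, idle_s=600, interval_s=60, probes=5):
    """Turn on TCP keepalive for sock and override kernel timers where supported."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket overrides are platform-specific (all three exist on Linux);
    # where a constant is missing, the system-wide default stays in effect.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
# A non-zero value means the kernel will send keepalive probes on this socket.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

Once a connection is opened this way, netstat --timers should report a "keepalive" timer for it instead of "off".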
>>>>
>>>> Here is what I see via netstat --timers -tn:
>>>> tcp        0      0 172.18.1.1:55842      10.10.1.1:2181      ESTABLISHED off (0.00/0/0)
>>>> tcp        0      0 172.18.1.1:49702      10.10.1.1:5050      ESTABLISHED off (0.00/0/0)
>>>>
>>>>
>>>> Where 172 is the mesos-slave network and 10 is the mesos-master
>>>> network. The "off" keyword means that keepalives are not being sent.
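
For checking this across many slaves, a small sketch along these lines could pick out the affected connections from saved netstat --timers -tn output. It assumes the column layout shown above (proto, recv-q, send-q, local, remote, state, timer); the function name is made up for illustration.

```python
def connections_without_keepalive(netstat_output):
    """Return (local, remote) pairs for ESTABLISHED sockets whose timer is 'off'."""
    result = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto recv-q send-q local remote state timer (values)
        if (len(fields) >= 7 and fields[0].startswith("tcp")
                and fields[5] == "ESTABLISHED" and fields[6] == "off"):
            result.append((fields[3], fields[4]))
    return result


sample = """\
tcp        0      0 172.18.1.1:55842      10.10.1.1:2181      ESTABLISHED off (0.00/0/0)
tcp        0      0 172.18.1.1:49702      10.10.1.1:5050      ESTABLISHED off (0.00/0/0)
"""

for local, remote in connections_without_keepalive(sample):
    print(local, "->", remote, "has no keepalive timer")
```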
>>>>
>>>> I've trawled through JIRA, git, etc. and cannot easily determine if this
>>>> is expected behavior or should be an enhancement request. Any ideas?
>>>>
>>>> Thanks much!
>>>> -Jeremy
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>>
>>>
>>>
>>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>
