Re: Zookeeper cluster changes

2015-11-13 Thread Donald Laidlaw
[Message body truncated in the archive; the surviving text was quoted copies of the earlier messages in this thread.]



Re: Zookeeper cluster changes

2015-11-11 Thread Joseph Smith
[Message body truncated in the archive; the surviving text was quoted copies of the earlier messages in this thread.]



Re: Zookeeper cluster changes

2015-11-10 Thread Donald Laidlaw
I agree, you want to apply the changes gradually so as not to lose a quorum. 
The problem is automating this so that it happens in a lights-out environment, 
in the cloud, without some poor slob's pager going off in the middle of the 
night :)

While health checks can detect and replace a dead server reliably on any number 
of clouds, the new server comes up with a new IP address. That server can 
reliably join the zookeeper ensemble. However, it is tough to automate the 
rolling restart of the other mesos servers, both masters and slaves, that needs 
to occur to keep them happy. 

One thing I have not tried is to just ignore the change, and use something to 
detect the current zookeeper servers just prior to starting mesos. If mesos 
truly fails fast when it loses a zookeeper connection, then maybe it doesn’t 
care that it was started with an out-of-date list of zookeeper servers.

What do mesos-master and mesos-slave do with the list of zookeeper servers they 
are given? Just try them in order until one works, then use that one until it 
fails? If so, and mesos fails fast, then letting it continue to run with a 
stale list will have no ill effects. Or does it keep retrying the servers in 
the list when a connection fails? 

Don Laidlaw





Re: Zookeeper cluster changes

2015-11-10 Thread Erik Weathers
Keep in mind that mesos is designed to "fail fast".  So when there are
problems (such as losing connectivity to the resolved ZooKeeper IP) the
daemon(s) (master & slave) die.

Due to this design, we are all supposed to run the mesos daemons under
"supervision", which means auto-restart after they crash.  This can be done
with monit/god/runit/etc.

So, to perform maintenance on ZooKeeper, I would first ensure the
mesos-master processes are running under "supervision" so that they restart
quickly after a ZK connectivity failure occurs.  Then proceed with standard
ZooKeeper maintenance (exhibitor-based or manual), pausing between downing
of ZK servers to ensure you have "enough" mesos-master processes running.
 (I *would* say "pause until you have a quorum of mesos-masters up," but
if you only have 2 of 3 up and then take down the ZK server that the leader
is connected to, that would be temporarily bad.  So I'd make sure they're
all up.)
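
A minimal sketch of the auto-restart behavior that monit/god/runit provide (the real tools add backoff policies, logging, and alerting; the mesos-master command and flags in the comment are assumptions):

```python
import subprocess
import time

def supervise(cmd, max_restarts=None, backoff=2.0, run=subprocess.call):
    """Re-run `cmd` every time it exits -- the "supervision" described
    above.  `max_restarts` bounds the loop so it can be tested; a real
    supervisor runs forever."""
    runs = 0
    while True:
        code = run(cmd)  # blocks until the daemon exits (i.e., crashes)
        runs += 1
        if max_restarts is not None and runs >= max_restarts:
            return runs, code
        time.sleep(backoff)  # brief pause before restarting

# e.g. supervise(["/usr/sbin/mesos-master", "--zk=zk://...", "--quorum=2"])
```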

- Erik



Zookeeper cluster changes

2015-11-09 Thread Donald Laidlaw
How do mesos masters and slaves react to zookeeper cluster changes? When the 
masters and slaves start they are given a set of addresses to connect to 
zookeeper. But over time, one of those zookeepers fails, and is replaced by a 
new server at a new address. How should this be handled in the mesos servers?

I am guessing that mesos does not automatically detect and react to that 
change. But obviously we should do something to keep the mesos servers happy as 
well. What should we do?

The obvious thing is to stop the mesos servers, one at a time, and restart them 
with the new configuration. But it would be really nice to be able to do this 
dynamically without restarting the server. After all, coordinating a rolling 
restart is a fairly hard job.
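
Coordinating that rolling restart amounts to a loop like the following sketch. Here `restart` and `healthy` stand in for whatever mechanism you have (ssh plus an init script, an HTTP health endpoint, ...); they are assumptions for illustration, not Mesos APIs.

```python
import time

def rolling_restart(hosts, restart, healthy, max_wait=60, interval=1.0):
    """Restart hosts one at a time, waiting for each to report healthy
    before touching the next, so a quorum stays up throughout."""
    for host in hosts:
        restart(host)
        deadline = time.time() + max_wait
        while not healthy(host):
            if time.time() > deadline:
                raise RuntimeError("%s did not become healthy in time" % host)
            time.sleep(interval)
    return hosts
```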

Any suggestions or pointers?

Best regards,
Don Laidlaw




Re: Zookeeper cluster changes

2015-11-09 Thread Donald Laidlaw
Yeah, I know about Exhibitor and how it handles zookeeper ensemble changes.

My question was about how to handle the Mesos servers.

What do you have to do with Mesos, when the zookeeper ensemble changes, to keep 
the mesos servers happy and healthy? 

Don Laidlaw
866 Cobequid Rd.
Lower Sackville, NS
B4C 4E1
Phone: 1-902-576-5185
Cell: 1-902-401-6771



Re: Zookeeper cluster changes

2015-11-09 Thread tommy xiao
Good news: Netflix released a tool that can do this:
https://github.com/Netflix/exhibitor/wiki/Rolling-Ensemble-Change

Give it a try.


-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com


Re: Zookeeper cluster changes

2015-11-09 Thread Marco Massenzio
The way I would do it in a production cluster would be *not* to use
IP addresses directly for the ZK ensemble, but instead rely on some form of
internal DNS and use internally-resolvable hostnames (eg, {zk1, zk2,
...}.prod.example.com etc) and have the provisioning tooling (Chef, Puppet,
Ansible, what have you) handle the setting of the hostname when
restarting/replacing a failing/crashed ZK server.

This way your list of zk's to Mesos never changes, even though the FQDNs

Obviously, this may not be always desirable / feasible (eg, if your prod
environment does not support DNS resolution).

You are correct in that Mesos does not currently support dynamically
changing the ZK's addresses, but I don't know whether that's a limitation
of Mesos code or of the ZK C++ client driver.
I'll look into it and let you know what I find (if anything).
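
To make the point concrete, here is a toy sketch; the hostnames and addresses are invented, and the dict stands in for the internal DNS records that the provisioning tooling would actually manage.

```python
# The connection string handed to Mesos is built from stable names.
ZK_URL = ("zk://zk1.prod.example.com:2181,"
          "zk2.prod.example.com:2181,"
          "zk3.prod.example.com:2181/mesos")

# Stand-in for internal DNS records managed by Chef/Puppet/Ansible.
dns = {
    "zk1.prod.example.com": "10.0.0.11",
    "zk2.prod.example.com": "10.0.0.12",
    "zk3.prod.example.com": "10.0.0.13",
}

# zk2's VM crashes and is replaced at a new address: only DNS changes.
dns["zk2.prod.example.com"] = "10.0.9.42"

# The URL Mesos was configured with is untouched.
assert "zk2.prod.example.com" in ZK_URL
assert "10.0.9.42" not in ZK_URL
```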

--
*Marco Massenzio*
Distributed Systems Engineer
http://codetrips.com
