Re: zookeeper on ec2

2009-09-01 Thread Patrick Hunt

What is your client timeout? It may be too low.

also see this section on handling recoverable errors:
http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling

connection loss in particular needs special care since: "When a ZooKeeper 
client loses a connection to the ZooKeeper server there may be some requests 
in flight; we don't know where they were in their flight at the time of the 
connection loss."
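
For illustration, a minimal sketch of that retry pattern against the Java 
client (the helper name, retry limit, and backoff are made up for the 
example, and only an idempotent read is retried blindly here):

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class RetryExample {
        // Hypothetical helper: retries zk.exists() across connection loss.
        static Stat existsWithRetry(ZooKeeper zk, String path) throws Exception {
            int attempts = 0;
            while (true) {
                try {
                    return zk.exists(path, false);
                } catch (KeeperException.ConnectionLossException e) {
                    // Recoverable: the request may or may not have reached the
                    // server, but exists() is read-only, so retrying is safe.
                    if (++attempts >= 5) throw e;
                    Thread.sleep(1000L * attempts);   // crude backoff
                } catch (KeeperException.SessionExpiredException e) {
                    // Unrecoverable for this handle: the caller must create a
                    // new ZooKeeper instance and re-establish ephemeral state.
                    throw e;
                }
            }
        }
    }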


Patrick

Satish Bhatti wrote:

I have recently started running on EC2 and am seeing quite a few
ConnectionLoss exceptions.  Should I just catch these and retry, since I
assume that eventually, if the shit truly hits the fan, I will get a
SessionExpired?
Satish

On Mon, Jul 6, 2009 at 11:35 AM, Ted Dunning ted.dunn...@gmail.com wrote:


We have used EC2 quite a bit for ZK.

The basic lessons that I have learned include:

a) EC2's biggest advantage after scaling and elasticity was conformity of
configuration.  Since you are bringing machines up and down all the time,
they begin to act more like programs and you wind up with boot scripts that
give you a very predictable environment.  Nice.

b) EC2 interconnect has a lot more going on than in a dedicated VLAN.  That
can make the ZK servers appear a bit less connected.  You have to plan for
ConnectionLoss events.

c) for highest reliability, I switched to large instances.  On reflection, I
think that was helpful, but less important than I thought at the time.

d) increasing and decreasing cluster size is nearly painless and is easily
scriptable.  To decrease, do a rolling update on the survivors to update
their configuration.  Then take down the instance you want to lose.  To
increase, do a rolling update starting with the new instances to update the
configuration to include all of the machines.  The rolling update should
bounce each ZK with several seconds between each bounce.  Rescaling the
cluster takes less than a minute which makes it comparable to EC2 instance
boot time (about 30 seconds for the Alestic ubuntu instance that we used
plus about 20 seconds for additional configuration).
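
For reference, the piece of configuration that the rolling update edits is the
server list in zoo.cfg; a sketch of what it might look like after growing from
three to five servers (hostnames, ports and paths are placeholders, and each
server also needs a matching myid file in its dataDir):

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/zookeeper
    clientPort=2181
    server.1=zk1.internal:2888:3888
    server.2=zk2.internal:2888:3888
    server.3=zk3.internal:2888:3888
    server.4=zk4.internal:2888:3888
    server.5=zk5.internal:2888:3888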

On Mon, Jul 6, 2009 at 4:45 AM, David Graf david.g...@28msec.com wrote:


Hello

I want to set up a zookeeper ensemble on amazon's ec2 service. In my system,
zookeeper is used to run a locking service and to generate unique ids.
Currently, for testing purposes, I am only running one instance. Now, I need
to set up an ensemble to protect my system against crashes.
The ec2 service has some differences from a normal server farm. E.g. the data
saved on the file system of an ec2 instance is lost if the instance crashes.
In the documentation of zookeeper, I have read that zookeeper saves snapshots
of the in-memory data in the file system. Is that needed for recovery?
Logically, it would be much easier for me if this is not the case.
Additionally, ec2 brings the advantage that servers can be switched on and off
dynamically dependent on the load, traffic, etc. Can this advantage be
utilized for a zookeeper ensemble? Is it possible to add a zookeeper server
dynamically to an ensemble? E.g. dependent on the in-memory load?

David





Re: zookeeper on ec2

2009-09-01 Thread Mahadev Konar
Hi Satish,

  Connectionloss is a little trickier than just retrying blindly. Please
read the following sections on this -

http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling

And the programmers guide:

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperProgrammers.html

to learn more about how to handle CONNECTIONLOSS. The idea is that blindly
retrying on CONNECTIONLOSS can create problems, since CONNECTIONLOSS does NOT
necessarily mean that the zookeeper operation you were executing failed to
execute. It is possible that the operation actually went through on the
servers.
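
As a sketch of what more careful handling can look like with the Java client
(the marker-payload trick, the paths, and the helper name are only
illustrative; sequential creates need a different disambiguation strategy):

    import java.util.Arrays;
    import java.util.UUID;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SafeCreate {
        // After a ConnectionLoss the create() may already have succeeded on
        // the servers, so a blind retry can hit NodeExists on our own node.
        static void createWithRetry(ZooKeeper zk, String path) throws Exception {
            byte[] marker = UUID.randomUUID().toString().getBytes("UTF-8");
            while (true) {
                try {
                    zk.create(path, marker, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                              CreateMode.PERSISTENT);
                    return;                            // definitely created by us
                } catch (KeeperException.ConnectionLossException e) {
                    // Unknown outcome: loop and retry, then disambiguate below.
                } catch (KeeperException.NodeExistsException e) {
                    // The node exists: check whether it carries our marker,
                    // i.e. whether the "lost" attempt actually went through.
                    byte[] data = zk.getData(path, false, null);
                    if (Arrays.equals(data, marker)) return;  // it was ours
                    throw e;                           // someone else's node
                }
            }
        }
    }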

Since this has been a constant source of confusion for everyone who starts
using zookeeper, we are working on a fix (ZOOKEEPER-22) which will take care
of this problem so that programmers will not have to worry about
CONNECTIONLOSS handling.

Thanks
mahadev







Re: zookeeper on ec2

2009-09-01 Thread Patrick Hunt
I'm not very familiar with the ec2 environment. Are you doing any 
monitoring? In particular, network connectivity between nodes? It sounds like 
networking issues between nodes. (I'm assuming you've also looked at material 
like http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting and verified 
that you are not swapping, checked GC pressure, etc.)


Patrick

Satish Bhatti wrote:

Session timeout is 30 seconds.







Re: zookeeper on ec2

2009-09-01 Thread Satish Bhatti
For my initial testing I am running with a single ZooKeeper server, i.e. the
ensemble only has one server.  Not sure if this is exacerbating the problem?
I will check out the troubleshooting link you sent me.





Re: zookeeper on ec2

2009-09-01 Thread Patrick Hunt
Depends on what your tests are. Are they pretty simple/light? Then it's 
probably a network issue. Heavy load testing? Then it might be the 
server/client, might be the network.


The easiest thing is to run a ping test while running your zk test and see 
if pings are getting through (and with what latency). You should also review 
your client/server logs for any information logged around the ConnectionLoss.


Ted Dunning would be a good resource; he runs ZK inside ec2 and has 
a lot of experience with it.


Patrick








Re: zookeeper on ec2

2009-09-01 Thread Ted Dunning
Can you enable verboseGC and look at the tenuring distribution and times for
GC?
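
For reference, the kind of HotSpot flags that produce that output (the log
path is a placeholder; these are standard JDK 6-era options, not anything
specific from this thread):

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
    -XX:+PrintTenuringDistribution -Xloggc:/var/log/zk-gc.log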



On Tue, Sep 1, 2009 at 5:54 PM, Satish Bhatti cthd2...@gmail.com wrote:

 Parallel/Serial.
 inf...@domu-12-31-39-06-3d-d1:/opt/ir/agent/infact-installs/aaa/infact$ iostat
 Linux 2.6.18-xenU-ec2-v1.0 (domU-12-31-39-06-3D-D1)   09/01/2009   _x86_64_

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           66.11    0.00    1.54    2.96   20.30    9.08

 Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
 sda2            460.83       410.02     12458.18   40499322 1230554928
 sdc               0.00         0.00         0.00         96          0
 sda1              0.53         5.01         4.89     495338     482592



 On Tue, Sep 1, 2009 at 5:46 PM, Mahadev Konar maha...@yahoo-inc.com
 wrote:

  Hi Satish,
   what GC are you using? Is it ConcurrentMarkSweep or Parallel/Serial?
 
   Also, how is your disk usage on this machine? Can you check your iostat
  numbers?
 
  Thanks
  mahadev
 
 
  On 9/1/09 5:15 PM, Satish Bhatti cthd2...@gmail.com wrote:
 
   GC Time: 11.628 seconds on PS MarkSweep (389 collections), 5 minutes on PS
   Scavenge (7,636 collections)
  
   It's been running for about 48 hours.
  
  
   On Tue, Sep 1, 2009 at 5:12 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:
  
   Do you have long GC delays?
  

Re: zookeeper on ec2

2009-07-07 Thread Patrick Hunt

Henry Robinson wrote:

Effectively, EC2 does not introduce any new failure modes but potentially
exacerbates some existing ones. If a majority of EC2 nodes fail (in the
sense that their hard drive images cannot be recovered), there is no way to
restart the cluster, and persistence is lost. As you say, this is highly
unlikely. If, for some reason, the quorums are set such that only a single
node failure could bring down the quorum (bad design, but plausible), this
failure is more likely.


This is not strictly true. The cluster cannot recover _automatically_ if 
failures > n, where ensemble size is 2n+1 (e.g. a 5-server ensemble cannot 
automatically survive losing 3 servers). However you can recover manually as 
long as at least 1 snap and its trailing logs can be recovered. We can even 
recover if the latest snapshots are corrupted, as long as we can recover a 
snap from some previous time t and all logs subsequent to t.
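
For concreteness, a hypothetical surviving server's data directory in
ZooKeeper's snapshot.<zxid>/log.<zxid> naming (the file names below are made
up); any one snapshot plus the logs that follow it is enough for a manual
restore:

    dataDir/version-2/
        snapshot.300000a70
        log.300000a71
        snapshot.30000d21c
        log.30000d21d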






Re: zookeeper on ec2

2009-07-06 Thread Gustavo Niemeyer
Hi Ted,

 b) EC2 interconnect has a lot more going on than in a dedicated VLAN.  That
 can make the ZK servers appear a bit less connected.  You have to plan for
 ConnectionLoss events.

Interesting.

 c) for highest reliability, I switched to large instances.  On reflection, I
 think that was helpful, but less important than I thought at the time.

Besides the fact that there are more resources for ZooKeeper, this
likely helps as well because it reduces the number of systems
competing for the real hardware.

 d) increasing and decreasing cluster size is nearly painless and is easily
 scriptable.  To decrease, do a rolling update on the survivors to update
(...)

Quite interesting indeed.  I guess the work that Henry is pushing on
these couple of JIRA tickets will greatly facilitate this.

Do you mind if I ask you a couple of questions on this:

Do you have any kind of performance data about how much load ZK can
take under this environment?

Have you tried to put the log and snapshot files under EBS?

-- 
Gustavo Niemeyer
http://niemeyer.net


Re: zookeeper on ec2

2009-07-06 Thread Evan Jones

On Jul 6, 2009, at 15:40 , Henry Robinson wrote:

This is an interesting way of doing things. It seems like there is a
correctness issue: if a majority of servers fail, with the remaining
minority lagging the leader for some reason, won't the ensemble's current
state be forever lost? This is akin to a majority of servers failing and
never recovering. ZK relies on the eventual liveness of a majority of its
servers; with EC2 it seems possible that that property might not be
satisfied.


I think you are absolutely correct. However, my understanding of EC2 failure
modes is that even though there is no guarantee that a particular instance's
disk will survive a failure, it is very possible to observe EC2 nodes that
fail temporarily (such as rebooting). In these cases, the instance's disk
typically does survive, and when it comes back it will have the same
contents. It is only in permanent EC2 failures that the disk is gone (e.g.
hardware failure, or Amazon decides to pull it for some other reason).

Thus, this looks a lot like running your own machines in your own data
center to me. Soft failures will recover, hardware failures won't. The only
difference is that if you were running the machines yourself, and you ran
into some weird issue where you had hardware failures across a majority of
your Zookeeper ensemble, you could physically move the disks to recover the
state. If this happens in EC2, you will have to do some sort of manual repair
where you forcibly restart Zookeeper using the state of one of the surviving
members. Some Zookeeper operations may be lost in this case.

However, we are talking about a situation that seems exceedingly rare. No
matter what kind of system you are running, serious non-recoverable failures
will happen, so I don't see this as an impediment to running Zookeeper or
other quorum systems in EC2.

That said, I haven't run enough EC2 instances for a long enough period of
time to observe any serious failures or recoveries. If anyone has more
detailed information, I would love to hear about it.


Evan Jones

--
Evan Jones
http://evanjones.ca/



Re: zookeeper on ec2

2009-07-06 Thread Ted Dunning
On Mon, Jul 6, 2009 at 12:58 PM, Gustavo Niemeyer gust...@niemeyer.net wrote:

  can make the ZK servers appear a bit less connected.  You have to plan
 for
  ConnectionLoss events.

 Interesting.


Note that most of these seem to be related to client issues, especially GC.
If you configure in such a way as to get long pauses, you will see
connection loss.  The default configuration for ZK is for a pretty short (5
seconds) timeout that is pretty easy to exceed with out-of-the-box GC params
on the client side.
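
For example, the Java client takes the session timeout directly in its
constructor; a sketch with a 30 second timeout (the connect string and the
no-op watcher are placeholders, and the server negotiates the final value
within its own min/max bounds):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ClientTimeout {
        public static void main(String[] args) throws Exception {
            // 30000 ms session timeout, so a single GC pause is less likely
            // to expire the session.
            ZooKeeper zk = new ZooKeeper("zk1.internal:2181", 30000,
                    new Watcher() {
                        public void process(WatchedEvent event) {
                            System.out.println("event: " + event);
                        }
                    });
            System.out.println("state: " + zk.getState());
            zk.close();
        }
    }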


  c) for highest reliability, I switched to large instances.  On
 reflection, I
   think that was helpful, but less important than I thought at the time.

 Besides the fact that there are more resources for ZooKeeper, this
 likely helps as well because it reduces the number of systems
 competing for the real hardware.


Yes, but I think that this is less significant than I expected.  Small
instances have pretty dedicated access to their core.  Disk contention is a
bit of an issue, but not much.


  d) increasing and decreasing cluster size is nearly painless and is
 easily
   scriptable.  To decrease, do a rolling update on the survivors to update
 (...)

 Quite interesting indeed.  I guess the work that Henry is pushing on
 these couple of JIRA tickets will greatly facilitate this.


Absolutely.  I was still very surprised at how small the pain is in the
current world.


 Do you have any kind of performance data about how much load ZK can
 take under this environment?


Only barely.  Partly with an eye toward system diagnostics, and partly to
cause ZK to have something to do, I reported a wide swath of data available
from /proc into ZK every few seconds for all of my servers.  This led to a
few dozen transactions per second and ultimately helped me discover and
understand some of the connection issues for clients.
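
A minimal sketch of that kind of reporter with the Java client (the connect
string, znode naming, interval, and the choice of /proc/loadavg are
illustrative, not Ted's actual setup; the ConnectionLoss handling discussed
earlier in the thread is omitted for brevity):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.net.InetAddress;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ProcReporter {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1.internal:2181", 30000,
                    new Watcher() {
                        public void process(WatchedEvent event) { }
                    });
            // One znode per host, directly under the root.
            String path = "/stats-" + InetAddress.getLocalHost().getHostName();
            while (true) {
                BufferedReader r = new BufferedReader(new FileReader("/proc/loadavg"));
                byte[] data = r.readLine().getBytes("UTF-8");
                r.close();
                try {
                    zk.setData(path, data, -1);        // -1 = any version
                } catch (KeeperException.NoNodeException e) {
                    zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                              CreateMode.PERSISTENT);
                }
                Thread.sleep(5000);                    // every few seconds
            }
        }
    }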

ZK seemed pretty darned stable through all of this.

The only instability that I saw was caused by excessive amounts of data in
ZK itself.  As I neared the (small) amount of memory I had allocated for ZK
use, I would see servers go into paroxysms of GC, but the cluster
functionality was impaired to a surprisingly small degree.

Have you tried to put the log and snapshot files under EBS?


No.  I considered it, but I wanted fewer moving parts rather than more.

Doing that would make the intricate and unlikely failure mode that Henry
asked about even less likely, but I don't know if it would increase or
decrease the probability of any kind of failure.

The observed failure modes for ZK in EC2 were completely dominated by our
(my) own failings (such as letting too much data accumulate).


Re: zookeeper on ec2

2009-07-06 Thread Gustavo Niemeyer
Hi again,

(...)
 ZK seemed pretty darned stable through all of this.

Sounds like a nice test, and it's great to hear that ZooKeeper works well there.

 The only instability that I saw was caused by excessive amounts of data in
 ZK itself.  As I neared the (small) amount of memory I had allocated for Zk
 use, I would see servers go into paroxysms of GC, but the cluster
 functionality was impaired to a very surprisingly small degree.

Cool, makes sense.

 No.  I considered it, but I wanted fewer moving parts rather than more.

 Doing that would make the intricate and unlikely failure mode that Henry
 asked about even less likely, but I don't know if it would increase or
 decrease the probability of any kind of failure.

Yeah, I guess it depends a bit on the system architecture too.  If the
system is designed in such a way that ZK is keeping track of
coordination data which must be resumed after a full stop of the
system, having it stored persistently would prevent the loss of important
information.  If ZK is really just coordinating ephemeral data
(e.g. locks), then if the whole system goes down, it's ok to just
allow it to start up again in an empty state.
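
A two-line sketch of that distinction in the Java API (paths are hypothetical
and the parent znodes are assumed to exist); the ephemeral node vanishes with
its session anyway, so losing it in a full-cluster wipe costs nothing, while
the persistent one is the kind of state you would actually miss:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class EphemeralVsPersistent {
        static void example(ZooKeeper zk) throws KeeperException, InterruptedException {
            // Coordination-only data: gone when the session ends.
            zk.create("/locks/job-42", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            // Durable data: this is what you would want back after a full stop.
            zk.create("/config/schema-version", "3".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
    }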

 The observed failure modes for ZK in EC2 were completely dominated by our
 (my) own failings (such as letting too much data accumulate).

Details always take a few iterations to get really right.

Thanks for this data Ted.

-- 
Gustavo Niemeyer
http://niemeyer.net


Re: zookeeper on ec2

2009-07-06 Thread Henry Robinson
On Mon, Jul 6, 2009 at 10:16 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 No.  This should not cause data loss.

 As soon as ZK cannot replicate changes to a majority of machines, it refuses
 to take any more changes.  This is old ground and is required for correctness
 in the face of network partition.  It is conceivable (barely) that *exactly*
 the minority that were behind were the survivors, but this is almost
 equivalent to a complete failure of the cluster choreographed in such a way
 that a few nodes come back from the dead just afterwards.  That could cause
 some completed transactions to disappear, but at this level of massive
 failure, we have the same issues with any cluster.


Effectively, EC2 does not introduce any new failure modes but potentially
exacerbates some existing ones. If a majority of EC2 nodes fail (in the
sense that their hard drive images cannot be recovered), there is no way to
restart the cluster, and persistence is lost. As you say, this is highly
unlikely. If, for some reason, the quorums are set such that only a single
node failure could bring down the quorum (bad design, but plausible), this
failure is more likely.

EC2 just ups the stakes - crash failures are now potentially more dangerous
(bugs, packet corruption, rack local hardware failures etc all could cause
crash failures). It is common to assume that, notwithstanding a significant
physical event that wipes a number of hard drives, writes that are written
stay written. This assumption is sometimes false given certain choices of
filesystem. EC2 just gives us a few more ways for that not to be true.

I think it's more possible than one might expect to have a lagging minority
left behind - say they are partitioned from the majority by a malfunctioning
switch. They might all be lagging already as a result. Care must be taken
not to bring up another follower on the minority side to make it a majority,
else there are split-brain issues as well as the possibility of lost
transactions. Again, not *too* likely to happen in the wild, but these
permanently running services have a nasty habit of exploring the edge
cases...



 To be explicit, you can cause any ZK cluster to back-track in time by doing
 the following:

...


 f) add new members of the cluster


Which is why care needs to be taken that the ensemble can't be expanded with
a current quorum. Dynamic membership doesn't save us when a majority fails -
the existence of a quorum is a liveness condition for ZK. To help with the
liveness issue we can sacrifice a little safety (see, e.g. vector clock
ordered timestamps in Dynamo), but I think that ZK is aimed at safety first,
liveness second. Not that you were advocating changing that, I'm just
articulating why correctness is extremely important from my perspective.

Henry




 At this point, you will have lost the transactions from (b), but I really,
 really am not going to worry about this happening either by plan or by
 accident.  Without steps (e) and (f), the cluster will tell you that it
 knows something is wrong and that it cannot elect a leader.  If you don't
 have *exact* coincidence of the survivor set and the set of laggards, then
 you won't have any data loss at all.

 You have to decide if this is too much risk for you.  My feeling is that it
 is an OK level of correctness for conventional weapon fire control, but not
 for nuclear weapons safeguards.  Since my apps are considerably less
 sensitive than either of those, I am not much worried.


