Re: zookeeper on ec2

2009-09-01 Thread Patrick Hunt
What is your client timeout? It may be too low. Also see this section on handling recoverable errors: http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling Connection loss in particular needs special care since: When a ZooKeeper client loses a connection to the ZooKeeper server there may be

Re: zookeeper on ec2

2009-09-01 Thread Mahadev Konar
Hi Satish, ConnectionLoss is a little trickier than just retrying blindly. Please read the following section on this: http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling and the programmer's guide: http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperProgrammers.html To learn more
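
Why a blind retry is risky can be sketched as follows. This is only an illustration and does not use the real ZooKeeper client API: `FlakyStore`, `ConnectionLoss`, `NodeExists`, and `create_once` are hypothetical names, and the in-memory store simulates a server that applies a request but drops the connection before the reply arrives.

```python
class ConnectionLoss(Exception):
    """Hypothetical stand-in for a client-side connection-loss error."""


class NodeExists(Exception):
    """Hypothetical stand-in for a 'node already exists' error."""


class FlakyStore:
    """In-memory stand-in for a server; NOT the ZooKeeper API."""

    def __init__(self, drop_replies=1):
        self.nodes = {}
        self.drop_replies = drop_replies  # number of replies to lose

    def create(self, path, data):
        if path in self.nodes:
            raise NodeExists(path)
        self.nodes[path] = data  # the request is applied on the server...
        if self.drop_replies > 0:
            self.drop_replies -= 1
            raise ConnectionLoss(path)  # ...but the reply never arrives

    def get(self, path):
        return self.nodes.get(path)


def create_once(store, path, data, attempts=3):
    """Create path; after a loss, 'exists with our data' counts as success."""
    lost = False
    for _ in range(attempts):
        try:
            store.create(path, data)
            return True
        except ConnectionLoss:
            lost = True  # outcome unknown: the create may have landed
        except NodeExists:
            # Claim success only if an earlier attempt of ours may have
            # landed and the stored data matches what we tried to write.
            return lost and store.get(path) == data
    return False
```

A retry that ignored the earlier loss would report a failure here even though the node was created by the first attempt. Real applications need more care than this sketch shows, since matching data alone does not prove who created the node.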

Re: zookeeper on ec2

2009-09-01 Thread Patrick Hunt
I'm not very familiar with the EC2 environment. Are you doing any monitoring, in particular of network connectivity between nodes? Sounds like networking issues between nodes (I'm assuming you've also looked at stuff like this http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting and verified that you are

Re: zookeeper on ec2

2009-09-01 Thread Satish Bhatti
For my initial testing I am running with a single ZooKeeper server, i.e. the ensemble only has one server. Not sure if this is exacerbating the problem? I will check out the troubleshooting link you sent me. On Tue, Sep 1, 2009 at 5:01 PM, Patrick Hunt ph...@apache.org wrote: I'm not very

Re: zookeeper on ec2

2009-09-01 Thread Patrick Hunt
Depends on what your tests are. Are they pretty simple/light? Then it's probably a network issue. Heavy load testing? Then it might be the server/client, or it might be the network. The easiest thing is to run a ping test while running your zk test and see if pings are getting through (and with what latency). You should
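
A connectivity check of this kind could be sketched as below. It is a substitute for the ICMP ping mentioned above, not the same thing: it times a TCP connect instead, and the function names `probe` and `summarize` are made up for illustration.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Return TCP connect latency in seconds, or None if the attempt failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # refused, unreachable, or timed out
        return None


def summarize(samples):
    """Count failed probes and report the worst observed latency."""
    ok = [s for s in samples if s is not None]
    return {
        "sent": len(samples),
        "lost": len(samples) - len(ok),
        "max_latency": max(ok) if ok else None,
    }
```

For example, `summarize([probe("zk-host", 2181) for _ in range(10)])` run alongside the test would show whether connects are failing or slowing down. Here `zk-host` is a placeholder host name, and 2181 is the default ZooKeeper client port.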

Re: zookeeper on ec2

2009-09-01 Thread Ted Dunning
Can you enable verboseGC and look at the tenuring distribution and times for GC? On Tue, Sep 1, 2009 at 5:54 PM, Satish Bhatti cthd2...@gmail.com wrote: Parallel/Serial. inf...@domu-12-31-39-06-3d-d1:/opt/ir/agent/infact-installs/aaa/infact$ iostat Linux 2.6.18-xenU-ec2-v1.0

Re: zookeeper on ec2

2009-07-07 Thread Patrick Hunt
Henry Robinson wrote: Effectively, EC2 does not introduce any new failure modes but potentially exacerbates some existing ones. If a majority of EC2 nodes fail (in the sense that their hard drive images cannot be recovered), there is no way to restart the cluster, and persistence is lost. As you

Re: zookeeper on ec2

2009-07-06 Thread Gustavo Niemeyer
Hi Ted, b) EC2 interconnect has a lot more going on than in a dedicated VLAN.  That can make the ZK servers appear a bit less connected.  You have to plan for ConnectionLoss events. Interesting. c) for highest reliability, I switched to large instances.  On reflection, I think that was

Re: zookeeper on ec2

2009-07-06 Thread Evan Jones
On Jul 6, 2009, at 15:40 , Henry Robinson wrote: This is an interesting way of doing things. It seems like there is a correctness issue: if a majority of servers fail, with the remaining minority lagging the leader for some reason, won't the ensemble's current state be forever lost? This is

Re: zookeeper on ec2

2009-07-06 Thread Ted Dunning
On Mon, Jul 6, 2009 at 12:58 PM, Gustavo Niemeyer gust...@niemeyer.net wrote: can make the ZK servers appear a bit less connected. You have to plan for ConnectionLoss events. Interesting. Note that most of these seem to be related to client issues, especially GC. If you configure in such

Re: zookeeper on ec2

2009-07-06 Thread Gustavo Niemeyer
Hi again, (...) ZK seemed pretty darned stable through all of this. Sounds like a nice test, and it's great to hear that ZooKeeper works well there. The only instability that I saw was caused by excessive amounts of data in ZK itself.  As I neared the (small) amount of memory I had allocated

Re: zookeeper on ec2

2009-07-06 Thread Henry Robinson
On Mon, Jul 6, 2009 at 10:16 PM, Ted Dunning ted.dunn...@gmail.com wrote: No. This should not cause data loss. As soon as ZK cannot replicate changes to a majority of machines, it refuses to take any more changes. This is old ground and is required for correctness in the face of network
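
The majority rule described above can be illustrated with a small sketch. It is a simplification for illustration, not ZooKeeper's implementation, and `has_quorum` is a hypothetical helper name.

```python
def has_quorum(ensemble_size, reachable):
    """True if the reachable servers form a strict majority of the ensemble."""
    return reachable > ensemble_size // 2


# A 3-server ensemble keeps accepting changes with 2 servers reachable,
# but refuses them once only 1 server is left.
assert has_quorum(3, 2)
assert not has_quorum(3, 1)
```

Because any two majorities of the same ensemble overlap in at least one server, a change accepted by a majority is seen by at least one member of every later majority.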