What is your client timeout? It may be too low.
Also see this section on handling recoverable errors:
http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling
Connection loss in particular needs special care: when a ZooKeeper client
loses its connection to the ZooKeeper server, there may be requests in
flight whose outcome the client cannot know.
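For reference, a minimal sketch of creating a client with an explicit
session timeout and a watcher that reports connection state; the connect
string and the 30-second timeout are placeholders, not values from this
thread:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ClientTimeoutExample {
        public static void main(String[] args) throws Exception {
            // Placeholder: pick a session timeout larger than your worst-case
            // GC pause / network hiccup, within the server's allowed bounds.
            int sessionTimeoutMs = 30000;
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181",
                    sessionTimeoutMs,
                    new Watcher() {
                        public void process(WatchedEvent event) {
                            // Disconnected = ConnectionLoss territory (the session
                            // may still be alive on the server);
                            // Expired = session gone, ephemerals and watches lost.
                            System.out.println("Connection state: " + event.getState());
                        }
                    });
            // ... use zk ...
            zk.close();
        }
    }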
Hi Satish,
ConnectionLoss is a little trickier than just retrying blindly. Please
read the following sections on this:
http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling
And the programmer's guide, to learn more:
http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperProgrammers.html
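To make the "retrying blindly" point concrete, here is a rough sketch (the
retry count and backoff are arbitrary placeholders, not from the wiki):
reads are idempotent and safe to retry on ConnectionLoss, while writes may
already have been applied before the connection dropped.

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ConnectionLossRetry {
        // Reads are idempotent, so retrying them on ConnectionLoss is safe.
        static Stat existsWithRetry(ZooKeeper zk, String path)
                throws KeeperException, InterruptedException {
            int attempts = 0;
            while (true) {
                try {
                    return zk.exists(path, false);
                } catch (KeeperException.ConnectionLossException e) {
                    if (++attempts >= 5) throw e;
                    Thread.sleep(1000L * attempts); // simple backoff
                }
            }
        }
        // Writes (create/setData/delete) are NOT safe to retry blindly: the
        // original request may have succeeded before the connection dropped,
        // so a retry has to check for, or tolerate, that outcome.
    }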
I'm not very familiar with the EC2 environment; are you doing any
monitoring, in particular of network connectivity between nodes? It sounds
like a networking issue between nodes (I'm assuming you've also looked at
things like http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting and
verified that you are not hitting the issues described there).
For my initial testing I am running with a single ZooKeeper server, i.e. the
ensemble only has one server. I'm not sure if this is exacerbating the
problem? I will check out the troubleshooting link you sent me.
On Tue, Sep 1, 2009 at 5:01 PM, Patrick Hunt ph...@apache.org wrote:
I'm not very
Depends on what your tests are. Are they pretty simple/light? Then it's
probably a network issue. Heavy load testing? Then it might be the
server/client, might be the network.
The easiest thing is to run a ping test while running your ZK test and see
if pings are getting through (and what the latency is). You should
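As a rough complement to ICMP ping, an application-level probe can show
whether round trips to the server ever approach the session timeout. This
is only a sketch of the idea; the connect string, timeout, and loop count
are placeholders:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class LatencyProbe {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await(); // wait until the session is established
            for (int i = 0; i < 60; i++) {
                long start = System.currentTimeMillis();
                zk.exists("/", false); // cheap request/response round trip
                System.out.println("round trip ms: "
                        + (System.currentTimeMillis() - start));
                Thread.sleep(1000);
            }
            zk.close();
        }
    }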
Can you enable verboseGC and look at the tenuring distribution and times for GC?
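For reference, assuming a HotSpot JVM of that era, flags along these lines
enable the relevant logging (the log file name is just an example):

    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -XX:+PrintTenuringDistribution -Xloggc:gc.log ...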
On Tue, Sep 1, 2009 at 5:54 PM, Satish Bhatti cthd2...@gmail.com wrote:
Parallel/Serial.
inf...@domu-12-31-39-06-3d-d1:/opt/ir/agent/infact-installs/aaa/infact$ iostat
Linux 2.6.18-xenU-ec2-v1.0
Henry Robinson wrote:
Effectively, EC2 does not introduce any new failure modes but potentially
exacerbates some existing ones. If a majority of EC2 nodes fail (in the
sense that their hard drive images cannot be recovered), there is no way to
restart the cluster, and persistence is lost. As you
Hi Ted,
b) EC2 interconnect has a lot more going on than in a dedicated VLAN. That
can make the ZK servers appear a bit less connected. You have to plan for
ConnectionLoss events.
Interesting.
c) For highest reliability, I switched to large instances. On reflection, I
think that was
On Jul 6, 2009, at 15:40, Henry Robinson wrote:
This is an interesting way of doing things. It seems like there is a
correctness issue: if a majority of servers fail, with the remaining
minority lagging the leader for some reason, won't the ensemble's current
state be forever lost? This is
On Mon, Jul 6, 2009 at 12:58 PM, Gustavo Niemeyer gust...@niemeyer.netwrote:
Note that most of these seem to be related to client issues, especially GC.
If you configure in such
Hi again,
(...)
ZK seemed pretty darned stable through all of this.
Sounds like a nice test, and it's great to hear that ZooKeeper works well there.
The only instability that I saw was caused by excessive amounts of data in
ZK itself. As I neared the (small) amount of memory I had allocated
On Mon, Jul 6, 2009 at 10:16 PM, Ted Dunning ted.dunn...@gmail.com wrote:
No. This should not cause data loss.
As soon as ZK cannot replicate changes to a majority of machines, it
refuses to take any more changes. This is old ground and is required for
correctness in the face of network partitions.
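For concreteness (my own illustration, not from the thread): the majority
needed is floor(n/2) + 1, so a 5-server ensemble needs 3 acks per change
and can lose 2 servers while still taking writes.

    public class Quorum {
        // Majority quorum size for an ensemble of n servers.
        static int quorumSize(int n) {
            return n / 2 + 1; // 3 -> 2, 5 -> 3; tolerates n - quorumSize(n) failures
        }
    }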