Re: How to reestablish a session

2010-11-18 Thread Benjamin Reed
ah i see. you are manually reestablishing the connection to B using the 
session identifier for the session with A.


the problem is that when you call close on a session, it kills the 
session. we don't really have a way to close a handle without doing 
that. (actually there is a test class in java that does it.)


if you want this, you should open a jira to do a close() without killing 
the session.


why don't you let the client library do the move for you?

ben


On 11/18/2010 11:51 AM, Gustavo Niemeyer wrote:

Hi Ben,


that quote is a bit out of context. it was with respect to a proposed
change.

My point was just that the reasoning for why you believed it wasn't a
good approach to kill ephemerals in that old instance applies to the new
cases I'm pointing out.  I wasn't suggesting you agreed with my new
reasoning upfront.


in your scenario can you explain step 4)? what are you closing?

I'm closing the old ZooKeeper handle (zh), after a new one was
established with the same client id.





Re: How to reestablish a session

2010-11-18 Thread Benjamin Reed
oops, sorry camille, i didn't mean to replicate your answer. you 
explained it better than me :)


ben

On 11/18/2010 10:06 AM, Fournier, Camille F. [Tech] wrote:

This is exactly the scenario that you use to test session expiration: make one 
connection to a ZK server, then another with the same session id and password, and 
close the second connection, which causes the first to expire. It is only a 
clean close that will cause this to happen, though (one where the client calls 
close to end the connection).

Right now, if you have a partition between client and server A, I would not 
expect server A to see a clean close from the client, but one of the various 
exceptions that cause the socket to close. These do not do anything currently 
to change the state of the session, and if the client connects elsewhere before 
the session timeout, the session will remain active.
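Camille's distinction between a clean close and a socket-level failure can be sketched with a tiny model (illustrative only; `Session` and `Connection` are hypothetical stand-ins, not ZooKeeper classes):

```python
# Toy model: a ZooKeeper session survives socket errors, but a clean
# close() from any handle sharing the session expires it server-side.

class Session:
    def __init__(self, session_id, password):
        self.session_id = session_id
        self.password = password
        self.expired = False
        self.ephemerals = set()

class Connection:
    def __init__(self, session):
        self.session = session

    def create_ephemeral(self, path):
        self.session.ephemerals.add(path)

    def close(self):
        # clean close: the server expires the session and deletes
        # its ephemeral nodes
        self.session.expired = True
        self.session.ephemerals.clear()

    def socket_error(self):
        # partition/reset: session state is untouched; the session stays
        # alive until the session timeout elapses
        pass

s = Session(0x12345678, b"pw")
a = Connection(s)            # first connection
a.create_ephemeral("/lock")  # ephemeral owned by the session
b = Connection(s)            # second connection, same session id/password
a.close()                    # clean close of either handle expires the session
print(s.expired, s.ephemerals)  # True set()
```

In the partition case, replacing `a.close()` with `a.socket_error()` leaves the session (and its ephemerals) intact, which matches the behavior described for a non-clean disconnect.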

C


-Original Message-
From: Gustavo Niemeyer [mailto:gust...@niemeyer.net]
Sent: Thursday, November 18, 2010 10:16 AM
To: ZooKeeper Users
Subject: How to reestablish a session

Greetings,

As some of you already know, we've been using ZooKeeper at Canonical
for a project we've been pushing (Ensemble, http://j.mp/dql6Fu).
We've already written down txzookeeper (http://j.mp/d3Zx7z), to
integrate the Python bindings with Twisted, and we're also in the
process of creating a Go binding for the C ZooKeeper library (to be
released soon).

Yesterday, while working on the Go bindings, a test made me wonder
what the correct way is to reestablish a session with ZooKeeper.

In another thread a couple of months ago, Ben mentioned:


i'm a bit skeptical that this is going to work out properly. a server may
receive a socket reset even though the client is still alive:

1) client sends a request to a server
2) client is partitioned from the server
3) server starts trying to send response
4) client reconnects to a different server
5) partition heals
6) server gets a reset from client

at step 6 i don't think you want to delete the ephemeral nodes.

I also don't think it should delete ephemeral nodes.  While performing
some tests, though, I noticed that something similar to this may
happen.

The following sequence was performed in the test:

1) Establish connection A to ZK
2) Create an ephemeral node with A
3) Establish connection B to ZK, reusing the session from A
4) Close connection A
5) The ephemeral node from (2) got deleted.

So, this made me wonder what the proper way is to reestablish a
session in practice after partitioning. Imagine that the
reconnection which happened in (3) was an attempt by the client to
restore communication with the ZK cluster when faced with
partitioning.  Once the new connection succeeded, the old resources from
connection A should be disposed of, but how can that be done without risking
killing the healthy connection B (imagine that the network comes
back between (3) and (4))?

Does anyone have thoughts on that?





Re: Running cluster behind load balancer

2010-11-04 Thread Benjamin Reed
one thing to note: if you are using a DNS load balancer, some load 
balancers will return the list of resolved addresses in different orders 
to do the balancing. the zookeeper client will shuffle that list before 
it is used, so in reality, using a single DNS hostname resolving to all 
the server addresses will probably work just as well as most DNS-based 
load balancers.
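the shuffle happens inside the client when it expands the host list; a rough sketch of the idea (not the actual client code; the function name is made up):

```python
import random
import socket

def shuffled_server_list(hostname, port):
    """Resolve a DNS name to all of its IPv4 addresses and shuffle them,
    mimicking how the ZooKeeper client randomizes its server list."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    addrs = sorted({info[4][0] for info in infos})  # de-duplicate IPs
    random.shuffle(addrs)  # every client gets its own connection order
    return [(ip, port) for ip in addrs]

print(shuffled_server_list("localhost", 2181))
```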


ben

On 11/04/2010 08:26 AM, Patrick Hunt wrote:

Hi Chang, thanks for the insights, if you have a few minutes would you
mind updating the FAQ with some of this detail?
http://wiki.apache.org/hadoop/ZooKeeper/FAQ

Thanks!

Patrick

On Thu, Nov 4, 2010 at 6:27 AM, Chang Song <tru64...@me.com> wrote:

Sorry. I made a mistake on retry timeout in load balancer section of my answer.
The same timeout applies to load balancer case as well (depends on the recv
timeout)

Thank you

Chang


On Nov 4, 2010, at 10:22 PM, Chang Song wrote:


I would like to add some info on this.

This may not be very important, but there are subtle differences.

Two cases:  1. server hardware failure or kernel panic
  2. zookeeper Java daemon process down

In the former case, the timeout is based on the timeout argument to 
zookeeper_init(), and partially on the ZK heartbeat algorithm: the client 
recognizes a server as down after 2/3 of the timeout, then retries at every 
timeout. For example, if the timeout is 9000 msec, it first times out in 6 
seconds, and retries every 9 seconds.

In the latter case (Java process down), since the socket connect immediately 
returns a refused connection, the client can retry immediately.

On top of that,

- Hardware load balancer:
If an ensemble cluster is serviced with hardware load balancer,
zookeeper client will retry every 2 second since we only have one IP to try.

- DNS RR:
Make sure that nscd on your linux box is off, since it is most likely that the 
DNS cache returns the same IP many times.
This is actually worse than the above, since the ZK client will retry the same 
dead server every 2 seconds for some time.


I think it is best not to use a load balancer for ZK clients, since ZK clients 
will try the next server immediately
if the previous one fails for some reason (based on the timeouts above). And this is 
especially true if your cluster works in
a pseudo-realtime environment where tickTime is set very low.
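The timings Chang describes can be checked with a quick sketch of the stated heuristic (the 2/3 rule is as described above, not pulled from the client source):

```python
def failure_detection_times(session_timeout_ms):
    """Dead-server (hardware/partition) case: the client notices the
    server is gone after 2/3 of the session timeout, then retries at
    every full timeout thereafter."""
    first_detect_ms = session_timeout_ms * 2 // 3
    retry_interval_ms = session_timeout_ms
    return first_detect_ms, retry_interval_ms

# Chang's example: a 9000 ms timeout means first detection at 6 s,
# then a retry every 9 s.
print(failure_detection_times(9000))  # (6000, 9000)
```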


Chang


On Nov 4, 2010, at 9:17 AM, Ted Dunning wrote:


DNS round-robin works as well.

On Wed, Nov 3, 2010 at 3:45 PM, Benjamin Reed <br...@yahoo-inc.com> wrote:


it would have to be a TCP-based load balancer to work with ZooKeeper
clients, but other than that it should work really well. The clients will be
doing heartbeats, so the TCP connections will be long-lived. The client
library does random connection load balancing anyway.

ben

On 11/03/2010 12:19 PM, Luka Stojanovic wrote:


What would be expected behavior if a three node cluster is put behind a
load
balancer? It would ease deployment because all clients would be configured
to target zookeeper.example.com regardless of actual cluster
configuration,
but I have impression that client-server connection is stateful and that
jumping randomly from server to server could bring strange behavior.

Cheers,

--
Luka Stojanovic
lu...@vast.com
Platform Engineering









Re: Getting a node exists code on a sequence create

2010-11-03 Thread Benjamin Reed

yes, i think you have summarized the problem nicely jeremy.

i'm curious about your reasoning for running servers in standalone mode 
and then merging. can you explain that a bit more?


thanx
ben

On 11/01/2010 04:51 PM, Jeremy Stribling wrote:

I think this is caused by stupid behavior on our application's part, and
the error message just confused me.  Here's what I think is happening.

1) 3 servers are up and accepting data, creating sequential znodes under
/zkrsm.
2) 1 server dies, the other 2 continue creating sequential znodes.
3) The 1st server restarts, but instead of joining the other 2 servers,
it starts an instance by itself, knowing only about the znodes created
before it died.  [This is a bug in our application -- it is supposed to
join the other 2 servers in their cluster.]
4) Another server (#2) dies and restarts, joining the cluster of server
#1.  It knows about more sequential znodes under /zkrsm than server #1.
5) At this point, trying to create a new znode in the #1-#2 cluster
might be problematic, because servers #1 and #2 know about different
sets of znodes.  If #1 allocates what it thinks is a new sequential
number for a new znode, it could be one already used by server #2, and
hence a node exists code might be returned.
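The collision in step 5 follows from how sequential names are built: the server appends a 10-digit, zero-padded counter derived from the parent's cversion. A toy model (the `Parent` class is illustrative, not server code) shows how a rolled-back cversion can reuse a name:

```python
class Parent:
    """Toy parent znode: sequential children get the parent's child
    version (cversion) appended as a 10-digit zero-padded counter."""
    def __init__(self, cversion=0):
        self.cversion = cversion
        self.children = set()

    def create_sequential(self, prefix):
        name = "%s%010d" % (prefix, self.cversion)
        if name in self.children:
            # the NodeExists error from the original report
            raise FileExistsError("NodeExists: " + name)
        self.children.add(name)
        self.cversion += 1
        return name

# Server #1 came back knowing fewer children, so its cversion (150) lags
# behind names server #2 already created (e.g. counter 150 is taken):
p = Parent(cversion=150)
p.children.add("/zkrsm/_record%010d" % 150)
try:
    p.create_sequential("/zkrsm/_record")
except FileExistsError as e:
    print(e)  # NodeExists: /zkrsm/_record0000000150
```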

So, in summary, our application is almost certainly using Zookeeper
wrong.  Sorry to waste time on the list, but maybe this thread can help
someone in the future.

(If this explanation sounds totally off-base though, let me know.  I'm
not 100% certain this is what's happening, but it definitely seems likely.)

Thanks,

Jeremy

On 11/01/2010 02:56 PM, Jeremy Stribling wrote:

Yes, every znode in /zkrsm was created with the sequence flag.

We bring up a cluster of three nodes, though we do it in a slightly
odd manner to support dynamism: each node starts up as a single-node
instance knowing only itself, and then each node is contacted by a
coordinator that kills the ZooKeeperServer object and starts a new
QuorumPeer object using the full list of three servers.  I know this
is weird; perhaps this has something to do with it.

Other than the weird setup behavior, we are just writing a few
sequential records into the system (which all seems to work fine),
killing one of the nodes (one that has been elected leader via the
standard recommended ZK leader election algorithm), restarting it, and
then trying to create more sequential znodes.  I'm guessing this is
pretty well-tested behavior, so there must be something weird or wrong
about the way I have things set up.

I'm happy to provide whatever logs or snapshots might help someone
track this down.  Thanks,

Jeremy

On 11/01/2010 02:42 PM, Benjamin Reed wrote:

how were you able to reproduce it?

all the znodes in /zkrsm were created with the sequence flag. right?

ben

On 11/01/2010 02:28 PM, Jeremy Stribling wrote:

We were able to reproduce it.  A stat on all three servers looks
identical:

[zk:ip:port(CONNECTED) 0] stat /zkrsm
cZxid = 9
ctime = Mon Nov 01 13:01:57 PDT 2010
mZxid = 9
mtime = Mon Nov 01 13:01:57 PDT 2010
pZxid = 12884902218
cversion = 177
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0
dataLength = 0
numChildren = 177

Creating a sequential node through the command line also fails:

[zk:ip:port(CONNECTED) 1] create -s /zkrsm/_record
testdata
Node already exists: /zkrsm/_record

One potentially interesting thing is that numChildren above is 177,
though I have sequence numbers on that record prefix up to 214 or so.
There seem to be some gaps though -- I think ls /zkrsm only shows
about 177.  Not sure if that's relevant or not.

Thanks,

Jeremy

On 11/01/2010 12:06 PM, Jeremy Stribling wrote:

Thanks for the reply.  It happened every time we called create, not
just once.  More than that, we tried restarting each of the nodes in
the system (one-by-one), including the new master, and the problem
continued.

Unfortunately we cleaned everything up, and it's not in that state
anymore.  We haven't yet tried to reproduce, but I will try and report
back if I can get any cversion info.

Jeremy

On 11/01/2010 11:33 AM, Patrick Hunt wrote:

Hi Jeremy, this sounds like a bug to me, I don't think you should be
getting the nodeexists when the sequence flag is set.

Looking at the code briefly we use the parent's cversion
(incremented each time the child list is changed, added/removed).

Did you see this error each time you called create, or just once? If
you look at the cversion in the Stat of the znode /zkrsm on each of
the servers what does it show? You can use the java CLI to connect to
each of your servers and access this information. It would be
interesting to see if the data was out of sync only for a short
period
of time, or forever. Is this repeatable?

Ben/Flavio do you see anything here?

Patrick

On Thu, Oct 28, 2010 at 6:06 PM, Jeremy Striblingst...@nicira.com
wrote:

HI everyone,

Is there any situation in which creating a new ZK node with the
SEQUENCE
flag should result

Re: Getting a node exists code on a sequence create

2010-11-01 Thread Benjamin Reed

how were you able to reproduce it?

all the znodes in /zkrsm were created with the sequence flag. right?

ben

On 11/01/2010 02:28 PM, Jeremy Stribling wrote:

We were able to reproduce it.  A stat on all three servers looks
identical:

[zk:ip:port(CONNECTED) 0] stat /zkrsm
cZxid = 9
ctime = Mon Nov 01 13:01:57 PDT 2010
mZxid = 9
mtime = Mon Nov 01 13:01:57 PDT 2010
pZxid = 12884902218
cversion = 177
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0
dataLength = 0
numChildren = 177

Creating a sequential node through the command line also fails:

[zk:ip:port(CONNECTED) 1] create -s /zkrsm/_record
testdata
Node already exists: /zkrsm/_record

One potentially interesting thing is that numChildren above is 177,
though I have sequence numbers on that record prefix up to 214 or so.
There seem to be some gaps though -- I think ls /zkrsm only shows about
177.  Not sure if that's relevant or not.

Thanks,

Jeremy

On 11/01/2010 12:06 PM, Jeremy Stribling wrote:

Thanks for the reply.  It happened every time we called create, not
just once.  More than that, we tried restarting each of the nodes in
the system (one-by-one), including the new master, and the problem
continued.

Unfortunately we cleaned everything up, and it's not in that state
anymore.  We haven't yet tried to reproduce, but I will try and report
back if I can get any cversion info.

Jeremy

On 11/01/2010 11:33 AM, Patrick Hunt wrote:

Hi Jeremy, this sounds like a bug to me, I don't think you should be
getting the nodeexists when the sequence flag is set.

Looking at the code briefly we use the parent's cversion
(incremented each time the child list is changed, added/removed).

Did you see this error each time you called create, or just once? If
you look at the cversion in the Stat of the znode /zkrsm on each of
the servers what does it show? You can use the java CLI to connect to
each of your servers and access this information. It would be
interesting to see if the data was out of sync only for a short period
of time, or forever. Is this repeatable?

Ben/Flavio do you see anything here?

Patrick

On Thu, Oct 28, 2010 at 6:06 PM, Jeremy Striblingst...@nicira.com
wrote:

HI everyone,

Is there any situation in which creating a new ZK node with the
SEQUENCE
flag should result in a node exists error?  I'm seeing this happening
after a failure of a ZK node that appeared to have been the master;
when the
new master takes over, my app is unable to create a new SEQUENCE
node under
an existing parent node.  I'm using Zookeeper 3.2.2.

Here's a representative log snippet:

--
3050756 [ProcessThread:-1] TRACE
org.apache.zookeeper.server.PrepRequestProcessor  -
:Psessionid:0x12bf518350f0001 type:create cxid:0x4cca0691
zxid:0xfffe txntype:unknown /zkrsm/_record
3050756 [ProcessThread:-1] WARN
org.apache.zookeeper.server.PrepRequestProcessor  - Got exception when
processing sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691
zxid:0xfffe txntype:unknown n/a
org.apache.zookeeper.KeeperException$NodeExistsException:
KeeperErrorCode =
NodeExists
 at
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:245)

 at
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)

3050756 [ProcessThread:-1] DEBUG
org.apache.zookeeper.server.quorum.CommitProcessor  - Processing
request::
sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691
zxid:0x5027e
txntype:-1 n/a
3050756 [ProcessThread:-1] DEBUG
org.apache.zookeeper.server.quorum.Leader
   - Proposing:: sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691
zxid:0x5027e txntype:-1 n/a
3050756 [SyncThread:0] TRACE
org.apache.zookeeper.server.quorum.Leader  -
Ack zxid: 0x5027e
3050757 [SyncThread:0] TRACE
org.apache.zookeeper.server.quorum.Leader  -
outstanding proposal: 0x5027e
3050757 [SyncThread:0] TRACE
org.apache.zookeeper.server.quorum.Leader  -
outstanding proposals all
3050757 [SyncThread:0] DEBUG
org.apache.zookeeper.server.quorum.Leader  -
Count for zxid: 0x5027e is 1
3050757 [FollowerHandler-/172.16.0.28:48776] TRACE
org.apache.zookeeper.server.quorum.Leader  - Ack zxid: 0x5027e
3050757 [FollowerHandler-/172.16.0.28:48776] TRACE
org.apache.zookeeper.server.quorum.Leader  - outstanding proposal:
0x5027e
3050757 [FollowerHandler-/172.16.0.28:48776] TRACE
org.apache.zookeeper.server.quorum.Leader  - outstanding proposals all
3050757 [FollowerHandler-/172.16.0.28:48776] DEBUG
org.apache.zookeeper.server.quorum.Leader  - Count for zxid:
0x5027e is
2
3050757 [FollowerHandler-/172.16.0.28:48776] DEBUG
org.apache.zookeeper.server.quorum.CommitProcessor  - Committing
request::
sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691
zxid:0x5027e
txntype:-1 n/a
3050757 [CommitProcessor:0] DEBUG
org.apache.zookeeper.server.FinalRequestProcessor  - Processing
request::
sessionid:0x12bf518350f0001 type:create 

Re: Is it possible to read/write a ledger concurrently

2010-10-22 Thread Benjamin Reed
 in hedwig one hub does both the publish and subscribe for a given 
topic and is therefore the only process reading and writing from/to a 
ledger, so there isn't an issue.


The ReadAheadCache does read-ahead :) it is so that we can minimize 
latency when doing sequential reads.


ben

On 10/21/2010 11:30 PM, amit jaiswal wrote:

Hi,

How does Hedwig handle this scenario? Since only one of the hubs has
ownership of a topic, the same hub is able to serve both publish and subscribe
requests concurrently. Is my understanding correct?

Also, what is the purpose of ReadAheadCache class in Hedwig? Is it used
somewhere for this concurrent read/write problem?

-regards
Amit

- Original Message 
From: Benjamin Reedbr...@yahoo-inc.com
To: zookeeper-user@hadoop.apache.org
Sent: Fri, 22 October, 2010 11:09:07 AM
Subject: Re: Is it possible to read/write a ledger concurrently

currently program1 can read and write to an open ledger, but program2 must wait
for the ledger to be closed before doing the read. the problem is that program2
needs to know the last valid entry in the ledger. (there may be entries that are
not yet valid.) for performance reasons, only program1 knows the end. so you
need a way to propagate that information.

we have talked about a way to push the last entry into the bookkeeper handle.
flavio was working on it, but i don't think it has been implemented.

ben

On 10/21/2010 10:22 PM, amit jaiswal wrote:

Hi,

In BookKeeper documentation, the sample program creates a ledger, writes some
entries and then *closes* the ledger. Then a client program opens the ledger,
and reads the entries from it.

Is it possible for program1 to write to a ledger, and program2 to read from the
ledger at the same time? In BookKeeper code, if a client tries to read from a
ledger which has not been closed (as per its metadata in zk), then a recovery
process is started to check for consistency.

Waiting for the ledger to be closed can introduce a lot of latency at the client
side. Can somebody explain this functionality?

-regards
Amit




Re: Is it possible to read/write a ledger concurrently

2010-10-21 Thread Benjamin Reed
 currently program1 can read and write to an open ledger, but program2 
must wait for the ledger to be closed before doing the read. the problem 
is that program2 needs to know the last valid entry in the ledger. 
(there may be entries that are not yet valid.) for performance 
reasons, only program1 knows the end. so you need a way to propagate 
that information.
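the "last valid entry" problem can be modeled with a toy ledger (illustrative; not the BookKeeper API): the writer alone knows the last quorum-acknowledged entry, so a concurrent reader cannot tell where the valid data ends unless that value is propagated to it.

```python
class ToyLedger:
    """Toy open ledger: the writer tracks the last entry acknowledged by
    a quorum of bookies; only entries up to that point are valid reads."""
    def __init__(self):
        self.entries = []
        self.last_confirmed = -1  # known only to the writer

    def add_entry(self, data, quorum_acked=True):
        self.entries.append(data)
        if quorum_acked:
            self.last_confirmed = len(self.entries) - 1

    def read_valid(self):
        # what a concurrent reader could safely read, *if* it learned
        # last_confirmed from the writer
        return self.entries[: self.last_confirmed + 1]

led = ToyLedger()
led.add_entry(b"e0")
led.add_entry(b"e1")
led.add_entry(b"e2", quorum_acked=False)  # written, but not yet valid
print(led.read_valid())  # [b'e0', b'e1']
```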


we have talked about a way to push the last entry into the bookkeeper 
handle. flavio was working on it, but i don't think it has been implemented.


ben

On 10/21/2010 10:22 PM, amit jaiswal wrote:

Hi,

In BookKeeper documentation, the sample program creates a ledger, writes some
entries and then *closes* the ledger. Then a client program opens the ledger,
and reads the entries from it.

Is it possible for program1 to write to a ledger, and program2 to read from the
ledger at the same time? In BookKeeper code, if a client tries to read from a
ledger which has not been closed (as per its metadata in zk), then a recovery
process is started to check for consistency.

Waiting for the ledger to be closed can introduce a lot of latency at the client
side. Can somebody explain this functionality?

-regards
Amit




Re: invalid acl for ZOO_CREATOR_ALL_ACL

2010-10-19 Thread Benjamin Reed

 which scheme are you using?

ben

On 10/18/2010 11:57 PM, FANG Yang wrote:

2010/10/19 FANG Yangfa...@douban.com


hi, all
 I have a simple zk client written by c ,which is attachment  #1. When i
use  ZOO_CREATOR_ALL_ACL, the ret code of zoo_create is -114((Invalid ACL
specified definde in zookeeper.h)), but after i replace it with
ZOO_OPEN_ACL_UNSAFE, it work. Zookeeper Programmer's Guide mention that
CREATE_ALL_ACL grants all permissions to the creator of the node. The
creator must have been authenticated by the server (for example, using
“digest” scheme) before it can create nodes with this ACL. I call
zoo_add_auth accroding to func testAuth of TestClient.cc in src/c/tests in
my source code, but it doesn't work, ret code is still -114 . Would you guys
do me a favor, plz?

--
方阳 FANG Yang
开发工程师 Software Engineer
Douban Inc.
msn:franklin.f...@hotmail.com
gtalk:franklin.f...@gmail.com
skype:franklin.fang
No.14 Jiuxianqiao Road, Area 51 A1-1-2016, Beijing 100016 , China
北京市酒仙桥路14号51楼A1区1门2016,100016


   And my zk version is 3.2.2, mode is standalone.






Re: zxid integer overflow

2010-10-19 Thread Benjamin Reed
 we should put in a test for that. it is certainly a plausible 
scenario. in theory it will just flow into the next epoch and everything 
will be fine, but we should try it and see.


ben

On 10/19/2010 11:33 AM, Sandy Pratt wrote:

Just as a thought experiment, I was pondering the following:

ZK stamps each change to its managed state with a zxid 
(http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperInternals.html).  That 
ID consists of a 64 bit number in which the upper 32 bits are the epoch, which 
changes when the leader does, and the bottom 32 bits are a counter, which is 
incremented by the leader with every change.  If 1000 changes are made to ZK 
state each second (which is 1/20th of the peak rate advertised), then the 
counter portion will roll over in 2^32 / (86400 * 1000) = 49 days.
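Sandy's layout and arithmetic can be verified directly (the epoch/counter split is from the internals doc cited above; the helper names here are made up):

```python
def make_zxid(epoch, counter):
    """zxid: upper 32 bits are the leader epoch, lower 32 the counter."""
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def split_zxid(zxid):
    return zxid >> 32, zxid & 0xFFFFFFFF

# At 1000 changes/sec the counter wraps in 2^32 / (86400 * 1000) days:
print(round(2**32 / (86400 * 1000), 1))  # 49.7

# On overflow the carry flows into the epoch bits, i.e. it looks like
# an epoch change without an actual leader election:
assert split_zxid(make_zxid(5, 0xFFFFFFFF) + 1) == (6, 0)
```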

Now, assuming that my math is correct, is this an actual concern?  For example, 
if I'm using ZK to provide locking for a key value store that handles 
transactions at about that rate, am I setting myself up for failure?

Thanks,

Sandy




Re: What does this mean?

2010-10-11 Thread Benjamin Reed
 how big is your data? you may be running into the problem where it 
takes too long to do the state transfer and times out. check the 
initLimit and the size of your data.
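for reference, the relevant settings live in zoo.cfg; a hedged example (values illustrative, tune to your snapshot size and network):

```
# zoo.cfg sketch: with tickTime=2000 ms, initLimit=20 gives a follower
# 20 * 2000 ms = 40 s to connect to the leader and pull the snapshot.
tickTime=2000
initLimit=20
# syncLimit bounds how far a follower may lag once it is in sync.
syncLimit=5
```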


ben

On 10/10/2010 08:57 AM, Avinash Lakshman wrote:

Thanks Ben. I am not mixing processes of different clusters. I just double
checked that. I have ZK deployed in a 5 node cluster and I have 20
observers. I just started the 5 node cluster w/o starting the observers. I
still see the same issue. Now my cluster won't start up. So what is the correct
workaround to get this going? How can I find out who the leader is and who
the follower is, to get more insight?

Thanks
A

On Sun, Oct 10, 2010 at 8:33 AM, Benjamin Reedbr...@yahoo-inc.com  wrote:


this usually happens when a follower closes its connection to the leader.
it is usually caused by the follower shutting down or failing. you may get
further insight by looking at the follower logs. you should really run with
timestamps on so that you can correlate the logs of the leader and follower.

one thing that is strange is the wide divergence between zxid of follower
and leader. are you mixing processes of different clusters?

ben


From: Avinash Lakshman [avinash.laksh...@gmail.com]
Sent: Sunday, October 10, 2010 8:18 AM
To: zookeeper-user
Subject: What does this mean?

I see this exception and the servers not doing anything.

java.io.IOException: Channel eof
at

org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:630)
ERROR - 124554051584(higestZxid)  21477836646(next log) for type -11
WARN - Sending snapshot last zxid of peer is 0xe  zxid of leader is
0x1e
WARN - Sending snapshot last zxid of peer is 0x18  zxid of leader
is
0x1eg
  WARN - Sending snapshot last zxid of peer is 0x5002dc766  zxid of leader
is
0x1e
WARN - Sending snapshot last zxid of peer is 0x1c  zxid of leader
is
0x1e
ERROR - Unexpected exception causing shutdown while sock still open
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:78)
at java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
at
org.apache.jute.BinaryOutputArchive.writeInt(BinaryOutputArchive.java:55)
at
org.apache.zookeeper.data.StatPersisted.serialize(StatPersisted.java:116)
at org.apache.zookeeper.server.DataNode.serialize(DataNode.java:167)
at

org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:967)
at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982)
at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982)
at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982)
at
org.apache.zookeeper.server.DataTree.serialize(DataTree.java:1031)
at

org.apache.zookeeper.server.util.SerializeUtils.serializeSnapshot(SerializeUtils.java:104)
at

org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:426)
at

org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:331)
WARN - *** GOODBYE /10.138.34.212:33272 

Avinash





RE: What does this mean?

2010-10-10 Thread Benjamin Reed
this usually happens when a follower closes its connection to the leader. it is 
usually caused by the follower shutting down or failing. you may get further 
insight by looking at the follower logs. you should really run with timestamps 
on so that you can correlate the logs of the leader and follower.

one thing that is strange is the wide divergence between zxid of follower and 
leader. are you mixing processes of different clusters?

ben


From: Avinash Lakshman [avinash.laksh...@gmail.com]
Sent: Sunday, October 10, 2010 8:18 AM
To: zookeeper-user
Subject: What does this mean?

I see this exception and the servers not doing anything.

java.io.IOException: Channel eof
at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:630)
ERROR - 124554051584(higestZxid)  21477836646(next log) for type -11
WARN - Sending snapshot last zxid of peer is 0xe  zxid of leader is
0x1e
WARN - Sending snapshot last zxid of peer is 0x18  zxid of leader is
0x1eg
WARN - Sending snapshot last zxid of peer is 0x5002dc766  zxid of leader is
0x1e
WARN - Sending snapshot last zxid of peer is 0x1c  zxid of leader is
0x1e
ERROR - Unexpected exception causing shutdown while sock still open
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:78)
at java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
at
org.apache.jute.BinaryOutputArchive.writeInt(BinaryOutputArchive.java:55)
at
org.apache.zookeeper.data.StatPersisted.serialize(StatPersisted.java:116)
at org.apache.zookeeper.server.DataNode.serialize(DataNode.java:167)
at
org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:967)
at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982)
at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982)
at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982)
at
org.apache.zookeeper.server.DataTree.serialize(DataTree.java:1031)
at
org.apache.zookeeper.server.util.SerializeUtils.serializeSnapshot(SerializeUtils.java:104)
at
org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:426)
at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:331)
WARN - *** GOODBYE /10.138.34.212:33272 

Avinash


Re: Question on production readiness, deployment, data of BookKeeper / Hedwig

2010-10-08 Thread Benjamin Reed
 your guess is correct :) for bookkeeper and hedwig we released early 
to do the development in public. originally we developed bookkeeper as a 
distributed write-ahead log for the NameNode in HDFS, but while we were 
able to get a proof of concept going, the structure of the code of the 
NameNode makes it difficult to integrate well. we are currently working 
on fixing the write-ahead layer of the NameNode, which is taking a lot 
of time. in the meantime we applied bookkeeper to pub/sub and came up 
with hedwig, which is where most of our efforts are focused while the 
slow process of pushing changes to the NameNode proceeds.


ben

On 10/08/2010 02:32 PM, Jake Mannix wrote:

Hi Ben,

   To follow up with this question, which seems to be asking primarily about
Hedwig (and I guess the answer is: it's not in production yet, anywhere),
with one more about Bookkeeper: is BookKeeper used in production as a WAL
(or for any other use) anywhere?  If so, for what uses?

   Any info (even anecdotal) would be great!

   -jake

On Thu, Oct 7, 2010 at 9:15 AM, Benjamin Reedbr...@yahoo-inc.com  wrote:


  hi amit,

sorry for the late response. this week has been crunch time for a lot of
different things.

here are your answers:

production

1. it is still in prototype phase. we are evaluating different aspects, but
there is still some work to do to make it production ready. we also need to
get an engineering team to signup to stand behind it.

2. it's a generic pub/sub message bus. in some sense it is really a
datacenter solution with extensions for multi-data center operation, so it
is perfectly reasonable to use it in a single datacenter setting.

3. yeah, we have removed the hw.bash script. it had some hardcoded
assumptions and was a swiss army knife on steroids. we have been breaking it
up into simpler scripts.

4. session expiry really represents a fundamental connectivity problem, so
both bk and hedwig restart the component that gets the expired session
error.

data

1. yes.

2. once all subscribers have consumed a message there is a background
process that cleans it up.

3. yes, there is a replication factor, and we ensure replication on writes,
and there is a recovery tool to recover bookies that fail. we don't have to
worry about conflicts because there is only a single writer for a given
ledger. because of this we do not need to do quorum reads.

documentation

yes, this is something we need to work on. i'll see if i can push out some
of our hello world applications. we'd also like to put a JMS API on top so
that the API is more familiar (and documented :). i don't want to delay the
answers to your other questions, so let me answer that HedwigSubscriber is
the class for clients. the other classes are internal. (for cross data
center operation, hubs use a special kind of subscription to do cross data
center updates.)

ben

On 10/05/2010 10:32 PM, amit jaiswal wrote:


Hi,

In Hedwig talk (http://vimeo.com/13282102), it was mentioned that the
primary
use case for Hedwig comes from the distributed key-value store PNUTS in
Yahoo!,
but also said that the work is new.

Could you please comment on the following:

Production readiness / Deployment
1. What is the production readiness of Hedwig / BookKeeper? Is it being
used
anywhere (like in PNUTS)?
2. Is Hedwig designed to use as a generic message bus or only for
multi-datacenter operations?
3. Hedwig installation and deployment is done through a script hw.bash,
but that
is difficult to use, especially in a production environment. Are there any
other
packages available that can simplify the deployment of hedwig?
4. How does BK/Hedwig handle zookeeper session expiry?

Data Deletion, Handling data loss, Quorum
1. Does BookKeeper support deletion of old log entries which have been
consumed?
2. How does Hedwig handle the case when all subscribers have consumed all
the
messages? In the talk, it was said that a subscriber can come back after
hours,
days or weeks. Is there any data retention / expiration policy for the
data that
is published?
3. How does Hedwig handle data loss? There is a replication factor, and a
write
operation must be accepted by a majority of the bookies, but how are data
conflicts handled? Is there any possibility of data conflict at all? Is the
replication only for recovery? When the hub is reading data from bookies,
does
it read from all the bookies to satisfy a quorum read?

Code
What is the difference between PubSubServer, HedwigSubscriber, and
HedwigHubSubscriber? Is there any HelloWorld program that simply
illustrates how
to instantiate a hedwig client, and publish/consume messages.
(HedwigBenchmark
class is helpful, but I was looking for something like API documentation.)

-regards
Amit







Re: Question on production readiness, deployment, data of BookKeeper / Hedwig

2010-10-07 Thread Benjamin Reed

 hi amit,

sorry for the late response. this week has been crunch time for a lot of 
different things.


here are your answers:

production

1. it is still in prototype phase. we are evaluating different aspects, 
but there is still some work to do to make it production ready. we also 
need to get an engineering team to sign up to stand behind it.


2. it's a generic pub/sub message bus. in some sense it is really a 
datacenter solution with extensions for multi-data center operation, so 
it is perfectly reasonable to use it in a single datacenter setting.


3. yeah, we have removed the hw.bash script. it had some hardcoded 
assumptions and was a swiss army knife on steroids. we have been 
breaking it up into simpler scripts.


4. session expiry really represents a fundamental connectivity problem, 
so both bk and hedwig restart the component that gets the expired 
session error.


data

1. yes.

2. once all subscribers have consumed a message there is a background 
process that cleans it up.


3. yes, there is a replication factor, we ensure replication on writes, 
and there is a recovery tool to recover bookies that fail. we don't have 
to worry about conflicts because there is only a single writer for a 
given ledger. because of this we do not need to do quorum reads.


documentation

yes, this is something we need to work on. i'll see if i can push out 
some of our hello world applications. we'd also like to put a JMS API on 
top so that the API is more familiar (and documented :). i don't want to 
delay the answers to your other questions, so let me answer that 
HedwigSubscriber is the class for clients. the other classes are 
internal. (for cross-data-center operation, hubs use a special kind of 
subscription to do cross-data-center updates.)


ben

On 10/05/2010 10:32 PM, amit jaiswal wrote:

Hi,

In Hedwig talk (http://vimeo.com/13282102), it was mentioned that the primary
use case for Hedwig comes from the distributed key-value store PNUTS in Yahoo!,
but also said that the work is new.

Could you please comment on the following:

Production readiness / Deployment
1. What is the production readiness of Hedwig / BookKeeper? Is it being used
anywhere (like in PNUTS)?
2. Is Hedwig designed to use as a generic message bus or only for
multi-datacenter operations?
3. Hedwig installation and deployment is done through a script hw.bash, but that
is difficult to use, especially in a production environment. Are there any other
packages available that can simplify the deployment of hedwig?
4. How does BK/Hedwig handle zookeeper session expiry?

Data Deletion, Handling data loss, Quorum
1. Does BookKeeper support deletion of old log entries which have been consumed?
2. How does Hedwig handle the case when all subscribers have consumed all the
messages? In the talk, it was said that a subscriber can come back after hours,
days or weeks. Is there any data retention / expiration policy for the data that
is published?
3. How does Hedwig handle data loss? There is a replication factor, and a write
operation must be accepted by a majority of the bookies, but how are data
conflicts handled? Is there any possibility of data conflict at all? Is the
replication only for recovery? When the hub is reading data from bookies, does
it read from all the bookies to satisfy a quorum read?

Code
What is the difference between PubSubServer, HedwigSubscriber, and
HedwigHubSubscriber? Is there any HelloWorld program that simply illustrates how
to instantiate a hedwig client, and publish/consume messages. (HedwigBenchmark
class is helpful, but I was looking for something like API documentation.)

-regards
Amit




Re: Zookeeper on 60+Gb mem

2010-10-05 Thread Benjamin Reed
 you will need to time how long it takes to read all that state back in 
and adjust the initLimit accordingly. it will probably take a while to 
pull all that data into memory.


ben

On 10/05/2010 11:36 AM, Avinash Lakshman wrote:

I have run it with over 5 GB of heap and over 10M znodes. We will definitely run
it with over 64 GB of heap. Technically I do not see any limitation.
However, I will let the experts chime in.

Avinash

On Tue, Oct 5, 2010 at 11:14 AM, Mahadev Konar maha...@yahoo-inc.com wrote:


Hi Maarten,
  I definitely know of a group which uses around a 3GB heap for
zookeeper, but I have never heard of someone with such huge requirements. I
would say it would definitely be a learning experience with such high memory,
which I definitely think would be very useful for others in the community as
well.

Thanks
mahadev


On 10/5/10 11:03 AM, Maarten Koopmans maar...@vrijheid.net wrote:


Hi,

I just wondered: has anybody ever run zookeeper to the max on a 68GB
quadruple extra large high memory EC2 instance? With, say, 60GB allocated or
so?

Because EC2 with EBS is a nice way to grow your zookeeper cluster (data on
the EBS volumes, upgrade as your memory utilization grows) - I just wonder
what the limits are there, or if I am going where angels fear to tread...

--Maarten







Re: ZK compatability

2010-10-01 Thread Benjamin Reed
 we should also point out that our ops guys here at yahoo! don't like 
the "break at major releases" clause. i imagine when we do the next major 
release we will try to be one release backwards compatible. (although we 
shouldn't promise it until we successfully do it once :)


ben

On 09/30/2010 10:29 AM, Patrick Hunt wrote:

Historically, major releases can have non-backward-compatible changes. However,
if you look back through the release history you'll see that the last time that
happened was Oct 2008, when we moved the project from SourceForge to Apache.

Patrick

On Tue, Sep 28, 2010 at 11:37 AM, Jun Rao jun...@gmail.com wrote:


What about major releases going forward? Thanks,

Jun

On Mon, Sep 27, 2010 at 10:32 PM, Patrick Hunt ph...@apache.org wrote:


In general yes, minor and bug fix releases are fully backward compatible.

Patrick


On Sun, Sep 26, 2010 at 9:11 PM, Jun Rao jun...@gmail.com wrote:


Hi,

Does ZK support (and plan to support in the future) backward

compatibility

(so that a new client can talk to an old server and vice versa)?

Thanks

Jun







Re: closing session on socket close vs waiting for timeout

2010-09-10 Thread Benjamin Reed
 the problem is that followers don't track session timeouts. they track 
when they last heard from the sessions that are connected to them and 
they periodically propagate this information to the leader. the leader 
is the one that expires the session. your technique only works when the 
client is connected to the leader.


one thing you can do is generate a close request for the socket and push 
that through the system. that will cause it to get propagated through 
the followers and processed at the leader. it would also allow you to 
get your functionality without touching the processing pipeline.


the thing that worries me about this functionality in general is that 
network anomalies can cause a whole raft of sessions to get expired in 
this way. for example, you have 3 servers with load spread well; there 
is a networking glitch that causes clients to abandon a server; suddenly 
1/3 of your clients will get expired sessions.


ben

On 09/10/2010 12:17 PM, Fournier, Camille F. [Tech] wrote:

Ben, could you explain a bit more why you think this won't work? I'm trying to 
decide if I should put in the work to take the POC I wrote and complete it, but 
I don't really want to waste my time if there's a fundamental reason it's a bad 
idea.

Thanks,
Camille

-Original Message-
From: Benjamin Reed [mailto:br...@yahoo-inc.com]
Sent: Wednesday, September 08, 2010 4:03 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

unfortunately, that only works on the standalone server.

ben

On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote:

This would be the ideal solution to this problem I think.
Poking around the (3.3) code to figure out how hard it would be to implement, I 
figure one way to do it would be to modify the session timeout to the min 
session timeout and touch the connection before calling close when you get 
certain exceptions in NIOServerCnxn.doIO. I did this (removing the code in 
touch session that returns if the tickTime is greater than the expire time) and 
it worked (in the standalone server anyway). Interesting solution, or total 
hack that will not work beyond most basic test case?

C

(forgive lack of actual code in this email)

-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Tuesday, September 07, 2010 1:11 PM
To: zookeeper-user@hadoop.apache.org
Cc: Benjamin Reed
Subject: Re: closing session on socket close vs waiting for timeout

This really is, just as Ben says a problem of false positives and false
negatives in detecting session
expiration.

On the other hand, the current algorithm isn't really using all the
information available.  The current algorithm is
using time since last client initiated heartbeat.  The new proposal is
somewhat worse in that it proposes to use
just the boolean has-TCP-disconnect-happened.

Perhaps it would be better to use multiple features in order to decrease
both false positives and false negatives.

For instance, I could imagine that we use the following features:

- time since last client heartbeat or disconnect or reconnect

- what was the last event? (a heartbeat or a disconnect or a reconnect)

Then the expiration algorithm could use a relatively long time since last
heartbeat and a relatively short time since last disconnect to mark a
session as disconnected.

Wouldn't this avoid expiration during GC and cluster partition and cause
expiration quickly after a client disconnect?
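The two-feature rule Ted describes could be sketched as follows. This is an illustration of the idea only; the class, the thresholds, and the event model are hypothetical, not ZooKeeper code.

```java
// Sketch of a two-feature expiration rule: a relatively long fuse after a
// heartbeat (tolerates GC pauses and partitions), a relatively short one
// after an observed disconnect. Names and thresholds are hypothetical.
public class SessionLiveness {
    public enum LastEvent { HEARTBEAT, DISCONNECT, RECONNECT }

    private final long heartbeatExpireMs;   // long fuse
    private final long disconnectExpireMs;  // short fuse

    public SessionLiveness(long heartbeatExpireMs, long disconnectExpireMs) {
        this.heartbeatExpireMs = heartbeatExpireMs;
        this.disconnectExpireMs = disconnectExpireMs;
    }

    /** Expire when the last event is stale, with a shorter fuse after a disconnect. */
    public boolean isExpired(LastEvent lastEvent, long lastEventTimeMs, long nowMs) {
        long idle = nowMs - lastEventTimeMs;
        long limit = (lastEvent == LastEvent.DISCONNECT)
                ? disconnectExpireMs : heartbeatExpireMs;
        return idle > limit;
    }

    public static void main(String[] args) {
        SessionLiveness l = new SessionLiveness(30000, 5000);
        // Heartbeat 10s ago (e.g. a long GC): still within the long fuse.
        System.out.println(l.isExpired(LastEvent.HEARTBEAT, 0, 10000)); // false
        // Socket dropped 10s ago: expired quickly.
        System.out.println(l.isExpired(LastEvent.DISCONNECT, 0, 10000)); // true
    }
}
```

With these example thresholds a GC pause or brief partition survives the long heartbeat fuse, while a session whose last event was a disconnect expires on the short one.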


On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt ph...@apache.org wrote:



That's a good point, however with suitable documentation, warnings and such
it seems like a reasonable feature to provide for those users who require
it. Used in moderation it seems fine to me. Perhaps we also make it
configurable at the server level for those administrators/ops who don't want
to deal with it (disable the feature entirely, or only enable on particular
servers, etc...).

Patrick

On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed br...@yahoo-inc.com wrote:



if this mechanism were used very often, we would get a huge number of
session expirations when a server fails. you are trading fast error
detection for the ability to tolerate temporary network and server outages.

to be honest this seems like something that in theory sounds like it will
work in practice, but once deployed we start getting session expirations
for cases that we really do not want or expect.

ben


On 09/01/2010 12:47 PM, Patrick Hunt wrote:



Ben, in this case the session would be tied directly to the connection,
we'd explicitly deny session re-establishment for this session type (so
4 would fail). Would that address your concern, others?

Patrick

On 09/01/2010 10:03 AM, Benjamin Reed wrote:




i'm a bit skeptical that this is going to work out properly. a server
may receive a socket reset even though the client is still alive:

1) client sends a request to a server
2) client is partitioned from the server
3) server starts trying to send response
4) client reconnects to a different server
5) partition heals
6) server gets a reset from client

at step 6 i don't think you want to delete the ephemeral nodes.

Re: closing session on socket close vs waiting for timeout

2010-09-10 Thread Benjamin Reed
 ah dang, i should have said generate a close request for the session 
and push that through the system.


ben

On 09/10/2010 01:01 PM, Benjamin Reed wrote:

   the problem is that followers don't track session timeouts. they track
when they last heard from the sessions that are connected to them and
they periodically propagate this information to the leader. the leader
is the one that expires the session. your technique only works when the
client is connected to the leader.

one thing you can do is generate a close request for the socket and push
that through the system. that will cause it to get propagated through
the followers and processed at the leader. it would also allow you to
get your functionality without touching the processing pipeline.

the thing that worries me about this functionality in general is that
network anomalies can cause a whole raft of sessions to get expired in
this way. for example, you have 3 servers with load spread well; there
is a networking glitch that causes clients to abandon a server; suddenly
1/3 of your clients will get expired sessions.

ben

On 09/10/2010 12:17 PM, Fournier, Camille F. [Tech] wrote:

Ben, could you explain a bit more why you think this won't work? I'm trying to 
decide if I should put in the work to take the POC I wrote and complete it, but 
I don't really want to waste my time if there's a fundamental reason it's a bad 
idea.

Thanks,
Camille

-Original Message-
From: Benjamin Reed [mailto:br...@yahoo-inc.com]
Sent: Wednesday, September 08, 2010 4:03 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

unfortunately, that only works on the standalone server.

ben

On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote:

This would be the ideal solution to this problem I think.
Poking around the (3.3) code to figure out how hard it would be to implement, I 
figure one way to do it would be to modify the session timeout to the min 
session timeout and touch the connection before calling close when you get 
certain exceptions in NIOServerCnxn.doIO. I did this (removing the code in 
touch session that returns if the tickTime is greater than the expire time) and 
it worked (in the standalone server anyway). Interesting solution, or total 
hack that will not work beyond most basic test case?

C

(forgive lack of actual code in this email)

-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Tuesday, September 07, 2010 1:11 PM
To: zookeeper-user@hadoop.apache.org
Cc: Benjamin Reed
Subject: Re: closing session on socket close vs waiting for timeout

This really is, just as Ben says a problem of false positives and false
negatives in detecting session
expiration.

On the other hand, the current algorithm isn't really using all the
information available.  The current algorithm is
using time since last client initiated heartbeat.  The new proposal is
somewhat worse in that it proposes to use
just the boolean has-TCP-disconnect-happened.

Perhaps it would be better to use multiple features in order to decrease
both false positives and false negatives.

For instance, I could imagine that we use the following features:

- time since last client heartbeat or disconnect or reconnect

- what was the last event? (a heartbeat or a disconnect or a reconnect)

Then the expiration algorithm could use a relatively long time since last
heartbeat and a relatively short time since last disconnect to mark a
session as disconnected.

Wouldn't this avoid expiration during GC and cluster partition and cause
expiration quickly after a client disconnect?


On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt ph...@apache.org wrote:



That's a good point, however with suitable documentation, warnings and such
it seems like a reasonable feature to provide for those users who require
it. Used in moderation it seems fine to me. Perhaps we also make it
configurable at the server level for those administrators/ops who don't want
to deal with it (disable the feature entirely, or only enable on particular
servers, etc...).

Patrick

On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed br...@yahoo-inc.com wrote:



if this mechanism were used very often, we would get a huge number of
session expirations when a server fails. you are trading fast error
detection for the ability to tolerate temporary network and server outages.

to be honest this seems like something that in theory sounds like it will
work in practice, but once deployed we start getting session expirations
for cases that we really do not want or expect.

ben


On 09/01/2010 12:47 PM, Patrick Hunt wrote:



Ben, in this case the session would be tied directly to the connection,
we'd explicitly deny session re-establishment for this session type (so
4 would fail). Would that address your concern, others?

Patrick

On 09/01/2010 10:03 AM, Benjamin Reed wrote:




i'm a bit skeptical that this is going to work out properly. a server
may receive a socket reset even though the client is still alive.

Re: closing session on socket close vs waiting for timeout

2010-09-08 Thread Benjamin Reed

unfortunately, that only works on the standalone server.

ben

On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote:

This would be the ideal solution to this problem I think.
Poking around the (3.3) code to figure out how hard it would be to implement, I 
figure one way to do it would be to modify the session timeout to the min 
session timeout and touch the connection before calling close when you get 
certain exceptions in NIOServerCnxn.doIO. I did this (removing the code in 
touch session that returns if the tickTime is greater than the expire time) and 
it worked (in the standalone server anyway). Interesting solution, or total 
hack that will not work beyond most basic test case?

C

(forgive lack of actual code in this email)

-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Tuesday, September 07, 2010 1:11 PM
To: zookeeper-user@hadoop.apache.org
Cc: Benjamin Reed
Subject: Re: closing session on socket close vs waiting for timeout

This really is, just as Ben says a problem of false positives and false
negatives in detecting session
expiration.

On the other hand, the current algorithm isn't really using all the
information available.  The current algorithm is
using time since last client initiated heartbeat.  The new proposal is
somewhat worse in that it proposes to use
just the boolean has-TCP-disconnect-happened.

Perhaps it would be better to use multiple features in order to decrease
both false positives and false negatives.

For instance, I could imagine that we use the following features:

- time since last client heartbeat or disconnect or reconnect

- what was the last event? (a heartbeat or a disconnect or a reconnect)

Then the expiration algorithm could use a relatively long time since last
heartbeat and a relatively short time since last disconnect to mark a
session as disconnected.

Wouldn't this avoid expiration during GC and cluster partition and cause
expiration quickly after a client disconnect?


On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt ph...@apache.org wrote:

   

That's a good point, however with suitable documentation, warnings and such
it seems like a reasonable feature to provide for those users who require
it. Used in moderation it seems fine to me. Perhaps we also make it
configurable at the server level for those administrators/ops who don't want
to deal with it (disable the feature entirely, or only enable on particular
servers, etc...).

Patrick

On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed br...@yahoo-inc.com wrote:

 

if this mechanism were used very often, we would get a huge number of
session expirations when a server fails. you are trading fast error
detection for the ability to tolerate temporary network and server outages.

to be honest this seems like something that in theory sounds like it will
work in practice, but once deployed we start getting session expirations
for cases that we really do not want or expect.

ben


On 09/01/2010 12:47 PM, Patrick Hunt wrote:

   

Ben, in this case the session would be tied directly to the connection,
we'd explicitly deny session re-establishment for this session type (so
4 would fail). Would that address your concern, others?

Patrick

On 09/01/2010 10:03 AM, Benjamin Reed wrote:


 

i'm a bit skeptical that this is going to work out properly. a server
may receive a socket reset even though the client is still alive:

1) client sends a request to a server
2) client is partitioned from the server
3) server starts trying to send response
4) client reconnects to a different server
5) partition heals
6) server gets a reset from client

at step 6 i don't think you want to delete the ephemeral nodes.

ben

On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:


   

Yes that's right. Which network issues can cause the socket to close
without the initiating process closing the socket? In my limited
experience in this area network issues were more prone to leave dead
sockets open rather than vice versa so I don't know what to look out
for.

Thanks,
Camille

-Original Message-
From: Dave Wright [mailto:wrig...@gmail.com]
Sent: Tuesday, August 31, 2010 1:14 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

I think he's saying that if the socket closes because of a crash (i.e.
not a
normal zookeeper close request) then the session stays alive until the
session timeout, which is of course true since ZK allows reconnection
and
resumption of the session in case of disconnect due to network issues.

-Dave Wright

On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:



 

That doesn't sound right to me.

Is there a Zookeeper expert in the house?

On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]
camille.fourn...@gs.com   wrote:



   

I foolishly did not investigate the ZK code closely enough and it seems
that closing the socket still waits for the session timeout to remove the
session.

Re: closing session on socket close vs waiting for timeout

2010-09-06 Thread Benjamin Reed
if this mechanism were used very often, we would get a huge number of 
session expirations when a server fails. you are trading fast error 
detection for the ability to tolerate temporary network and server outages.


to be honest this seems like something that in theory sounds like it 
will work in practice, but once deployed we start getting session 
expirations for cases that we really do not want or expect.


ben

On 09/01/2010 12:47 PM, Patrick Hunt wrote:

Ben, in this case the session would be tied directly to the connection,
we'd explicitly deny session re-establishment for this session type (so
4 would fail). Would that address your concern, others?

Patrick

On 09/01/2010 10:03 AM, Benjamin Reed wrote:
   

i'm a bit skeptical that this is going to work out properly. a server
may receive a socket reset even though the client is still alive:

1) client sends a request to a server
2) client is partitioned from the server
3) server starts trying to send response
4) client reconnects to a different server
5) partition heals
6) server gets a reset from client

at step 6 i don't think you want to delete the ephemeral nodes.

ben

On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:
 

Yes that's right. Which network issues can cause the socket to close
without the initiating process closing the socket? In my limited
experience in this area network issues were more prone to leave dead
sockets open rather than vice versa so I don't know what to look out for.

Thanks,
Camille

-Original Message-
From: Dave Wright [mailto:wrig...@gmail.com]
Sent: Tuesday, August 31, 2010 1:14 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

I think he's saying that if the socket closes because of a crash (i.e.
not a
normal zookeeper close request) then the session stays alive until the
session timeout, which is of course true since ZK allows reconnection and
resumption of the session in case of disconnect due to network issues.

-Dave Wright

On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

   

That doesn't sound right to me.

Is there a Zookeeper expert in the house?

On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]
camille.fourn...@gs.com  wrote:

 

I foolishly did not investigate the ZK code closely enough and it seems
that closing the socket still waits for the session timeout to remove the
session.
   
 




Re: closing session on socket close vs waiting for timeout

2010-09-01 Thread Benjamin Reed
i'm a bit skeptical that this is going to work out properly. a server 
may receive a socket reset even though the client is still alive:


1) client sends a request to a server
2) client is partitioned from the server
3) server starts trying to send response
4) client reconnects to a different server
5) partition heals
6) server gets a reset from client

at step 6 i don't think you want to delete the ephemeral nodes.

ben

On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:

Yes that's right. Which network issues can cause the socket to close without 
the initiating process closing the socket? In my limited experience in this 
area network issues were more prone to leave dead sockets open rather than vice 
versa so I don't know what to look out for.

Thanks,
Camille

-Original Message-
From: Dave Wright [mailto:wrig...@gmail.com]
Sent: Tuesday, August 31, 2010 1:14 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

I think he's saying that if the socket closes because of a crash (i.e. not a
normal zookeeper close request) then the session stays alive until the
session timeout, which is of course true since ZK allows reconnection and
resumption of the session in case of disconnect due to network issues.

-Dave Wright

On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

   

That doesn't sound right to me.

Is there a Zookeeper expert in the house?

On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]
camille.fourn...@gs.com  wrote:

 

I foolishly did not investigate the ZK code closely enough and it seems
that closing the socket still waits for the session timeout to remove the
session.
   
 




Re: Session expiration caused by time change

2010-08-20 Thread Benjamin Reed
i put up a patch that should address the problem. now i need to write a 
test case. the only way i can think of is to change the call to 
System.currentTimeMillis to a utility class that calls 
System.currentTimeMillis that i can mock for testing. any better ideas?
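The utility-class approach could be sketched like this: every call site reads time through a small static facade whose source a test can replace. The names here (Time, Source, setSource) are illustrative, not the classes ZooKeeper actually ships.

```java
// Minimal sketch of a mockable clock: production reads the system clock,
// a test installs a controllable source and can simulate clock jumps
// without sleeping. Names are hypothetical.
public class Time {
    /** Replaceable time source. */
    public interface Source { long currentTimeMillis(); }

    private static volatile Source source = System::currentTimeMillis;

    public static long currentTimeMillis() { return source.currentTimeMillis(); }

    /** Test hook: install a fake clock. */
    public static void setSource(Source s) { source = s; }

    public static void main(String[] args) {
        long[] fake = {1000};
        Time.setSource(() -> fake[0]);
        System.out.println(Time.currentTimeMillis()); // 1000
        fake[0] += 60000;  // simulate a 60-second forward jump, no sleeping
        System.out.println(Time.currentTimeMillis()); // 61000
    }
}
```

A test for the expiry thread would then install the fake source, advance it past the session timeout in one step, and assert on what the tracker does.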


ben

On 08/19/2010 03:53 PM, Ted Dunning wrote:

Put in a four letter command that will put the server to sleep for 15
seconds!

:-)

On Thu, Aug 19, 2010 at 3:51 PM, Benjamin Reed br...@yahoo-inc.com wrote:

   

i'm updating ZOOKEEPER-366 with this discussion and will try to get a patch
out. Qing (or anyone else), can you reproduce it pretty easily?

 




Re: Session expiration caused by time change

2010-08-19 Thread Benjamin Reed
yes, you are right. we could do this. it turns out that the expiration 
code is very simple:


    while (running) {
        currentTime = System.currentTimeMillis();
        if (nextExpirationTime > currentTime) {
            this.wait(nextExpirationTime - currentTime);
            continue;
        }
        SessionSet set;
        set = sessionSets.remove(nextExpirationTime);
        if (set != null) {
            for (SessionImpl s : set.sessions) {
                sessionsById.remove(s.sessionId);
                expirer.expire(s);
            }
        }
        nextExpirationTime += expirationInterval;
    }

so we can detect a jump very easily: if nextExpirationTime << currentTime,
we have jumped ahead in time.


now the question is, what do we do with this information?

option 1) we could figure out the jump (currentTime - nextExpirationTime 
is a good estimate) and move all of the sessions forward by that amount.
option 2) we could converge on the time by having a policy to always 
wait at least half a tick.


there probably are other options as well. i kind of like option 2. worst 
case is it will make the sessions expire in half the time that they 
should, but this shouldn't be too much of a problem since clients send a 
ping if they are idle for 1/3 of their session timeout.
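Option 2 can be simulated with a fake clock: when the expiry loop finds itself behind, it processes only one bucket per wakeup and refuses to process another for at least half an interval, so during catch-up sessions expire at most twice as fast instead of all at once. The code below reduces session buckets to their timestamps; it is a sketch of the idea, not the patch attached to ZOOKEEPER-366.

```java
import java.util.ArrayList;
import java.util.List;

// Runnable sketch of "option 2" with a simulated clock. expireSchedule
// replays the expiry loop over a series of wakeup instants and returns
// which buckets got expired, one bucket at most per half interval.
public class JumpTolerantExpiry {
    static List<Long> expireSchedule(long[] wakeups, long start, long interval) {
        List<Long> expiredBuckets = new ArrayList<>();
        long next = start;       // next bucket due to expire
        long earliest = 0;       // do not expire another bucket before this
        for (long now : wakeups) {
            if (next > now || earliest > now) {
                continue;        // nothing due yet, or still backing off
            }
            expiredBuckets.add(next);        // expire exactly one bucket
            next += interval;
            earliest = now + interval / 2;   // always wait >= half a tick
        }
        return expiredBuckets;
    }

    public static void main(String[] args) {
        // Wakeups every second, then the clock jumps from t=3s to t=63s.
        long[] wakeups = {1000, 2000, 3000, 63000, 63500, 64000};
        // During catch-up only one bucket is expired per half tick, instead
        // of ~60 overdue buckets (and their sessions) being dumped at once.
        System.out.println(expireSchedule(wakeups, 1000, 1000));
        // prints [1000, 2000, 3000, 4000, 5000, 6000]
    }
}
```

Note how the jump at t=63s does not expire everything that came due during the gap in one shot; live clients that keep pinging get re-bucketed ahead of the catch-up and survive.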


ben

On 08/19/2010 08:39 AM, Ted Dunning wrote:

True.  But it knows that there has been a jump.

Quiet time can be distinguished from clock shift by assuming that members of
the cluster
don't all jump at the same time.

I would imagine that a recent clock jump estimate could be kept and
buckets that would
otherwise expire due to such a jump could be given a bit of a second lease
on life, delaying
all of their expiration.  Since time-outs are relatively short, the server
would be able to forget
about the bump very shortly.

On Thu, Aug 19, 2010 at 8:22 AM, Benjamin Reed br...@yahoo-inc.com wrote:

   

if we try to use network messages to detect and correct the situation, it
seems like we would recreate the problem we are having with ntp, since that
is exactly what it does.

 




Re: Session expiration caused by time change

2010-08-19 Thread Benjamin Reed
if we can't rely on the clock, we cannot say things like "if ... for 5 
seconds".


also, clients connect to servers, not vice versa, so we cannot say 
things like "server can attempt to reconnect".


ben

On 08/19/2010 10:17 AM, Vishal K wrote:

Hi Ted,

I haven't given it serious thought yet, but I don't think it is necessary
for the cluster to keep track of time.

A node can make its own decision. For the sake of argument, let's say that we
have a client and a server with the following policy:
1. Client is supposed to send a ping to the server every 1 sec.
2. If the server does not hear from the client for 5 seconds, then the server
declares that the client is dead.
3. Similarly, if the client cannot communicate with the server for 5 seconds,
the client declares that the server is dead.

If the client receives a timeout (say while doing some IO) because of a time
jump, it should check the number of pings that have failed with the server.
If the number is 5, then this is a true failure. If the number is less than
5, then this is because of a time drift.

At the server side, the server can attempt to reconnect (or send a ping to
the client) after it receives a timeout. Thus, if the timeout occurred
because of time drift, the server will reconnect and continue. We should
of course have an upper bound on the number of retries, etc.
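Vishal's heuristic (trust the count of pings actually attempted rather than elapsed wall time alone) could be sketched like this; the names are hypothetical, not ZooKeeper code.

```java
// Sketch of the ping-counting heuristic: before declaring the peer dead on
// a timeout, check how many pings were actually sent and unanswered. If
// fewer than the timeout implies, the local clock probably jumped.
public class TimeoutClassifier {
    public enum Verdict { PEER_DEAD, CLOCK_JUMP }

    /**
     * @param timeoutMs   declared timeout (e.g. 5000 ms)
     * @param pingEveryMs ping interval (e.g. 1000 ms)
     * @param pingsFailed pings sent and unanswered since last contact
     */
    public static Verdict classify(long timeoutMs, long pingEveryMs, int pingsFailed) {
        long expectedPings = timeoutMs / pingEveryMs;  // e.g. 5
        // Timed out without having attempted the expected pings: the elapsed
        // wall time is suspect, so treat it as a clock jump and retry.
        return pingsFailed >= expectedPings ? Verdict.PEER_DEAD : Verdict.CLOCK_JUMP;
    }

    public static void main(String[] args) {
        System.out.println(classify(5000, 1000, 5)); // PEER_DEAD
        System.out.println(classify(5000, 1000, 2)); // CLOCK_JUMP
    }
}
```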

For ZK, it is important to handle time jumps on ZK leader.

   

I believe that the pattern of these problems is a slow slippage behind and a
sudden jump forward.

 


You won't see the slippage. You will mainly see a jump forward. Note that with
a large enough number of nodes, multiple nodes could see their time jumping
forward. Therefore, comparing time between two servers may not help.


   

On Thu, Aug 19, 2010 at 7:51 AM, Vishal K vishalm...@gmail.com wrote:

Hi,

I remember Ben had opened a jira for clock jumps earlier:
https://issues.apache.org/jira/browse/ZOOKEEPER-366. It is not uncommon to
have clocks jump forward in virtualized environments.

It is desirable to modify ZooKeeper to handle this situation (as much as
possible) internally. It would need to be done for both client - server
connections and server - server connections. One obvious solution is to
retry a few times (send ping) after getting a timeout. Another way is to
count the number of pings that have been sent after receiving the timeout.
If the number of pings does not match the expected number (say 5 ping
attempts should have finished for a 5 sec timeout), then wait till all the
pings are finished. In effect, do not completely rely on the clock. Any
comments?

-Vishal

On Thu, Aug 19, 2010 at 3:52 AM, Qing Yan qing...@gmail.com wrote:

Oh.. our servers are also running in a virtualized environment.

On Thu, Aug 19, 2010 at 2:58 PM, Martin Waite waite@gmail.com wrote:

Hi,

I have tripped over similar problems testing Red Hat Cluster in virtualised
environments.  I don't know whether recent linux kernels have improved their
interaction with VMWare, but in our environments clock drift caused by lost
ticks can be substantial, requiring NTP to sometimes jump the clock rather
than control acceleration.  In one of our internal production rigs, the
local NTP servers themselves were virtualised - causing absolute mayhem when
heavy loads hit the other guests on the same physical hosts.

The effect on RHCS (v2.0) is quite dramatic.  A forward jump in time by 10
seconds always causes a member to prematurely time-out on a network read,
causing the member to drop out and trigger a cluster reconfiguration.
Apparently NTP is integrated with RHCS version 3, but I don't know what is
meant by that.

I guess this post is not entirely relevant to ZK, but I am just making the
point that virtualisation (of NTP servers and/or clients) can cause repeated
premature timeouts.  On Linux, I believe that there is a class of timers
provided that is immune to this, but I doubt that there is a platform
independent way of coping with this.

My 2p.

regards,
Martin

On 18 August 2010 16:53, Patrick Hunt ph...@apache.org wrote:

Do you expect the time to be wrong frequently? If ntp is running it should
never get out of sync more than a small amount. As long as this is less than
~your timeout you should be fine.

Patrick


On 08/18/2010 01:04 AM, Qing Yan wrote:

Hi,

The testcase is fairly simple. We have a client which connects to ZK,
registers an ephemeral node and watches on it. Now change the client
machine's time - session killed..

Here is the log:

*2010-08-18 

Re: Session expiration caused by time change

2010-08-19 Thread Benjamin Reed
i'm updating ZOOKEEPER-366 with this discussion and will try to get a patch
out. Qing (or anyone else), can you reproduce it pretty easily?


thanx
ben

On 08/19/2010 09:29 AM, Ted Dunning wrote:

Nice (modulo inverting the < in your text).

Option 2 seems very simple.  That always attracts me.

On Thu, Aug 19, 2010 at 9:19 AM, Benjamin Reed br...@yahoo-inc.com wrote:

yes, you are right. we could do this. it turns out that the expiration code
is very simple:

while (running) {
    currentTime = System.currentTimeMillis();
    if (nextExpirationTime > currentTime) {
        this.wait(nextExpirationTime - currentTime);
        continue;
    }
    SessionSet set;
    set = sessionSets.remove(nextExpirationTime);
    if (set != null) {
        for (SessionImpl s : set.sessions) {
            sessionsById.remove(s.sessionId);
            expirer.expire(s);
        }
    }
    nextExpirationTime += expirationInterval;
}

so we can detect a jump very easily: if nextExpirationTime < currentTime,
we have jumped ahead in time.

now the question is, what do we do with this information?

option 1) we could figure out the jump (currentTime - nextExpirationTime is a
good estimate) and move all of the sessions forward by that amount.
option 2) we could converge on the time by having a policy to always wait
at least a half a tick time.

there probably are other options as well. i kind of like option 2. worst
case is it will make the sessions expire in half the time that they should,
but this shouldn't be too much of a problem since clients send a ping if
they are idle for 1/3 of their session timeout.

ben
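Option 2 can be sketched as a one-line change to how long the expiry thread sleeps (my own illustration of the idea in the mail, not the actual patch; the helper name is hypothetical): never sleep past the next expiration bucket, but always sleep at least half a tick, so a forward jump drains the backlog of buckets gradually rather than expiring everything at once.

```java
// Sketch of "option 2": clamp each wait of the session-expiry thread to
// at least half a tick. After a forward clock jump, untilExpiry goes very
// negative, but the thread still pauses tickTime/2 between buckets, so
// sessions expire at worst twice as fast -- not all at once.
public class ExpiryWait {
    /** How long the session-expiry thread should wait, in ms. */
    static long nextWait(long nextExpirationTime, long currentTime, long tickTime) {
        long untilExpiry = nextExpirationTime - currentTime; // negative after a jump
        return Math.max(untilExpiry, tickTime / 2);
    }
}
```

With a 2000 ms tick: a bucket 1500 ms away yields a 1500 ms wait, while a bucket the clock has already jumped 10 s past still yields a 1000 ms wait.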


On 08/19/2010 08:39 AM, Ted Dunning wrote:

 

True.  But it knows that there has been a jump.

Quiet time can be distinguished from clock shift by assuming that members of
the cluster don't all jump at the same time.

I would imagine that a recent clock jump estimate could be kept and buckets
that would otherwise expire due to such a jump could be given a bit of a
second lease on life, delaying all of their expiration.  Since time-outs are
relatively short, the server would be able to forget about the bump very
shortly.

On Thu, Aug 19, 2010 at 8:22 AM, Benjamin Reed br...@yahoo-inc.com wrote:

if we try to use network messages to detect and correct the situation, it
seems like we would recreate the problem we are having with ntp, since that
is exactly what it does.




Re: Weird ephemeral node issue

2010-08-17 Thread Benjamin Reed

there are two things to keep in mind when thinking about this issue:

1) if a zk client is disconnected from the cluster, the client is 
essentially in limbo. because the client cannot talk to a server it 
cannot know if its session is still alive. it also cannot close its session.


2) the client only finds out about session expiration events when the 
client reconnects to the cluster. if zk tells a client that its session 
is expired, the ephemerals that correspond to that session will already 
be cleaned up.


one of the main design points about zk is that zk only gives correct
information. if zk cannot give correct information, it basically says "i
don't know". connection loss exceptions and disconnected states are
basically "i don't know".


generally applications we design go into a safe mode, meaning they may 
serve reads but reject changes, when disconnected from zk and only kill 
themselves when they find out their session has expired.


ben

ps - session information is replicated to all zk servers, so if a leader 
dies, all replicas know the sessions that are currently active and their 
timeouts.
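The safe-mode pattern described above can be sketched as a small state gate (my own illustration; the enum here is a stand-in for ZooKeeper's KeeperState, and all names are hypothetical): keep serving reads while disconnected, refuse changes, and die only on a definite session expiration.

```java
// Sketch of the safe-mode pattern: while disconnected the app is in limbo,
// so it may serve reads but must reject changes; only an explicit
// session-expired event is treated as fatal.
public class SafeModeGate {
    // Stand-in for the relevant ZooKeeper Watcher.Event.KeeperState values.
    enum State { CONNECTED, DISCONNECTED, EXPIRED }

    private State state = State.CONNECTED;

    /** Fed from the watcher on every connection-state event. */
    void onEvent(State s) { state = s; }

    boolean allowRead()  { return state != State.EXPIRED; }
    boolean allowWrite() { return state == State.CONNECTED; }
    boolean shouldDie()  { return state == State.EXPIRED; }
}
```

The asymmetry is deliberate: a disconnect means "I don't know", so reads (possibly stale) stay available while writes are refused; only the expired event, which the cluster has already acted on, justifies shutting down.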


On 08/16/2010 09:03 PM, Ted Dunning wrote:

Ben or somebody else will have to repeat some of the detailed logic for
this, but it has
to do with the fact that you can't be sure what has happened during the
network partition.
One possibility is the one you describe, but another is that the partition
happened because
a majority of the ZK cluster lost power and you can't see the remaining
nodes.  Those nodes
will continue to serve any files in a read-only fashion.  If the partition
involves you losing
contact with the entire cluster at the same time a partition of the cluster
into a quorum and
a minority happens, then your ephemeral files could continue to exist at
least until the breach
in the cluster itself is healed.

Suffice it to say that there are only a few strategies that leave you with a
coherent picture
of the universe.  Importantly, you shouldn't assume that the ephemerals will
disappear at
the same time as the session expiration event is delivered.

On Mon, Aug 16, 2010 at 8:31 PM, Qing Yan qing...@gmail.com wrote:

Ouch, is this the current ZK behavior? This is unexpected: if the client
gets partitioned from the ZK cluster, he should get notified and take some
action (e.g. commit suicide), otherwise how can you tell whether an
ephemeral node is really up or down? Zombies can create synchronization
nightmares..



On Mon, Aug 16, 2010 at 7:22 PM, Dave Wright wrig...@gmail.com wrote:

Another possible cause for this that I ran into recently with the c client -
you don't get the session expired notification until you are reconnected to
the quorum and it informs you the session is lost.  If you get disconnected
and can't reconnect you won't get the notification.  Personally I think the
client api should track the session expiration time locally and inform you
once it's expired.

On Aug 16, 2010 2:09 AM, Qing Yan qing...@gmail.com wrote:

Hi Ted,

Do you mean a GC problem can prevent delivery of the SESSION EXPIRED event?
Hum... so you have met this problem before?
I didn't see any OOM though, will look into it more.


On Mon, Aug 16, 2010 at 12:46 PM, Ted Dunning ted.dunn...@gmail.com wrote:

I am assuming that y...
 
   
 




Re: A question about Watcher

2010-08-16 Thread Benjamin Reed
zookeeper takes care of reregistering all watchers on reconnect. you 
don't need to do anything.


ben

On 08/16/2010 09:04 AM, Qian Ye wrote:

Hi all:

Will the watchers of a client be lost when the client disconnects from a
Zookeeper server? It is said at
http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkWatches that

*When a client reconnects, any previously registered watches will be
reregistered and triggered if needed. In general this all occurs
transparently.* It means that we need not do anything extra about watchers
if a client disconnects from Zookeeper server A and reconnects to Zookeeper
server B, doesn't it? Or should I reregister all the watchers if this kind
of reconnection happens?

thx~
   




Re: A question about Watcher

2010-08-16 Thread Benjamin Reed

good point ted! i should have waited a bit longer before responding :)

ben

On 08/16/2010 09:20 AM, Ted Dunning wrote:

There are two different concepts.  One is connection loss.  Watchers survive
this and the client automatically connects
to another member of the ZK cluster.

The other is session expiration.  Watchers do not survive this.  This
happens when a client does not provide timely
evidence that it is alive and is marked as having disappeared by the
cluster.

On Mon, Aug 16, 2010 at 9:04 AM, Qian Ye yeqian@gmail.com wrote:

Hi all:

Will the watchers of a client be lost when the client disconnects from a
Zookeeper server? It is said at

http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkWatches that

*When a client reconnects, any previously registered watches will be
reregistered and triggered if needed. In general this all occurs
transparently.* It means that we need not do anything extra about watchers
if a client disconnects from Zookeeper server A and reconnects to Zookeeper
server B, doesn't it? Or should I reregister all the watchers if this kind
of reconnection happens?

thx~
--
With Regards!

Ye, Qian

 




Re: A question about Watcher

2010-08-16 Thread Benjamin Reed
the client does keep track of the watches that it has outstanding. when 
it reconnects to a new server it tells the server what it is watching 
for and the last view of the system that it had.


ben

On 08/16/2010 09:28 AM, Qian Ye wrote:

thx for the explanation. Since the watcher can be preserved when the client
switches the zookeeper server it connects to, does that mean all the watcher
information will be saved on all the zookeeper servers? I didn't find any
code showing that the client can hold the watcher information.


On Tue, Aug 17, 2010 at 12:21 AM, Ted Dunning ted.dunn...@gmail.com wrote:

I should correct this.  The watchers will deliver a session expiration
event, but since the connection is closed at that point no further
events will be delivered and the cluster will remove them.  This is as good
as the watchers disappearing.

On Mon, Aug 16, 2010 at 9:20 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 

The other is session expiration.  Watchers do not survive this.  This
happens when a client does not provide timely
evidence that it is alive and is marked as having disappeared by the
cluster.

   
 



   




Re: How to handle Node does not exist error?

2010-08-12 Thread Benjamin Reed
i thought there was a jira about supporting embedded zookeeper. (i 
remember rejecting a patch to fix it. one of the problems is that we 
have a couple of places that do System.exit().) i can't seem to find it 
though.


one case that would be great for embedding is writing test cases, so i 
think it would be useful for that.


ben

On 08/12/2010 03:25 PM, Ted Dunning wrote:

I am not saying that the API shouldn't support embedded ZK.

I am just saying that it is almost always a bad idea.  It isn't that I am
asking you to not do it, it is just that I am describing the experience I
have had and that I have seen others have.  In a nutshell, embedding leads
to problems and it isn't hard to see why.

On Thu, Aug 12, 2010 at 3:02 PM, Vishal K vishalm...@gmail.com wrote:

2. With respect to Ted's point about backward compatibility, I would
suggest
to take an approach of having an API to support embedded ZK instead of
asking users to not embed ZK.

 




Re: ZK recovery questions

2010-07-21 Thread Benjamin Reed
i did a benchmark a while back to see the effect of turning off the 
disk. (it wasn't as big as you would think.) i had to modify the code. 
there is an option to turn off the sync in the config that will get you 
most of the performance you would get by turning off the disk entirely.


ben
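For reference, the "turn off the sync" knob Ben mentions is, to my knowledge, the `zookeeper.forceSync` system property (treat the exact property name as an assumption to verify against your version's documentation); a sketch of how it would be passed when starting the server:

```shell
# Sketch (assumed property name): skip the fsync of the transaction log
# before acknowledging writes -- faster, but a crash can lose acked updates.
java -Dzookeeper.forceSync=no \
     -cp zookeeper.jar:conf \
     org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg
```

This trades durability for latency, so it only makes sense for benchmarks or for data you can afford to lose.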

On 07/20/2010 11:01 PM, Ashwin Jayaprakash wrote:

I did try a quick test on Windows (yes, some of us use Windows :)

I thought simply changing the dataDir to the /dev/null equivalent on
Windows would do the trick. It didn't work. It looks like a Java issue
because I noticed inconsistencies in the File API regarding this. I wrote
about it here -
http://javaforu.blogspot.com/2010/07/devnull-on-windows.html

BTW the Windows equivalent is nul.

This is the error I got on Windows (below). The mkdirs() returns false. As
noted on my blog, it returns true for some cases.

2010-07-20 22:25:47,851 - FATAL [main:zookeeperserverm...@62] - Unexpected
exception, exiting abnormally
java.io.IOException: Unable to create data directory nul:\version-2
 at
org.apache.zookeeper.server.persistence.FileTxnSnapLog.init(FileTxnSnapLog.java:79)
 at
org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:102)
 at
org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:85)
 at
org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:51)
 at
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:108)
 at
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:76)


Ashwin.
   




Re: Do implementations of Watcher need to be thread-safe?

2010-07-21 Thread Benjamin Reed
as long as a watcher object is only used with a single ZooKeeper object 
it will be called by the same thread.


ben

On 07/21/2010 11:12 AM, Joshua Ball wrote:

Hi,

Do implementations of Watcher need to be thread-safe, or can I assume
that process(...) will always be called by the same thread?

Thanks,
Josh Ua Ball
   




Re: BookKeeper Doubts

2010-07-19 Thread Benjamin Reed

you have concluded correctly.

1) bookkeeper was designed for a process to use as a write-ahead log, so 
as a simplifying assumption we assume a single writer to a log. we 
should be throwing an exception if you try to write to a handle that you 
obtained using openLedger. can you open a jira for that?


2) this is mostly true, there are some exceptions. the creator of a 
ledger can read entries even though the ledger is still being written 
to. we would like to add the ability for a reader to assert the last 
entry in a ledger and read up to that entry, but this is not yet in the 
code.


3) there is one other bug you are seeing, before a ledger can be read, 
it must be closed. as your code shows, a process can open a ledger for 
reading while it is still being written to, which causes an implicit 
close that is not detected by the writer.


this is a nice test case :) thanx
ben

On 07/17/2010 05:02 PM, André Oriani wrote:

Hi,


I was not sure if I had understood the behavior of BookKeeper from the
documentation. So I made a little program, reproduced below, to see what
BookKeeper looks like in action. Assuming my code is correct (you never know
when your code has some nasty obvious bugs that only a person other than you
can see), I could draw the following conclusions:

1) Only the creator can add entries to a ledger, even though you can open
the ledger, get a handle and call addEntry on it. No exception is thrown.
In other words, you cannot open a ledger for append.

2) Readers are able to see only the entries that were added to a ledger
before someone opened it for reading. If you want to ensure readers will
see all the entries, you must add all entries before any reader attempts to
read from the ledger.

Could someone please tell me if those conclusions are correct or if I am
mistaken? In the latter case, could that person also tell me what is wrong?

Thanks a lot for the attention and the patience with this BookKeeper newbie,
André




package br.unicamp.zooexp.booexp;

import java.io.IOException;
import java.util.Enumeration;

import org.apache.bookkeeper.client.BKException;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.zookeeper.KeeperException;

public class BookTest {

    public static void main(String... args) throws IOException,
            InterruptedException, KeeperException, BKException {
        BookKeeper bk = new BookKeeper("127.0.0.1");
        LedgerHandle lh = bk.createLedger(DigestType.CRC32, "123".getBytes());
        long lh_id = lh.getId();
        lh.addEntry("Teste".getBytes());
        lh.addEntry("Test2".getBytes());
        System.out.printf("Got %d entries for lh\n", lh.getLastAddConfirmed() + 1);

        lh.addEntry("Test3".getBytes());
        LedgerHandle lh1 = bk.openLedger(lh_id, DigestType.CRC32, "123".getBytes());
        System.out.printf("Got %d entries for lh1\n", lh1.getLastAddConfirmed() + 1);
        lh.addEntry("Test4".getBytes());
        lh.addEntry("Test5".getBytes());
        lh.addEntry("Test6".getBytes());
        System.out.printf("Got %d entries for lh\n", lh.getLastAddConfirmed() + 1);

        Enumeration<LedgerEntry> seq = lh.readEntries(0, lh.getLastAddConfirmed());
        while (seq.hasMoreElements()) {
            System.out.println(new String(seq.nextElement().getEntry()));
        }
        lh.close();

        lh1.addEntry("Test7".getBytes());
        lh1.addEntry("Test8".getBytes());
        System.out.printf("Got %d entries for lh1\n", lh1.getLastAddConfirmed() + 1);

        seq = lh1.readEntries(0, lh1.getLastAddConfirmed());
        while (seq.hasMoreElements()) {
            System.out.println(new String(seq.nextElement().getEntry()));
        }
        lh1.close();

        LedgerHandle lh2 = bk.openLedger(lh_id, DigestType.CRC32, "123".getBytes());
        lh2.addEntry("Test9".getBytes());
        System.out.printf("Got %d entries for lh2\n", lh2.getLastAddConfirmed() + 1);

        seq = lh2.readEntries(0, lh2.getLastAddConfirmed());
        while (seq.hasMoreElements()) {
            System.out.println(new String(seq.nextElement().getEntry()));
        }

        bk.halt();
    }
}


Output:

Got 2 entries for lh
Got 3 entries for lh1
Got 6 entries for lh
Teste
Test2
Test3
Test4
Test5
Test6
Got 3 entries for lh1
Teste
Test2
Test3
Got 3 entries for lh2
Teste
Test2
Test3
   




RE: cleanup ZK takes 40-60 seconds

2010-07-16 Thread Benjamin Reed
how big is your database? it would be good to know the timing of the two calls. 
shutdown should take very little time.

sent from my droid

-Original Message-
From: Vishal K [vishalm...@gmail.com]
Received: 7/16/10 6:31 PM
To: zookeeper-user@hadoop.apache.org [zookeeper-u...@hadoop.apache.org]
Subject: cleanup ZK takes 40-60 seconds

Hi,

We have embedded ZK server in our application. We start a thread in our
application and call QuorumPeerMain.InitializeArgs().

When cleaning-up ZK we call QuorumPeerMain.shutdown() and wait for the
thread that is calling InitializeArgs() to finish. These two steps are
taking around 60 seconds. I could probably not wait for InitializeArgs() to
finish and that might speed up things.

However, I am not sure why the cleanup should take such a long time. Can
anyone comment on this?

Thanks.
-Vishal


Re: total # of zknodes

2010-07-15 Thread Benjamin Reed

i think there is a wiki page on this, but for the short answer:

the number of znodes impacts two things: memory footprint and recovery 
time. there is a base overhead to znodes to store its path, pointers to 
the data, pointers to the acl, etc. i believe that is around 100 bytes. 
you can't just divide your memory by 100+1K (for data) though, because 
the GC needs to be able to run and collect things and maintain free 
space. if you use 3/4 of your available memory, that would mean with 4G 
you can store about three million znodes. when there is a crash and you 
recover, servers may need to read this data back off the disk or over 
the network. that means it will take about a minute to read 3G from the 
disk and perhaps a bit more to read it over the network, so you will 
need to adjust your initLimit accordingly.


of course this is all back-of-the-envelope. i would suggest doing some 
quick benchmarks to test and make sure your results are in line with 
expectation.


ben
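The back-of-the-envelope numbers above can be written out explicitly (my own illustration; the ~100-byte overhead, ~1K data size, and 3/4-of-heap figures are the rough assumptions from the mail, not measured constants):

```java
// Capacity estimate from the figures above: usable heap is ~3/4 of the
// total (leaving GC headroom), and each znode costs its data size plus
// roughly 100 bytes of bookkeeping overhead.
public class ZnodeCapacity {
    static long maxZnodes(long heapBytes, long perZnodeBytes) {
        return (long) (heapBytes * 0.75) / perZnodeBytes;
    }

    public static void main(String[] args) {
        long heap = 4L * 1024 * 1024 * 1024;   // 4G heap
        long perZnode = 100 + 1024;            // ~100B overhead + 1K data
        System.out.println(maxZnodes(heap, perZnode)); // roughly three million
    }
}
```

Plugging in 4G and ~1.1K per znode lands close to the "about three million znodes" quoted above; the same arithmetic scales linearly for other heap sizes and data sizes.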

On 07/15/2010 02:56 AM, Maarten Koopmans wrote:

Hi,

I am mapping a filesystem to ZooKeeper, and use it for locking and mapping a 
filesystem namespace to a flat data object space (like S3). So assuming proper 
nesting and small ZooKeeper nodes (< 1KB), how many nodes could a cluster with 
a few GBs of memory per instance realistically hold in total?

Thanks, Maarten




Re: Achieving quorum with only half of the nodes

2010-07-14 Thread Benjamin Reed
by custom QuorumVerifier are you referring to 
http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperHierarchicalQuorums.html 
?


ben

On 07/14/2010 12:43 PM, Sergei Babovich wrote:

Hi,
We are currently evaluating use of ZK in our infrastructure. In our
setup we have a set of servers running from two different power feeds.
If one power feed goes away so does half of the servers. This makes
problematic to configure ZK ensemble that would tolerate such outage.
The network partitioning is not an issue in our case. The only solution
I come up with so far is to provide custom QuorumVerifier that will add
a little premium in case if all servers in the quorum set are from the
same group. Basically if we have only half of votes but all of them
belong to the same group then we decide to have a quorum.
Any ideas or better solutions are very appreciated. Sorry if this has
been already discussed/answered.

Regards,
Sergei

   




Re: Regarding Leader election and the limit on number of clients without performance degradation

2010-07-12 Thread Benjamin Reed
ted is correct, as usual. that warning is really to avoid unnecessary 
load, and 16 clients really don't generate much of a load at all. even 
with thousands of clients, if they really need the list of children it 
will still be ok. the point of that note was that for leader election 
only one process will emerge, so having a bunch of other processes 
making unneeded requests is wasteful and can be avoided.


ben

On 07/12/2010 01:47 PM, Ted Dunning wrote:

Having 16 clients all wake up and ping ZK is an extremely light load.  The
warning on the recipes page had more to do with the situation where
thousands of nodes wake up at the same time.

On Mon, Jul 12, 2010 at 1:30 PM, Srikanth Bondalapati
sbondalap...@tagged.com wrote:

   

Hi,

I am using the ZooKeeper service for leader election and group management. I
have read on the site (
http://hadoop.apache.org/zookeeper/docs/r3.2.2/recipes.html#sc_leaderElection
) under the LeaderElection section that, if all the clients try to access
getChildren() when trying to become a leader, it causes a bottleneck on the
server. But, I wanted to execute the getChildren() method on all the clients
that have seen a change on the parent's ZNode. So, could you please tell me
what could be the maximum number of clients that can be used without any
performance drop on the server, when all the clients try to execute the
getChildren() method? Currently, I intend to use a 16-client cluster, and
the data on each of the ZNodes is very small (say < 500 bytes).

Anxiously waiting for your reply,
Thanks & Regards,
Srikanth.

 




Re: running the systest

2010-07-09 Thread Benjamin Reed

can you try the following:

Index: src/contrib/fatjar/build.xml
===================================================================
--- src/contrib/fatjar/build.xml	(revision 962637)
+++ src/contrib/fatjar/build.xml	(working copy)
@@ -46,6 +46,7 @@
       <fileset dir="${zk.root}/build/classes" excludes="**/.generated"/>
       <fileset dir="${zk.root}/build/test/classes"/>
       <zipgroupfileset dir="${zk.root}/build/lib" includes="*.jar" />
+      <zipgroupfileset dir="${zk.root}/build/test/lib" includes="*.jar" />
       <zipgroupfileset dir="${zk.root}/src/java/lib" includes="*.jar" />
     </jar>
   </target>


thanx
ben

On 07/09/2010 09:04 AM, Stuart Halloway wrote:

Happy to do it. Should I change the fatjar build to add junit, or is there 
another way folks prefer to do it?

I am assuming that somebody is running the tests and has a local workaround in 
place. :-)

Stu

   

Hi Stuart,
The instructions are just out of date. If you could open a jira and post a
patch to it that would be great!

We should try getting this in 3.3.2! That would be useful!

Thanks
mahadev


On 7/9/10 6:36 AM, Stuart Halloway stuart.hallo...@gmail.com wrote:

 

Hi all,

I am trying to run the systest and have hit a few minor issues:

(1) The readme says src/contrib/jarjar, apparently should be
src/contrib/fatjar

(2) The compiled fatjar seems to be missing junit, so the launch instructions
do not work.

I can fix or workaround these, but I wanted to see if maybe the instructions
are just out of date, and there is an easy (but currently undocumented) way to
launch the tests.

Thanks,
Stu

   
 
   




Re: Suggested way to simulate client session expiration in unit tests?

2010-07-08 Thread Benjamin Reed
the difference between close and disconnect is that close will actually 
try to tell the server to kill the session before disconnecting.


a paranoid lock implementation doesn't need to test its session. it 
should just monitor watch events to look for disconnect and expired 
events. if a client is in the disconnected state, it cannot reliably 
know if the session is still active, so it should consider the lock in 
limbo until it gets either the reconnect event or the expired event.


ben
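The lock-in-limbo rule above can be sketched as a tiny state machine (my own illustration; the states and method names are hypothetical, driven by the same disconnect/reconnect/expired events a watcher would see):

```java
// Sketch of the paranoid-lock rule: the lock is only known-held while the
// session is connected. On disconnect it enters limbo (neither held nor
// lost) until the client sees either a reconnect or an expiration event;
// expiration is terminal.
public class LockState {
    enum Status { HELD, LIMBO, LOST }

    private Status status = Status.HELD;

    void onDisconnected() { if (status == Status.HELD)  status = Status.LIMBO; }
    void onReconnected()  { if (status == Status.LIMBO) status = Status.HELD; }
    void onExpired()      { status = Status.LOST; }

    /** Safe to act as the lock holder only when we positively hold it. */
    boolean mayActAsHolder() { return status == Status.HELD; }
}
```

Note that while in limbo the holder must neither act on the lock nor give it up; it simply waits for the session to resolve one way or the other, exactly as the mail describes.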

On 07/06/2010 05:42 PM, Jeremy Davis wrote:

Thanks!
That seems to work, but it is approximately the same as zooKeeper.close() in
that there is no SessionExpired event that comes up through the default
Watcher.
Maybe I'm assuming more from ZK than I should, but should a paranoid lock
implementation periodically test its session by reading or writing a value?

Regards,
-JD


On Tue, Jul 6, 2010 at 10:32 AM, Mahadev Konar maha...@yahoo-inc.com wrote:

   

Hi Jeremy,

  zk.disconnect() is the right way to disconnect from the servers. For
session expiration you just have to make sure that the client stays
disconnected for more than the session expiration interval.

Hope that helps.

Thanks
mahadev


On 7/6/10 9:09 AM, Jeremy Davis jerdavis.cassan...@gmail.com wrote:

 

Is there a recommended way of simulating a client session expiration in unit
tests?
I see a TestableZooKeeper.java, with a pauseCnxn() method that does cause
the connection to timeout/disconnect and reconnect. Is there an easy way to
push this all the way through to session expiration?
Thanks,
-JD
   


 




Re: Are Watchers execute sequentially or in parallel ?

2010-06-29 Thread Benjamin Reed
watchers are executed sequentially and in order. there is one dispatch 
thread that invokes the watch callbacks.


ben

ps - in 2) you do not install a watch.
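The single-dispatch-thread behavior Ben describes can be modeled with a single-thread executor (my own illustration of why callbacks never overlap, not ZooKeeper's actual event-thread code; all names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: one dispatch thread delivers events one at a time, in submission
// order. A slow callback delays the next event but is never run
// concurrently with it -- mirroring the guarantee for watcher callbacks.
public class SerialDispatcher {
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();
    private final List<String> delivered = new ArrayList<>();

    void dispatch(String event) {
        // Every callback runs on the same single thread.
        dispatcher.submit(() -> delivered.add(event));
    }

    /** Shut down and return the events in the order they were delivered. */
    List<String> drain() {
        dispatcher.shutdown();
        try {
            dispatcher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return delivered;
    }
}
```

Because delivery is serialized this way, version 1) of the code in the question cannot be re-entered while it is still processing: the next children-changed event simply queues behind the running callback.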

On 06/29/2010 06:13 AM, André Oriani wrote:

Hi,

Are Watchers executed sequentially  or in parallel ? Suppose I want to
monitor the children of a znode for any modification.  I don't want the same
watcher to be re-executed while it is still executing.



1)

public class ChildrenWatcher implements Watcher {

  public void process(WatchedEvent event) {
    // get children and install watcher
    List<String> children = zk.getChildren(path, this);
    // process children
  }
}

2)

public class ChildrenWatcher implements Watcher {

  public void process(WatchedEvent event) {
    // get children
    List<String> children = zk.getChildren(path, null);
    // process children
    // install watcher
    zk.getChildren(path, null);
  }
}

Do both snippets achieve the goal, or just snippet number 2?


Tks,
André
   




Re: integration tests

2010-06-23 Thread Benjamin Reed
we do this in our tests for ZooKeeper. bookkeeper uses the testing 
classes as well; unfortunately, we haven't documented the interface.


ben

On 06/22/2010 08:42 PM, Ishaaq Chandy wrote:

Hi all,
First some background:

1. We use maven as our build tool.
2. We use Hudson as our CI server, it is setup to delegate build work
to a cluster of build-slave VMs.
3. We'd like to do very little (preferably none at all) admin work on
each build-slave VM to get it up and running builds. This is so we can
grow the build cluster quickly on demand.

To this end we'd like our tests to be able to run without requiring
external dependencies, i.e. no requirement that there be a running
database or some such. We use Cassandra for data storage and have been
able to quite successfully write a test extension that configures and
starts up an embedded Cassandra instance before running tests that
rely on Cassandra.

Now, I'd like to do the same for ZooKeeper. Has anyone tackled and
solved this problem before?

Thanks,
Ishaaq
   




Re: is ZK client thread safe

2010-06-21 Thread Benjamin Reed

yes. (except for the single threaded C-client library :)

ben

On 06/17/2010 10:16 AM, Jun Rao wrote:

Hi,

Is ZK client thread safe? Is it ok for multiple threads sharing the same ZK
client? Thanks,

Jun
   




Re: Completions in C API

2010-06-03 Thread Benjamin Reed
the call is executed at a later time on a different thread. the zoo_a* 
calls are non-blocking, so (subject to the thread scheduling) usually 
they will return before the request completes.


ben

On 06/03/2010 01:24 PM, Jack Orenstein wrote:

I'm trying to figure out how to use zookeeper's C API. In particular, what can I assume 
about the execution of the completions passed to zoo_aget and zoo_aset? The documentation 
for zoo_aset says the completion is "the routine to invoke when the request 
completes". Does this mean that the completion is called some arbitrary time after 
the request completes, on a different thread? Or is it guaranteed to be executed on the 
same thread as the thread initiating the zoo_aset call, and complete before zoo_aset 
returns?  Or something else?

In general, the 3.2.2 docs seem to be pretty thin on the C API. Pointers to 
other relevant material would be appreciated.

Jack
   




Re: zookeeper crash

2010-06-02 Thread Benjamin Reed
charity, do you mind going through your scenario again to give a 
timeline for the failure? i'm a bit confused as to what happened.


ben

On 06/02/2010 01:32 PM, Charity Majors wrote:

Thanks.  That worked for me.  I'm a little confused about why it threw the 
entire cluster into an unusable state, though.

I said before that we restarted all three nodes, but tracing back, we actually 
didn't.  The zookeeper cluster was refusing all connections until we restarted 
node one.  But once node one had been dropped from the cluster, the other two 
nodes formed a quorum and started responding to queries on their own.

Is that expected as well?  I didn't see it in ZOOKEEPER-335, so thought I'd 
mention it.



On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:

   

Hi Charity, unfortunately this is a known issue not specific to 3.3 that
we are working to address. See this thread for some background:

http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html

I've raised the JIRA level to blocker to ensure we address this asap.

As Ted suggested you can remove the datadir -- only on the affected
server -- and then restart it. That should resolve the issue (the server
will d/l a snapshot of the current db from the leader).

Patrick

On 06/02/2010 11:11 AM, Charity Majors wrote:
 

I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to 
get away from a client bug that was crashing my backend services.

Unfortunately, this morning I had a server crash, and it brought down my entire 
cluster.  I don't have the logs leading up to the crash, because -- 
argghffbuggle -- log4j wasn't set up correctly.  But I restarted all three 
nodes, and nodes two and three came back up and formed a quorum.

Node one, meanwhile, does this:

2010-06-02 17:04:56,446 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] 
- Reading snapshot 
/services/zookeeper/data/zookeeper/version-2/snapshot.a0045
2010-06-02 17:04:56,476 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id 
=  1, Proposed zxid = 47244640287
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 
47244640287, 4, 1, LOOKING, LOOKING, 1
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 
38654707048, 3, 1, LOOKING, LEADING, 3
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 
38654707048, 3, 1, LOOKING, FOLLOWING, 2
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with 
tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir 
/services/zookeeper/data/zookeeper/version-2 snapdir 
/services/zookeeper/data/zookeeper/version-2
2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] 
- Leader epoch a is less than our epoch b
2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] 
- Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower
at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] 
- shutdown called
java.lang.Exception: shutdown Follower
at 
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)



All I can find is this, 
http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html, 
which implies that this state should never happen.

Any suggestions?  If it happens again, I'll just have to roll everything back 
to 3.2.1 and live with the client crashes.




   
   




Re: Securing ZooKeeper connections

2010-05-27 Thread Benjamin Reed
actually pat hunt took over that issue: ZOOKEEPER-733. pat has made a 
lot of progress and the patch looks close to being ready.


ben

ps - actually, to be clear, the patch adds netty support. the idea is 
that once we have netty in, and since netty supports SSL quite transparently, 
it should be easy to get SSL in.


On 05/26/2010 04:44 PM, Mahadev Konar wrote:

Hi Vishal,
   Ben (Benjamin Reed) has been working on a netty based client server
protocol in ZooKeeper. I think there is an open jira for it. My network
connection is pretty slow so am finding it hard to search for it.

We have been thinking abt enabling secure connections via this netty based
connections in zookeeper.

Thanks
mahadev


On 5/25/10 12:20 PM, Vishal Kvishalm...@gmail.com  wrote:

   

Hi All,

Since ZooKeeper does not support secure network connections yet, I thought I
would poll and see what people are doing to address this problem. Is anyone
running ZooKeeper over secure channels (client-server and server-server
authentication/encryption)? If yes, can you please elaborate how you do it?

Thanks.

Regards,
-Vishal
 
   




Re: problem connecting to zookeeper server

2010-05-20 Thread Benjamin Reed
good catch lei! gregory, if this helps, can you open a jira to throw an 
exception in this situation? we should be throwing an invalid argument 
exception or something in this case.


thanx
ben

On 05/20/2010 09:04 AM, Lei Zhang wrote:

Seems you are passing in wrong arguments:

Should have been:
 public ZooKeeper(String connectString, int sessionTimeout, Watcher
watcher)
 throws IOException

What you have in your client code is:

On Thu, May 20, 2010 at 5:21 AM, Gregory Haskins
gregory.hask...@gmail.comwrote:

   


public App() throws Exception {
zk = new ZooKeeper("192.168.1.124:2181", 0, this);
}


 

Try using a sensible timeout value such as 2. The error you are getting
means the server has timed out the session.

Hope this unstucks you.
   




Re: Xid out of order. Got 8 expected 7

2010-05-12 Thread Benjamin Reed

is this a bug? shouldn't we be returning an error?

ben

On 05/12/2010 11:34 AM, Patrick Hunt wrote:

I think that explains it then - the server is probably dropping the new
(3.3.0) getChildren message (xid 7) as it (3.2.2 server) doesn't know
about that message type. Then the server responds to the client for a
subsequent operation (xid 8), and at that point the client notices that
getChildren (xid 7) got lost.

Patrick

On 05/12/2010 11:30 AM, Jordan Zimmerman wrote:
   

Oh, OK. When I get a moment I'll restart the 3.2.2 and post logs,
etc.

Yes, we're calling getChildren with the callback.

-JZ

On May 12, 2010, at 11:28 AM, Patrick Hunt wrote:

 

I'm still interested though... Are you using the new getChildren
api that was added to the client in 3.3.0? (it provides a Stat
object on return, the old getChildren did not). While we don't
officially support 3.3.0 client with 3.2.2 server (we do support
the other way around), there shouldn't be the type of problem with
this configuration as you describe. I'd still be interested for you
to create that jira.

Regards,

Patrick

On 05/12/2010 11:23 AM, Jordan Zimmerman wrote:
   

Apologies...

I thought I was running 3.3.0 server, but was running 3.2.2
server with 3.3.0 client. I upgraded the server and now all works
again. Sorry to trouble y'all.

-Jordan

On May 12, 2010, at 11:11 AM, Patrick Hunt wrote:

 

Hi Jordan, you've seen this once or frequently? (having the
server + client logs will help alot)

Patrick

On 05/12/2010 11:08 AM, Jordan Zimmerman wrote:
   

Sure - if you think it's a bug.

We were using Zookeeper without issue. I then refactored a
bunch of code and this new behavior started. I'm starting ZK
using zkServer start and haven't made any changes to the
code at all.

I'll get the logs together and post a JIRA.

-JZ

On May 12, 2010, at 10:59 AM, Mahadev Konar wrote:

 

Hi Jordan, Can you create a jira for this? And attach all
the server logs and client logs related to this timeline?
How did you start up the servers? Is there some changes you
might have made accidentally to the servers?


Thanks mahadev


On 5/12/10 10:49 AM, Jordan
Zimmermanjzimmer...@proofpoint.com   wrote:

   

We've just started seeing an odd error and are having
trouble determining the cause: "Xid out of order. Got 8
expected 7". Any hints on what can cause this? Any ideas
on how to debug?

We're using ZK 3.3.0. The error occurs in
ClientCnxn.java line 781

-Jordan
 
   
 
 
 




Re: How to ensure transaction create-and-update

2010-03-30 Thread Benjamin Reed
i agree with ted. i think he points out some disadvantages with trying 
to do more. there is a slippery slope with these kinds of things. the 
implementation is complicated enough even with the simple model that we use.


ben

On 03/29/2010 08:34 PM, Ted Dunning wrote:

I perhaps should not have said power, except insofar as ZK's strengths are
in reliability which derives from simplicity.

There are essentially two common ways to implement multi-node update.  The
first is the traditional db style with begin-transaction paired with either a
commit or a rollback after some number of updates.  This is clearly
unacceptable in the ZK world if the updates are sent to the server because
there can be an indefinite delay between the begin and commit.

A second approach is to buffer all of the updates on the client side and
transmit them in a batch to the server to succeed or fail as a group.  This
allows updates to be arbitrarily complex which begins to eat away at the
no-blocking guarantee a bit.

On Mon, Mar 29, 2010 at 8:08 PM, Henry Robinsonhe...@cloudera.com  wrote:

   

Could you say a bit about how you feel ZK would sacrifice power and
reliability through multi-node updates? My view is that it wouldn't: since
all operations are executed serially, there's no concurrency to be lost by
allowing multi-updates, and there doesn't need to be a 'start / end'
transactional style interface (which I do believe would be very bad).

I could see ZK implement a Sinfonia-style batch operation API which makes
all-or-none updates. The reason I can see that it doesn't already allow
this
is the avowed intent of the original ZK team to keep the API as simple as
it
can reasonably be, and to not introduce complexity without need.

 




Re: Solicitation for logging/debugging requirements

2010-03-29 Thread Benjamin Reed
awesome! that would be great ivan. i'm sure pat has some more concrete 
suggestions, but one simple thing to do is to run the unit tests and 
look at the log messages that get output. there are a couple of 
categories of things that need to be fixed (this is in no way exhaustive):


1) messages that have useful information, but only if you look in the 
code to figure out what it means. there are some leader election 
messages that fall into this category. it would be nice to clarify them.
2) there are error messages that really aren't errors. when shutting 
down there are a bunch of errors that are expected, but still logged, 
for example.

3) misclassified error levels

welcome aboard!

ben

On 03/29/2010 10:07 AM, Ivan Kelly wrote:

Hi,

I'm going to be using Zookeeper quite extensively for a project in a
few weeks, but development hasn't kicked off yet. This means I have
some time on my hands and I'd like to get familiar with zookeeper
beforehand by perhaps writing some tools to make debugging problems
with it easier so as to save myself some time in the future. Problem
is I haven't had to debug many zookeeper problems yet, so I don't know
where the pain points are.

So, without further ado,
- Are there any places that logging is deficient that sorely needs
improvement?
- Could current logs be improved any amount or presented in a more
readable fashion?
- Would some form of log visualisation be useful (for example in
something approximating a sequence diagram)?

Feel free to suggest anything which the list above doesn't allude to
which you think would be helpful.

Cheers,
Ivan

   




Re: cluster fails to start - broken snapshot?

2010-03-18 Thread Benjamin Reed
we have updated ZOOKEEPER-713 with much more detail, but the bottom line 
is that the "Invalid snapshot" was caused by an OutOfMemoryError. this 
turns out not to be a problem since we recover using an older snapshot. 
there are other things that are happening that are the real causes of 
the problem. see the jira for details.


thanx
ben

On 03/18/2010 09:16 AM, Łukasz Osipiuk wrote:

Hi guys,

Today we experienced another problem with our zookeeper installation.
Due to the large attachments I created a jira issue for it, even though it
is rather a question than a bug report.

https://issues.apache.org/jira/browse/ZOOKEEPER-713

Description below:

Today we had a major failure in our production environment. Machines in
the zookeeper cluster went wild and all clients got disconnected.
We tried to restart the whole zookeeper cluster but the cluster got stuck in
the leader election phase.

Calling the stat command on any machine in the cluster resulted in a
'ZooKeeperServer not running' message.
In one of the logs I noticed an 'Invalid snapshot' message, which disturbed me a bit.

We did not manage to make the cluster work again with the data. We deleted all
the version-2 directories on all nodes and then the cluster started up without
problems.
Is it possible that the snapshot/log data got corrupted in a way which
made the cluster unable to start?
Fortunately we could rebuild the data we store in zookeeper, as we use it
only for locks and most of the nodes are ephemeral.

I am attaching contents of version-2 directory from all nodes and server logs.
Source problem occurred some time before 15. First cluster restart
happened at 15:03.
At some point later we experimented with deleting the version-2 directory,
so I would not look at the following restarts because they can be
misleading due to our actions.

I am also attaching zoo.cfg. Maybe something is wrong at this place.
As I now look into the logs I see a read timeout during the initialization
phase after 20 secs (initLimit=10, tickTime=2000).
Maybe all I have to do is increase one or the other. Which one? Are there
any downsides to increasing tickTime?

Best regards, Łukasz Osipiuk

PS. due to attachment size limit I used split. to untar use
cat nodeX-version-2.tgz-* |tar -xz

   




Re: syncLimit explanation needed?

2010-03-18 Thread Benjamin Reed
yes it means in sync with the leader. syncLimit governs the timeout when 
a follower is actively following a leader. initLimit is the initial 
connection timeout. because there is the potential for more data that 
needs to be transmitted during the initial connection, we want to be 
able to manage the two timeouts differently.
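Both limits are expressed in ticks in zoo.cfg, so the wall-clock timeout is the limit multiplied by tickTime. For example (illustrative values, not recommendations):

```
tickTime=2000
# initial connection and state transfer: 10 ticks = 20 s
initLimit=10
# steady-state leader/follower liveness: 5 ticks = 10 s
syncLimit=5
```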


ben

On 03/18/2010 11:48 AM, César Álvarez Núñez wrote:

Hi all,

I would like to get a better understanding of the syncLimit configuration
property.

According to the Administration Guide: "Amount of time, in ticks (see
tickTime), to allow followers to sync with ZooKeeper. If followers fall too far
behind a leader, they will be dropped."

...to sync with ZooKeeper means ...to sync with Leader? In this case,
which is the difference with initLimit?

BR,
/César.
   




Re: permanent ZSESSIONMOVED

2010-03-16 Thread Benjamin Reed
do you ever use zookeeper_init() with the clientid field set to 
something other than null?


ben

On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:

Hi everyone!

I am writing to this group because recently we are getting some
strange errors with our production zookeeper setup.

 From time to time we are observing that our client application (C++
based) disconnects from zookeeper (session state is changed to 1) and
reconnects (state changed to 3).
This itself is not a problem - usually the application continues to run
without problems after a reconnect.
But from time to time, after the above happens, all subsequent operations
start to return the ZSESSIONMOVED error. To make it work again we have to
restart the application (which creates a new zookeeper session).

I noticed that 3.2.0 introduced a bug
http://issues.apache.org/jira/browse/ZOOKEEPER-449 but we are using
zookeeper v. 3.2.2.
I just noticed that app at compile time used 3.2.0 library but patches
fixing bug 449 did not touch C client lib so I believe that our
problems are not
related with that.

In the zookeeper logs, at the moment which initiated the problem with the
client application, I have:

node1:
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to
/10.1.112.61:37197 lastZxid 42992576502
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session
0x324dcc1ba580085
2010-03-16 14:21:49,443 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init
of 0x324dcc1ba580085 valid:true
2010-03-16 14:21:49,443 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 14:21:49,444 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@857] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181
remote=/10.1.112.61:37197]

node2:
2010-03-16 14:21:40,580 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 14:21:40,581 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@833] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181
remote=/10.1.112.61:60693]
2010-03-16 14:21:46,839 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to
/10.1.112.61:48336 lastZxid 42992576502
2010-03-16 14:21:46,839 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session
0x324dcc1ba580085
2010-03-16 14:21:49,439 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init
of 0x324dcc1ba580085 valid:true

node3:
2010-03-16 02:14:48,961 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 02:14:48,962 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@833] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181
remote=/10.1.112.61:57309]

and then lots of entries like this
2010-03-16 02:14:54,696 - WARN
[ProcessThread:-1:preprequestproces...@402] - Got exception when
processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49
zxid:0xfffe txntype:unknown
/locks/9871253/lock-8589943989-
org.apache.zookeeper.KeeperException$SessionMovedException:
KeeperErrorCode = Session moved
 at 
org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
 at 
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
 at 
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
2010-03-16 14:22:06,428 - WARN
[ProcessThread:-1:preprequestproces...@402] - Got exception when
processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603
zxid:0xfffe txntype:unknown
/locks/1665960/lock-8589961006-
org.apache.zookeeper.KeeperException$SessionMovedException:
KeeperErrorCode = Session moved
 at 
org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
 at 
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
 at 
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)


To work around the disconnections I am going to increase the session timeout
from 5 to 15 seconds, but even if it helps at all it is just a
workaround.

Do you have an idea where the source of my problem is?

Regards, Łukasz Osipiuk



   




Re: permanent ZSESSIONMOVED

2010-03-16 Thread Benjamin Reed
weird, this does sound like a bug. do you have a reliable way of 
reproducing the problem?


thanx
ben

On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:

nope.

I always pass 0 as clientid.

Łukasz

On Tue, Mar 16, 2010 at 16:20, Benjamin Reedbr...@yahoo-inc.com  wrote:
   

do you ever use zookeeper_init() with the clientid field set to something
other than null?

ben

On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:
 

Hi everyone!

I am writing to this group because recently we are getting some
strange errors with our production zookeeper setup.

[...]

Regards, Łukasz Osipiuk




   


 



   




Re: Managing multi-site clusters with Zookeeper

2010-03-15 Thread Benjamin Reed
it is a bit confusing but initLimit is the timer that is used when a 
follower connects to a leader. there may be some state transfers 
involved to bring the follower up to speed so we need to be able to 
allow a little extra time for the initial connection.


after that we use syncLimit to figure out if a leader or follower is 
dead. a peer (leader or follower) is considered dead if syncLimit ticks 
goes by without hearing from the other machine. (this is after the 
initial connection has been made.)


please open a jira to make the text a bit more explicit. feel free to 
add suggestions :)


thanx
ben

On 03/15/2010 04:17 AM, Michael Bauland wrote:

Hi Patrick,

I'm also setting up a Zookeeper ensemble across three different
locations and I've got some questions regarding the parameters as
specified on the page you mentioned:

   

That's controlled by the tickTime/synclimit/initlimit/etc.. see more
about this in the admin guide: http://bit.ly/c726DC
 

- What's the difference between initLimit and syncLimit? For initLimit
it says this is the time to allow followers to connect and sync to a
leader, and syncLimit is the time to allow followers to sync with
ZooKeeper. To me this sounds very similar, since Zookeeper in the
second definition probably means the Zookeeper leader, doesn't it?

- When I connect with a client to the Zookeeper ensemble I provide the
three IP addresses of my three Zookeeper servers. Does the client then
choose one of them arbitrarily or will it always try to connect to the
first one first? I'm asking since I would like to have my clients first
try to connect to the local Zookeeper server and only if that fails (for
whatever reason, maybe it's down) it should try to connect to one of the
servers on a different physical location.


   

You'll want to increase from the defaults since those are typically for
high performance interconnect (ie within colo). You are correct though,
much will depend on your env. and some tuning will be involved.
 

Do you have any suggestions for the parameters? So far I left tickTime
at 2 sec and increased initLimit and syncLimit to 30 (i.e., one minute).

Our sites are connected with 1Gbit to the Internet, but of course we
have no influence on what's in between. The data managed by zookeeper is
quite large (snapshots are 700 MByte, but they may increase in the future).

Thanks for your help,

Michael


   




Re: Znode ACL watcher?

2010-02-22 Thread Benjamin Reed
no, you cannot watch for ACL changes. it is one of the 
API/implementation simplifications we did since we didn't have a good 
use case for it.


it does seem a little bit weird. we are following file system semantics 
here. i guess for ultimate security only clients with admin permission 
would be able to see an ACL.


ben

On 02/22/2010 08:00 AM, Mark Masse wrote:

Hi,

Does anyone know if there's a way to get a Watcher notification when a
znode's ACL changes?

I also wanted to ask if it seems weird that you can read a znode's ACL
even if you don't have permissions to read the data.

Thanks,

  --
Mark Masse
http://www.massedynamic.org
   




Re: Ordering guarantees for async callbacks vs watchers

2010-02-11 Thread Benjamin Reed
just to expand on mahadev's answer a little bit: the basic guarantee is 
that you will see the watch event before you see the change. so let's 
say you call getChildren("/foo", w, acb, ctx) twice and while you do 
that another client creates a child of /foo. there are three scenarios:


1) the create happens before the first call to getChildren: in this case 
there is no watch event because the first call to getChildren will list 
the new child.
2) the create happens after the first call to getChildren and before the 
second call: in this case the watch event callback will happen at the 
client before acb is invoked for the result of the second getChildren 
call. in other words, the first callback to acb for the result of the 
first getChildren call will not list the newly created child, then you 
will get a callback on w to say that the list of children of /foo has 
changed, then you will get second callback on acb for the result of the 
second call that will list the newly created child.
3) the create happens after the second call to getChildren: in this case 
acb will be invoked once for each invocation of getChildren and both 
times acb will have the same list of children, then you will get a 
callback on w to say that the list of children of /foo has changed.
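The guarantee in scenario 2 follows from the client delivering completions and watch events from a single queue, in the order the server sent them. A toy Python model of that dispatch (not the real client internals):

```python
import queue

# Toy single-threaded event loop modeling the client's callback thread:
# completions and watch events are delivered strictly in server order,
# one at a time.
dispatch = queue.Queue()

# Scenario 2 above: the server processed getChildren #1, then the
# create (firing the watch), then getChildren #2.
dispatch.put(("completion", 1, []))          # first acb: no child yet
dispatch.put(("watch", "/foo children changed"))
dispatch.put(("completion", 2, ["child"]))   # second acb: new child visible

delivered = []
while not dispatch.empty():
    delivered.append(dispatch.get())

# The watch event is seen before the result that reflects the change.
assert [d[0] for d in delivered] == ["completion", "watch", "completion"]
assert delivered[2][2] == ["child"]
```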


ben

Mahadev Konar wrote:

Hi Martin,
 a call like getChildren(final String path, Watcher watcher,
ChildrenCallback cb, Object ctx)

means: set a watch on this node for any further changes on the server. A
client will see the response to the getChildren data before the above watch is
fired.


Hope that helps.

Thanks
mahadev


On 2/10/10 6:59 PM, Martin Traverso mtrave...@gmail.com wrote:

  

What are the ordering guarantees for asynchronous callbacks vs watcher
notifications (Java API) when both are used in the same call? E.g.,
for getChildren(final String path, Watcher watcher, ChildrenCallback cb,
Object ctx)

Will the callback always be invoked before the watcher if there is a state
change on the server at about the same time the call is made?

I *think* that's what's implied by the documentation, but I'm not sure I'm
reading it right:

All completions for asynchronous calls and watcher callbacks will be made
in order, one at a time. The caller can do any processing they wish, but no
other callbacks will be processed during that time. (
http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperProgrammers.html#Java+
Binding
)

Thanks!

Martin



  




RE: When session expired event fired?

2010-02-08 Thread Benjamin Reed
i was looking through the docs to see if we talk about handling session 
expired, but i couldn't find anything. we should probably open a jira to add to 
the docs, unless i missed something. did i?

ben

-Original Message-
From: Mahadev Konar [mailto:maha...@yahoo-inc.com] 
Sent: Monday, February 08, 2010 2:43 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: When session expired event fired?

Hi,
 a zookeeper client does not expire a session until and unless it is able to
connect to one of the servers. In your case if you kill all the servers, the
client is not able to connect to any of the servers and will keep trying to
connect to the three servers. It cannot expire a session on its own and
needs to hear from the server to know if the session is expired or not.

Does that help? 

Thanks
mahadev


On 2/8/10 2:37 PM, neptune opennept...@gmail.com wrote:

 Hi all.
 I have a question. I started zookeeper(3.2.2) on three servers.
 When is the session expired event fired in the following code?
 I expected that if the client can't connect to a server (disconnected) for the
 session timeout, zookeeper fires a session expired event.
 I killed the three zookeeper servers sequentially. The client retries connecting
 to the zookeeper servers. The Expired event never occurred.
 
 class WatcherTest implements Watcher {
   private ZooKeeper zk;

   public static void main(String[] args) throws Exception {
     (new WatcherTest()).exec();
   }

   private WatcherTest() throws Exception {
     zk = new ZooKeeper("server1:2181,server2:2181,server3:2181", 10 * 1000,
         this);
   }

   private void exec() {
     while (true) {
       // do something
     }
   }

   public void process(WatchedEvent event) {
     if (event.getType() == Event.EventType.None) {
       switch (event.getState()) {
       case SyncConnected:
         System.out.println("ZK SyncConnected");
         break;
       case Disconnected:
         System.out.println("ZK Disconnected");
         break;
       case Expired:
         System.out.println("ZK Session Expired");
         System.exit(0);
         break;
       }
     }
   }
 }



Re: how to handle re-add watch fails

2010-02-01 Thread Benjamin Reed
sadly connectionloss is the really ugly part of zookeeper! it is a pain 
to deal with. i'm not sure we have a best practice, but i can tell you 
what i do :) ZOOKEEPER-22 is meant to alleviate this problem.


i usually use the async API when handling the watch callback. in the 
completion function, if there is a connection loss, i issue another async 
getChildren to retry. this avoids blocking the caller, unlike the 
synchronous retry that eric alluded to, but the behavior is effectively 
the same: you retry the request.


you don't need to worry about multiple watches being added colin. 
zookeeper keeps track of which watchers have registered which watches 
and will not register duplicate watches for the same watcher. (hopefully 
you can parse that :)


ben

Colin Goodheart-Smithe wrote:

We are having similar problems to this.  At the moment we wrap ZooKeeper
in a class which retries requests on KeeperException.ConnectionLoss to
avoid no watcher being added, but we are worried that this may result in
multiple watchers being added if the watcher is successfully added but
the server returns a Connection Loss

Colin


-Original Message-
From: Eric Bowman [mailto:ebow...@boboco.ie] 
Sent: 01 February 2010 10:22

To: zookeeper-user@hadoop.apache.org
Subject: Re: how to handle re-add watch fails

I was surprised to not get a response to this ... is this a no-brainer? 
Too hard to solve?  Did I not express it clearly?  Am I doing something
dumb? :)

Thanks,
Eric

On 01/25/2010 01:05 PM, Eric Bowman wrote:
  

I'm curious, what is the best practice for how to handle the case
where re-adding a watch inside a Watcher.process callback fails?

I keep stumbling upon the same kind of thing, and the possibility of
race conditions or undefined behavior keep troubling me.  Maybe I'm
missing something.

Suppose I have a callback like:

public void process( WatchedEvent watchedEvent )
{
    if ( watchedEvent.getType() == Event.EventType.NodeChildrenChanged ) {
        try {
            ... do stuff ...
        }
        catch ( Throwable e ) {
            log.error( "Could not do stuff!", e );
        }
        try {
            zooKeeperManager.watchChildren( zPath, this );
        }
        catch ( InterruptedException e ) {
            log.info( "Interrupted adding watch -- shutting down?" );
            return;
        }
        catch ( KeeperException e ) {
            // oh crap, now what?
        }
    }
}

(In this case, watchChildren is just calling getChildren and passing
the watcher in.)

It occurs to me I could get more and more complicated here:  I could
wrap watchChildren in a while loop until it succeeds, but that seems
kind of rude to the caller.  Plus what if I get a
KeeperException.SessionExpiredException or a
KeeperException.ConnectionLossException?  How to handle that in this
loop?  Or I could send some other thread a message that it needs to keep
trying until the watch has been re-added ... but ... yuck.

I would very much like to just setup this watch once, and have ZooKeeper
make sure it keeps firing until I tear down ZooKeeper -- this logic
seems tricky for clients, and quite error prone and full of race
conditions.
  

Any thoughts?

Thanks,
Eric

  




  




Re: Q about ZK internal: how commit is being remembered

2010-01-28 Thread Benjamin Reed
henry is correct. just to state another way, Zab guarantees that if a 
quorum of servers have accepted a transaction, the transaction will 
commit. this means that if less than a quorum of servers have accepted a 
transaction, we can commit or discard. the only constraint we have in 
choosing is ordering. we have to decide which partially accepted 
transactions are going to be committed and which discarded before we 
propose any new messages so that ordering is preserved.
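
ben's invariant reduces to a simple majority test; a minimal restatement as code (the class and method names are mine, not from the ZooKeeper sources):

```java
// Sketch of the Zab commit invariant: a transaction ACKed by a majority of
// the ensemble MUST eventually commit; one ACKed by fewer servers may be
// either committed or discarded by the next leader, subject only to ordering.
public class ZabQuorum {
    static int quorum(int ensembleSize) {
        return ensembleSize / 2 + 1;        // majority, e.g. 3 of 5
    }

    static boolean mustCommit(int acks, int ensembleSize) {
        return acks >= quorum(ensembleSize);
    }

    public static void main(String[] args) {
        // 5-server ensemble: 3 ACKs force a commit, 2 leave the choice open.
        System.out.println(mustCommit(3, 5)); // true
        System.out.println(mustCommit(2, 5)); // false
    }
}
```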


ben

Henry Robinson wrote:

Hi -

Note that a machine that has the highest received zxid will necessarily have
seen the most recent transaction that was logged by a quorum of followers
(the FIFO property of TCP again ensures that all previous messages will have
been seen). This is the property that ZAB needs to preserve. The idea is to
avoid missing a commit that went to a node that has since failed.

I was therefore slightly imprecise in my previous mail - it's possible for
only partially-proposed proposals to be committed if the leader that is
elected next has seen them. Only when another proposal is committed instead
must the original proposal be discarded.

I highly recommend Ben Reed's and Flavio Junqueira's LADIS paper on the
subject, for those with portal.acm.org access:
http://portal.acm.org/citation.cfm?id=1529978

Henry

On 27 January 2010 21:52, Qian Ye yeqian@gmail.com wrote:

  

Hi Henry:

According to your explanation, *ZAB makes the guarantee that a proposal
which has been logged by a quorum of followers will eventually be
committed*. However, the source code of Zookeeper, the
FastLeaderElection.java file, shows that in the election the candidates
only provide their zxid in the votes, and the one with the max zxid wins
the election. I mean, it seems that no check is made to ensure that the
latest proposal has been logged by a quorum of servers.

In this situation, zookeeper could deliver a proposal which the client
knows as a failed one. Imagine this scenario: a zookeeper cluster with
5 servers, where the Leader receives only 1 ack for proposal A. After a
timeout, the client is told that the proposal failed. At that moment all
servers restart due to a power failure. The server that has proposal A
in its log would become the leader -- yet the client was told that
proposal A failed.

Do I misunderstand this?


On Wed, Jan 27, 2010 at 10:37 AM, Henry Robinson he...@cloudera.com
wrote:



Qing -

That part of the documentation is slightly confusing. The elected leader
must have the highest zxid that has been written to disk by a quorum of
followers. ZAB makes the guarantee that a proposal which has been logged
by a quorum of followers will eventually be committed. Conversely, any
proposals that *don't* get logged by a quorum before the leader sending
them dies will not be committed. One of the ZAB papers covers both these
situations - making sure proposals are committed or skipped at the right
moments.

So you get the neat property that leader election can be live in exactly
the case where the ZK cluster is live. If a quorum of peers aren't
available to elect the leader, the resulting cluster won't be live
anyhow, so it's ok for leader election to fail.

FLP impossibility isn't actually strictly relevant for ZAB, because FLP
requires that message reordering is possible (see all the stuff in that
paper about non-deterministically drawing messages from a potentially
deliverable set). TCP FIFO channels don't reorder, so provide the extra
signalling that ZAB requires.

cheers,
Henry

2010/1/26 Qing Yan qing...@gmail.com

  

Hi,

I have question about how zookeeper *remembers* a commit operation.

According to




http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperInternals.html#sc_summary


quote

The leader will issue a COMMIT to all followers as soon as a quorum of
followers have ACKed a message. Since messages are ACKed in order, COMMITs
will be sent by the leader as received by the followers in order.

COMMITs are processed in order. Followers deliver a proposal's message when
that proposal is committed.
/quote

My question is: will the leader wait for the COMMIT to be processed by a
quorum of followers before considering the COMMIT a success? From the
documentation it seems that the leader handles COMMIT asynchronously and
doesn't expect confirmation from the followers. In the extreme case, what
happens if the leader issues a COMMIT to all followers and crashes
immediately, before the COMMIT message can go out on the network? How does
the system remember that the COMMIT ever happened?

Actually this is related to the leader election process:

quote
ZooKeeper messaging doesn't care about the exact method of electing a
leader as long as the following holds:

  - The leader has seen the highest zxid of all the followers.
  - A quorum of servers have committed to following the leader.
/quote

 Of these two 

Re: Dependency on JBoss JMX

2010-01-28 Thread Benjamin Reed
there aren't any dependencies on jboss. can you clarify the dependency 
that you are seeing?


thanx
ben

Gustavo Niemeyer wrote:

Hello there,

Is the dependency on JBoss a hard one, or is there a way to not use
it?  Perhaps an alternative package providing the same interface?

I'm trying to get it included in Ubuntu and being asked about this.

Thanks in advance,

  




Re: ZAB kick Paxos butt?

2010-01-20 Thread Benjamin Reed

hi Qing,

i'm glad you like the page and Zab.

yes, we are very familiar with Paxos. that page is meant to show a 
weakness of Paxos and a design point for Zab. it is not to say Paxos is 
not useful. Paxos is used in the real world in production systems. 
sometimes there are no order dependencies between messages, so Paxos is 
fine.


in cases where order is important, multiple messages are batched into a 
single operation and only one operation is outstanding at a time. (i 
believe that this is what Chubby does, for example.) this is the 
solution you allude to: wait for 27 to commit before 28 is issued.


for ZooKeeper we do have order dependencies and we wanted to have 
multiple operations in progress at various stages of the pipeline to 
allow us to lower latencies as well as increase our bandwidth 
utilization, which led us to Zab.


ben

Qing Yan wrote:

Hello,
Anyone familiar with the Paxos protocol here?
I was doing some comparison of ZAB vs Paxos... first of all, ZAB's FIFO
based protocol is really cool!

 http://wiki.apache.org/hadoop/ZooKeeper/PaxosRun mentions the
inconsistency case for Paxos (the state change B depends upon A, but A was
not committed).
 In the Paxos Made Simple paper, the author suggests filling the GAP (lost
state machine changes) with a NO OP operation.

 Now I have some serious doubts about how Paxos could be useful in the real
world. yeah, you do reach consensus - albeit the content
is inconsistent/corrupted!?

 E.g. on the wiki page, why does the Paxos state machine allow firing off
27 and 28 concurrently when there is actually a dependency? Shouldn't you
wait for instance 27 to be committed before starting 28?
 Did I miss something?

 Thanks for the enlightenment!

   Cheers

Qing
  




RE: Does zookeeper support listening on a specified address?

2009-12-21 Thread Benjamin Reed
no, please open a jira as a new feature request.

sent from my droid

-Original Message-
From: Steve Chu [stv...@gmail.com]
Received: 12/21/09 3:44 AM
To: zookeeper-user@hadoop.apache.org [zookeeper-u...@hadoop.apache.org]
Subject: Does zookeeper support listening on a specified address?


Hi, all,

I only see the clientPort option in the configuration. Does zookeeper
support binding to a specific network address? In my box multiple
network interfaces are present and I want to bind to a specific one.

I checked src/java/main/org/apache/zookeeper/server/ServerConfig.java;
there seems to be no server address option.

Best Regards,

Steve


Re: Share Zookeeper instance and Connection Limits

2009-12-16 Thread Benjamin Reed
I agree with Ted, it doesn't seem like a good idea to do in practice. 
however, you do have a couple of options if you are just testing things:


1) use tmpfs
2) you can set forceSync to no in the configuration file to disable 
syncing to disk before acknowledging responses
3) if you really want to make the disk write go away, you can modify the 
SyncRequestProcessor in the code
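
For reference, option 2 is a one-line change in the server's zoo.cfg. The other settings shown are just a typical minimal configuration, and note that with syncing disabled a crash can lose acknowledged updates, so this is for testing only:

```
# zoo.cfg -- testing only: skip the fsync before acknowledging writes
tickTime=2000
dataDir=/tmp/zookeeper    # for option 1, point this at a tmpfs mount
clientPort=2181
forceSync=no
```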


ben

Ted Dunning wrote:

I think that this would be a very bad idea because of restart issues.  As it
stands, ZK reads from disk snapshots on startup to avoid moving as much data
from other members of the cluster.

You might consider putting the snapshots and log on a tmpfs file system if
you really, really want this.

On Wed, Dec 16, 2009 at 1:08 PM, Thiago Borges thbor...@gmail.com wrote:

  

Can a Zookeeper ensemble run only in memory rather than writing to both
memory and disk? Does this make sense if I have a highly reliable system?
(Of course at some point we need a dump to shut down and restart the entire
system.)

Which limits throughput first, the disk IO or the network?

Thanks for your quick response. I'm studying Zookeeper in my master's
thesis, for coordinating distributed index structures.






  




Re: size of data / number of znodes

2009-12-15 Thread Benjamin Reed
there aren't any limits on the number of znodes, it's just limited by 
your memory. there are two things (probably more :) to keep in mind:


1) the 1M limit also applies to the children list. you can't grow the 
list of children to more than 1M (the sum of the names of all of the 
children) otherwise you cannot do a getChildren(). so, yes, you need to 
do some bucketing to keep the number of children to something 
reasonable. assuming your names will be less than 100 bytes, you 
probably want to limit the number of children to 10,000.


2) since there are times that you need to do a state transfer between 
servers (dump all the state from one to the other to bring it online) it 
may take a while depending on your network speed. you may need to bump 
up the default initLimit, so make sure you do some benchmarking on your 
platform to test your configuration parameters.
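
The bucketing suggested in point 1 can be as simple as hashing each name into a fixed set of parent znodes; the paths and bucket count below are illustrative, not part of any ZooKeeper API:

```java
// Sketch of child bucketing: instead of /items/<name> with millions of
// direct children, hash each name into one of N bucket parents, keeping
// every getChildren() response comfortably under the 1M limit.
public class Buckets {
    static final int NUM_BUCKETS = 128;   // illustrative; size for ~10k children each

    static String bucketPath(String name) {
        int bucket = Math.floorMod(name.hashCode(), NUM_BUCKETS);
        return String.format("/items/bucket-%03d/%s", bucket, name);
    }

    public static void main(String[] args) {
        // The same name always maps to the same bucket, so lookups need no index.
        System.out.println(bucketPath("order-42"));
    }
}
```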


ben

Michael Bauland wrote:

Hello,

I'm new to the Zookeeper project and wondering whether our use case is a
good one for Zookeeper. I read the documentation, but couldn't find an
answer. At some point it says that

  

A common property of the various forms of coordination data is that they are 
relatively small: measured in kilobytes. The ZooKeeper client and the server 
implementations have sanity checks to ensure that znodes have less than 1M of 
data



I couldn't find any limits on the number of znodes used, only that each
znode should only contain little data. We were planning to use a million
znodes (each containing a few hundred bytes of data). Would this use
case be acceptable for Zookeeper? And if so, does it matter if we have a
flat hierarchy (i.e, all nodes have the root node as their direct
ancestor) or should we introduce some (artificial) hierarchy levels to
have a more tree-like structure?

Thanks in advance for your answer.
Cheers,

Michael

  




Re: Zookeeper Presentation

2009-11-13 Thread Benjamin Reed
there are a bunch of presentations you can grab at 
http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations


ben

Mark Vigeant wrote:

Hey Everyone,

I'm supposed to give a presentation next week about the basic functionality and 
uses of zookeeper. I was wondering if anybody out there had:


1)  A similar presentation that I could use at a starting point for 
inspiration

2)  A cool project they worked on in zookeeper that I can cite as an 
example of the strength and usefulness of zookeeper.

I'm going to also show them the example code and run a few things through the 
terminal. I am also an HBase user so that is also something I can use to talk 
about.

Thanks a lot for your time and help!

Mark Vigeant
RiskMetrics Group, Inc.

This email message and any attachments are for the sole use of the intended 
recipients and may contain proprietary and/or confidential information which 
may be privileged or otherwise protected from disclosure. Any unauthorized 
review, use, disclosure or distribution is prohibited. If you are not an 
intended recipient, please contact the sender by reply email and destroy the 
original message and any copies of the message as well as any attachments to 
the original message.
  




Re: Some thoughts on Zookeeper after using it for a while in the CXF/DOSGi subproject

2009-11-11 Thread Benjamin Reed
david, it should be pretty easy to do since we do it in our test cases. 
(start and stop servers.) the problem is that we haven't really exposed 
the interfaces. (but we have wanted to.) and we don't have tests for 
those non-existent exposed interfaces :) with a clean interface it 
should be pretty easy to get rid of the System.exits.


ben

dav...@apache.org wrote:

Ok - I get the message :)
Let me see if I can do some experimenting around running the zookeeper
server in OSGi and I'll report back...

David

2009/11/10 Patrick Hunt ph...@apache.org
  

I couldn't find a JIRA for removing the sys exits so I created one:
https://issues.apache.org/jira/browse/ZOOKEEPER-575

there's also this which seems like it should be easy for someone
who knows osgi container jar format requirements:
https://issues.apache.org/jira/browse/ZOOKEEPER-425

Now we just need to find someone who's interested and who would really like to 
run the server in his osgi container to work on these... (hint hint) ;-)

Patrick

Ted Dunning wrote:


Running ZK in an OSGi container is a great idea and should be really easy to
make happen.

This should be a new JIRA, in my opinion.

On Mon, Nov 9, 2009 at 12:10 PM, dav...@apache.org wrote:

  

Just wondering has any progress been made on this since? I would really
like
to run the ZooKeeper server as a bundle in my OSGi container. Is there a
JIRA to track this? If not I can create one :)




  




Re: Struggling with a simple configuration file.

2009-10-09 Thread Benjamin Reed
right at the beginning of 
http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperStarted.html it 
shows you the minimum standalone configuration.


that doesn't explain the 0 id. i'd like to try and reproduce it. do you 
have an empty data directory with a single file, myid, set to 1?

ben

Leonard Cuff wrote:

I've been developing for ZooKeeper for a couple months now, recently running
in a test configuration with 3 ZooKeeper servers. I'm running 3.2.1 with no
problems. Recently I tried to move to a single server configuration for the
development team environment, but couldn't get the configuration to work. I
get the error java.lang.RuntimeException: My id 0 not in the peer list

This would seem to imply that the myid file is set to zero. But ... it's set
to 1.


What's puzzling to me is my original configuration of servers was this:

server.1=ind104.an.dev.fastclick.net:2182:2183   --- The machine I'm trying
to run standalone on.
server.2=build101.an.dev.fastclick.net:2182:2183
server.3=cmedia101.an.dev.fastclick.net:2182:2183

I just removed the last two lines, and ran zkServer.sh start.  It fails with
the described log message. (Full log given below).
When I put the server.2 and server.3 lines back in, it works fine, and is
following the build101 machine.

I decided to try changing server.1 to server.0, and also changed the myid
file contents from 1 to zero.  I get a very different error scenario: a
continuously-occurring NullPointerException:

2009-10-09 04:22:36,284 - WARN  [QuorumPeer:/0.0.0.0:2181:quorump...@490] -
Unexpected exception
java.lang.NullPointerException
at 
org.apache.zookeeper.server.quorum.FastLeaderElection.totalOrderPredicate(Fa

stLeaderElection.java:466)
at 
org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLead

erElection.java:635)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)


I'm at a loss to know where I've gone astray.

Thanks in advance for any and all help.

Leonard

--- the first log

2009-10-09 04:08:58,769 - INFO  [main:quorumpeercon...@80] - Reading
configuration from:
/vcm/home/sandbox/ticket_161758-1/vcm/component/zookeeper/conf/zoo.cfg.dev
2009-10-09 04:08:58,795 - INFO  [main:quorumpeerm...@118] - Starting quorum
peer
2009-10-09 04:08:58,845 - FATAL [main:quorumpeerm...@86] - Unexpected
exception, exiting abnormally
java.lang.RuntimeException: My id 0 not in the peer list
at 
org.apache.zookeeper.server.quorum.QuorumPeer.startLeaderElection(QuorumPeer

.java:333)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:314)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMa

in.java:137)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPee

rMain.java:102)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:7

5)

-- the second log

2009-10-09 04:22:36,284 - WARN  [QuorumPeer:/0.0.0.0:2181:quorump...@490] -
Unexpected exception
java.lang.NullPointerException
at 
org.apache.zookeeper.server.quorum.FastLeaderElection.totalOrderPredicate(Fa

stLeaderElection.java:466)
at 
org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLead

erElection.java:635)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)

2009-10-09 04:22:36,285 - INFO  [QuorumPeer:/0.0.0.0:2181:quorump...@487] -
LOOKING
2009-10-09 04:22:36,285 - INFO
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@579] - New election: 12
2009-10-09 04:22:36,285 - INFO
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@618] - Notification: 0, 12,
43050, 0, LOOKING, LOOKING, 0
2009-10-09 04:22:36,285 - WARN  [QuorumPeer:/0.0.0.0:2181:quorump...@490] -
Unexpected exception
java.lang.NullPointerException
at 
org.apache.zookeeper.server.quorum.FastLeaderElection.totalOrderPredicate(Fa

stLeaderElection.java:466)
at 
org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLead

erElection.java:635)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)

2009-10-09 04:22:36,286 - INFO  [QuorumPeer:/0.0.0.0:2181:quorump...@487] -
LOOKING
2009-10-09 04:22:36,286 - INFO
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@579] - New election: 12
2009-10-09 04:22:36,286 - INFO
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@618] - Notification: 0, 12,
43051, 0, LOOKING, LOOKING, 0
2009-10-09 04:22:36,286 - WARN  [QuorumPeer:/0.0.0.0:2181:quorump...@490] -
Unexpected exception
java.lang.NullPointerException
at 
org.apache.zookeeper.server.quorum.FastLeaderElection.totalOrderPredicate(Fa

stLeaderElection.java:466)
at 
org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLead

erElection.java:635)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)

2009-10-09 04:22:36,286 - INFO  [QuorumPeer:/0.0.0.0:2181:quorump...@487] -
LOOKING
2009-10-09 04:22:36,287 - INFO

Re: The idea behind 'myid'

2009-09-25 Thread Benjamin Reed
can you clarify what you are asking for? are you just looking for 
motivation? or are you trying to find out how to use it?


the myid file just has the unique identifier (number) of the server in 
the cluster. that number is matched against the id in the configuration 
file. there isn't much to say about it: 
http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperStarted.html
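
The matching ben describes -- the number in myid against the server.N keys in the config -- can be sketched in a few lines of parsing. The class and method names here are illustrative; in the real server the check happens in QuorumPeer, and a mismatch is fatal ("My id N not in the peer list"):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: each server reads its own id from dataDir/myid and looks for a
// matching server.<id>= entry in zoo.cfg; no match means the server cannot
// identify itself in the ensemble and refuses to start.
public class MyIdCheck {
    static Set<Integer> peerIds(String[] configLines) {
        Set<Integer> ids = new HashSet<>();
        for (String line : configLines) {
            if (line.startsWith("server.")) {
                // "server.1=zoo1:2888:3888" -> 1
                ids.add(Integer.parseInt(line.substring(7, line.indexOf('='))));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String[] cfg = {
            "tickTime=2000",
            "server.1=zoo1:2888:3888",
            "server.2=zoo2:2888:3888",
            "server.3=zoo3:2888:3888",
        };
        int myid = 1;                                    // contents of dataDir/myid
        System.out.println(peerIds(cfg).contains(myid)); // true: this server is in the ensemble
    }
}
```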


ben

Ørjan Horpestad wrote:

Hi!
Can someone point me to a site (or please explain) where I can
read about the use of the myid file for configuring the id of the
ZooKeeper servers?
I'm sure there is a good reason for using this approach, but it is the
first time I have come across this type of non-automatic way of
administrating replicas.

Regards, Orjan
  




Re: The idea behind 'myid'

2009-09-25 Thread Benjamin Reed
you and ted are correct. the id gives zookeeper a stable identifier to 
use even if the ip address changes. if the ip address doesn't change, we 
could use that, but we didn't want to make that a built in assumption. 
if you really do have a rock solid ip address, you could make a wrapper 
startup script that starts up and creates the myid file based on the ip 
address. i gotta say though, i've found that such assumptions are often 
found to be invalid.


ben

Eric Bowman wrote:

Another way of doing it, though, would be to tell each instance which IP
to use at startup.

That way the config can be identical for all users, and there can be
whatever logic is required to figure out the right IP address, in the
place where logic is executing anyhow.

I do agree that maintaining the myid file is awkward compared to other
approaches that are working elsewhere.  It's not really clear what
purpose the myid serves except to bind an ip address to a running instance.

cheers,
Eric

Ted Dunning wrote:
  

A server doesn't have a unique IP address.

Each interface can have 1 or more IP addresses and there can be many
interfaces.  Furthermore, an IP address can move from one machine to
another.

2009/9/25 Ørjan Horpestad orj...@gmail.com

  


Hi Ben

Well, I'm just wondering why the server's own unique IP address isn't
good enough as a valid identifier; it strikes me as a bit
exhausting to manually set the id for each server in the cluster. Or
maybe there are some details I'm not getting here :-)

Regards, Orjan

On Fri, Sep 25, 2009 at 3:56 PM, Benjamin Reed br...@yahoo-inc.com
wrote:

  

can you clarify what you are asking for? are you just looking for
motivation? or are you trying to find out how to use it?

the myid file just has the unique identifier (number) of the server in
  


the

  

cluster. that number is matched against the id in the configuration file.
there isn't much to say about it:
http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperStarted.html

ben

Ørjan Horpestad wrote:
  


Hi!
Can someone pin-point me to a site (or please explain ) where I can
read about the use of the myid-file for configuring the id of the
ZooKeeper servers?
I'm sure there is a good reason for using this approach, but it is the
first time I have come over this type of non-automatic way for
administrating replicas.

Regards, Orjan


  
  



  




  




Re: How to expire a session

2009-09-25 Thread Benjamin Reed
so you have two problems going on. both have the same root: 
zookeeper_init returns before a connection and session is established 
with zookeeper, so you will not be able to fill in myid until a 
connection is made. you can do something with a mutex in the watcher to 
wait for a connection, or you could do something simple like:


while (zoo_state(zh_1) != ZOO_CONNECTED_STATE) {
    sleep(1);
}
myid = *zoo_client_id(zh_1);

the second part of the problem is related. you need to make sure you are 
connected before you do the close.


ben

Leonard Cuff wrote:

In the FAQ, there is a question
4. Is there an easy way to expire a session for testing?

And the last part of the answer reads:
   In the case of testing we want to cause a problem, so to explicitly
expire a session an application connects to ZooKeeper, saves the
session id and password, creates another ZooKeeper handle with that
id and password, and then closes the new handle. Since both handles
reference the same session, the close on second handle will
invalidate the session causing a SESSION_EXPIRED on the first handle.


(I assume that when it says "creates another ZooKeeper handle" it
means doing that by calling zookeeper_init. Is that correct?)


Here's my skeleton code, which doesn't work ...


clientid_t   myid;
clientid_t   another_id;
zhandle_t   *zh_1;
zhandle_t   *zh_2;

zoo_deterministic_conn_order(1);
zh_1 = zookeeper_init(servers, watcher, 1, &myid, 0, 0);
if (!zh_1) {
    ... error ...
}

// Catch SIGUSR1 and set the havoc flag
if (cry_havoc_and_let_loose_the_dogs_of_war) {
    memcpy(&another_id, &myid, sizeof(clientid_t));
    zh_2 = zookeeper_init(servers, destroy_watcher, 1,
                          &another_id, 0, 0);
    if (!zh_2) {
        ... error ...
    }
    zookeeper_close(zh_2);   // Shouldn't I get a session expire
                             // shortly after this?
}

But I don't get a session expire.  Can someone tell me what I'm doing wrong?

TIA,

Leonard

Leonard Cuff
lc...@valueclick.com

"This email and any files included with it may contain privileged,
proprietary and/or confidential information that is for the sole use of the
intended recipient(s).  Any disclosure, copying, distribution, posting, or
use of the information contained in or attached to this email is prohibited
unless permitted by the sender.  If you have received this email in error,
please immediately notify the sender via return e-mail, telephone, or fax
and destroy this original transmission and its included files without
reading or saving it in any manner. Thank you."










Re: Start problem of Running Replicated ZooKeeper

2009-09-23 Thread Benjamin Reed
The connection refused message, as opposed to no route to host or 
unknown host, indicates that zookeeper has not been started on the other 
machines. are the other machines giving similar errors?


ben

Le Zhou wrote:

Hi,
I'm trying to install HBase 0.20.0 in fully distributed mode on my cluster.
As HBase depends on Zookeeper, I have to know first how to make Zookeeper
work.
I download the release 3.2.1 and install it on each machine in my cluster.

Zookeeper in standalone mode works well on each machine in my cluster. I
follow the Zookeeper Getting Started Guide and get the expected output. Then
I come to the Running Replicated ZooKeeper section.

On each machine in my cluster (debian-0, debian-1, debian-5), I append the
following lines to zoo.cfg, and create in dataDir a myid file which contains
the server id (1 for debian-0, 2 for debian-1, 3 for debian-5).

server.1=debian-0:2888:3888
server.2=debian-1:2888:3888
server.3=debian-5:2888:3888

then I start zookeeper server by running bin/zkServer.sh start, and I got
the following output:

cl...@debian-0:~/zookeeper$ bin/zkServer.sh start
JMX enabled by default
Using config: /home/cloud/zookeeper-3.2.1/bin/../conf/zoo.cfg
Starting zookeeper ...
STARTED
cl...@debian-0:~/zookeeper$ 2009-09-23 15:30:27,976 - INFO
 [main:quorumpeercon...@80] - Reading configuration from:
/home/cloud/zookeeper-3.2.1/bin/../conf/zoo.cfg
2009-09-23 15:30:27,981 - INFO  [main:quorumpeercon...@232] - Defaulting to
majority quorums
2009-09-23 15:30:28,009 - INFO  [main:quorumpeerm...@118] - Starting quorum
peer
2009-09-23 15:30:28,034 - INFO  [Thread-1:quorumcnxmanager$liste...@409] -
My election bind port: 3888
2009-09-23 15:30:28,045 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@487] - LOOKING
2009-09-23 15:30:28,070 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@579] - New election:
-1
2009-09-23 15:30:28,075 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@618] - Notification:
1, -1, 1, 1, LOOKING, LOOKING, 1
2009-09-23 15:30:28,075 - WARN  [WorkerSender Thread:quorumcnxmana...@336] -
Cannot open channel to 2 at election address debian-1/172.20.53.86:3888
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect(Native Method)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
at java.lang.Thread.run(Thread.java:619)
2009-09-23 15:30:28,085 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@642] - Adding vote
2009-09-23 15:30:28,099 - WARN  [WorkerSender Thread:quorumcnxmana...@336] -
Cannot open channel to 3 at election address debian-5/172.20.14.194:3888
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect(Native Method)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
at java.lang.Thread.run(Thread.java:619)
2009-09-23 15:30:28,288 - WARN
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorumcnxmana...@336] - Cannot open
channel to 2 at election address debian-1/172.20.53.86:3888
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect(Native Method)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:356)
at
org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:603)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)

The terminal keeps outputting the WARN messages until I stop the ZooKeeper server.

I googled "zookeeper cannot open channel to at address" and searched the
mailing list archives, but got nothing helpful.

I need your help, thanks and best regards!
  




RE: A question about Connection timed out and operation timeout

2009-08-20 Thread Benjamin Reed
are you using the single-threaded or multithreaded C library? the "exceeded 
deadline" message means that our thread was supposed to get control after a 
certain period, but we got control that many milliseconds late. what is your 
session timeout?

ben


From: Qian Ye [yeqian@gmail.com]
Sent: Thursday, August 20, 2009 3:17 AM
To: zookeeper-user
Subject: A question about Connection timed out and operation timeout

Hi guys:

I met the problem again: an ephemeral node disappeared, and I found out
because my application got an operation timeout.

My application, which created an ephemeral node on the ZooKeeper server,
printed the following log:

*WARNING: 08-20 03:09:20:  auto * 182894118176
[logid:][reqip:][auto_exchanger_zk_basic.cpp:605]get children
fail.[/forum/elect_nodes][-7][operation timeout]*

and the Zookeeper client printed the following log (the log level is INFO)

2009-08-19 21:36:18,067:3813(0x9556c520):zoo_i...@log_env@545: Client
environment:zookeeper.version=zookeeper C client 3.2.0
606 2009-08-19 21:36:18,067:3813(0x9556c520):zoo_i...@log_env@549:
Client environment:host.name=jx-ziyuan-test00.jx.baidu.com
607 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@557:
Client environment:os.name=Linux
608 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@558:
Client environment:os.arch=2.6.9-52bs
609 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@559:
Client environment:os.version=#2 SMP Fri Jan 26 13:34:38 CST 2007
610 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@567:
Client environment:user.name=club
611 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@577:
Client environment:user.home=/home/club
612 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@589:
Client environment:user.dir=/home/club/user/luhongbo/auto-exchanger
613 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@zookeeper_init@613:
Initiating client connection,
host=127.0.0.1:2181,127.0.0.1:2182 sessionTimeout=2000 watcher=0x408c56
sessionId=0x0 sessionPasswd=null context=(nil) flags=0
614 2009-08-19 21:36:18,069:3813(0x41401960):zoo_i...@check_events@1439:
initiated connection to server [127.0.0.1:2181]
615 2009-08-19 21:36:18,070:3813(0x41401960):zoo_i...@check_events@1484:
connected to server [127.0.0.1:2181] with session id=1232c1688a20093
616 2009-08-20
02:48:01,780:3813(0x41401960):zoo_w...@zookeeper_interest@1335:
Exceeded deadline by 520ms
617 2009-08-20
03:08:52,332:3813(0x41401960):zoo_w...@zookeeper_interest@1335:
Exceeded deadline by 14ms
618 2009-08-20
03:09:04,666:3813(0x41401960):zoo_w...@zookeeper_interest@1335:
Exceeded deadline by 48ms
619 2009-08-20
03:09:09,733:3813(0x41401960):zoo_w...@zookeeper_interest@1335:
Exceeded deadline by 24ms
620 *2009-08-20
03:09:20,289:3813(0x41401960):zoo_w...@zookeeper_interest@1335: Exceeded
deadline by 264ms*
621 *2009-08-20
03:09:20,295:3813(0x41401960):zoo_er...@handle_socket_error_msg@1388: Socket
[127.0.0.1:2181] zk retcode=-7, errno=110(Connection timed out): conn
ection timed out (exceeded timeout by 264ms)*
622 *2009-08-20
03:09:20,309:3813(0x41401960):zoo_w...@zookeeper_interest@1335: Exceeded
deadline by 284ms*
623 *2009-08-20
03:09:20,309:3813(0x41401960):zoo_er...@handle_socket_error_msg@1433: Socket
[127.0.0.1:2182] zk retcode=-4, errno=111(Connection refused):
server refused to accept the client*
624 *2009-08-20 03:09:20,353:3813(0x41401960):zoo_i...@check_events@1439:
initiated connection to server [127.0.0.1:2181]*
625 *2009-08-20 03:09:20,552:3813(0x41401960):zoo_i...@check_events@1484:
connected to server [127.0.0.1:2181] with session id=1232c1688a20093*

I don't know why the connection timeout happened at 2009-08-20 03:09:20,295,
or why the server refused to accept the client. Could someone give me any
hints? I'm also not sure what "Exceeded deadline by xx ms" means, so I need
some help with that too.


P.S. I am using ZooKeeper 3.2.0 (server and C client API) and running a
stand-alone instance.

Thx all~

--
With Regards!

Ye, Qian
Made in Zhejiang University


Re: Errors when run zookeeper in windows ?

2009-08-19 Thread Benjamin Reed
good point david! zhang can you try david's scripts? we should probably 
commit those. thanx for pointing them out david.


ben

David Bosschaert wrote:

FWIW, I've uploaded some Windows versions of the zookeeper scripts to
https://issues.apache.org/jira/browse/ZOOKEEPER-426 a while ago. They
run from the ordinary windows shell, so no need for Cygwin or anything
like that. I'm using Zookeeper from Windows all the time and they work
fine for me.

I did notice that the scripts didn't get included in the latest 3.2.0
release. It might be worth putting some Windows scripts in the next
release as nothing in Zookeeper is unix specific (except for the
scripts ;)

Best regards,

David

2009/8/19 zhang jianfeng zjf...@gmail.com:
  

Yes, I am using Cygwin and JDK 1.6.

The command to start ZooKeeper is the same as in the Getting Started guide:
bin/zkServer.sh start

The following is the whole message:

zjf...@zjf ~/zookeeper-3.1.1
$ *bin/zkServer.sh start*
JMX enabled by default
Starting zookeeper ... STARTED

zjf...@zjf ~/zookeeper-3.1.1
$ java.lang.NoClassDefFoundError:
Files\Java\jre6\lib\ext\QTJava/zip;D:\Java\lib\hadoop-0/18/0\build\tools:/home/zjffdu/zookeeper-3/1/1/binzookeeper-3/1/1/jar:/home/zjffdu/zookeeper-3/1/1/binlib/junit-4/4/jar:/home/zjffdu/zookeeper-3/1/1/binlib/log4j-1/2/15/jar:/home/zjffdu/zookeeper-3/1/1/binsrc/java/lib/junit-4/4/jar:/home/zjffdu/zookeeper-3/1/1/binsrc/java/lib/log4j-1/2/15/jar
Caused by: java.lang.ClassNotFoundException:
Files\Java\jre6\lib\ext\QTJava.zip;D:\Java\lib\hadoop-0.18.0\build\tools:.home.zjffdu.zookeeper-3.1.1.binzookeeper-3.1.1.jar:.home.zjffdu.zookeeper-3.1.1.binlib.junit-4.4.jar:.home.zjffdu.zookeeper-3.1.1.binlib.log4j-1.2.15.jar:.home.zjffdu.zookeeper-3.1.1.binsrc.java.lib.junit-4.4.jar:.home.zjffdu.zookeeper-3.1.1.binsrc.java.lib.log4j-1.2.15.jar
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at java.lang.ClassLoader.loadClassInternal(Unknown Source)
Could not find the main class:
Files\Java\jre6\lib\ext\QTJava.zip;D:\Java\lib\hadoop-0.18.0\build\tools:/home/zjffdu/zookeeper-3.1.1/bin/../zookeeper-3.1.1.jar:/home/zjffdu/zookeeper-3.1.1/bin/../lib/junit-4.4.jar:/home/zjffdu/zookeeper-3.1.1/bin/../lib/log4j-1.2.15.jar:/home/zjffdu/zookeeper-3.1.1/bin/../src/java/lib/junit-4.4.jar:/home/zjffdu/zookeeper-3.1.1/bin/../src/java/lib/log4j-1.2.15.jar.
Program will exit.
$



Thank you

Jeff zhang


On Tue, Aug 18, 2009 at 12:53 PM, Patrick Hunt ph...@apache.org wrote:



you are using java 1.6 right? more detail on the class not found would be
useful (is that missing or just not included in your email?) Also the
command line you're using to start the app would be interesting.

Patrick


Mahadev Konar wrote:

  

Hi Zhang,
 Are you using cygwin?

mahadev


On 8/17/09 11:23 PM, zhang jianfeng zjf...@gmail.com wrote:

 Hi all,


I tried to run zookeeper in windows, but the following errors appears:





$ java.lang.NoClassDefFoundError:

Files\Java\jre6\lib\ext\QTJava/zip;D:\Java\lib\hadoop-0/18/0\build\tools:/home

/zjffdu/zookeeper-3/1/1/binzookeeper-3/1/1/jar:/home/zjffdu/zookeeper-3/1/

1/binlib/junit-4/4/jar:/home/zjffdu/zookeeper-3/1/1/binlib/log4j-1/2/1

5/jar:/home/zjffdu/zookeeper-3/1/1/binsrc/java/lib/junit-4/4/jar:/home/zjf
fdu/zookeeper-3/1/1/binsrc/java/lib/log4j-1/2/15/jar
Caused by: java.lang.ClassNotFoundException:

Files\Java\jre6\lib\ext\QTJava.zip;D:\Java\lib\hadoop-0.18.0\build\tools:.home

.zjffdu.zookeeper-3.1.1.binzookeeper-3.1.1.jar:.home.zjffdu.zookeeper-3.1.

1.binlib.junit-4.4.jar:.home.zjffdu.zookeeper-3.1.1.binlib.log4j-1.2.1

5.jar:.home.zjffdu.zookeeper-3.1.1.binsrc.java.lib.junit-4.4.jar:.home.zjf
fdu.zookeeper-3.1.1.binsrc.java.lib.log4j-1.2.15.jar
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at java.lang.ClassLoader.loadClassInternal(Unknown Source)
Could not find the main class:

Files\Java\jre6\lib\ext\QTJava.zip;D:\Java\lib\hadoop-0.18.0\build\tools:/home

/zjffdu/zookeeper-3.1.1/bin/../zookeeper-3.1.1.jar:/home/zjffdu/zookeeper-3.1.

1/bin/../lib/junit-4.4.jar:/home/zjffdu/zookeeper-3.1.1/bin/../lib/log4j-1.2.1


RE: exist return true before event comes in

2009-08-03 Thread Benjamin Reed
I assume you are calling the synchronous version of exists. The callbacks for 
both the watches and async calls are processed by a callback thread, so the 
ordering is strict. Synchronous call responses are not queued to the callback 
thread. (this allows you to make synchronous calls in callbacks without 
deadlocking.) thus the effect you are seeing may be due to a backed up callback 
queue and/or thread scheduling.

ben

Sent from my phone.

-Original Message-
From: Stefan Groschupf s...@101tec.com
Sent: Monday, August 03, 2009 9:31 PM
To: zookeeper-user@hadoop.apache.org zookeeper-user@hadoop.apache.org
Subject: exist return true before event comes in


Hi,

I'm running into the following problem writing a facade for ZkClient
(http://github.com/joa23/zkclient/):

1.) Subscribe a watch via exist(path, true) for a path.
2.) Create a persistent node.
3.) Call exist and it returns true
4.) Zookeeper sends a NodeCreated event.


I would expect that the client would get the NodeCreated event before
exist returns true.
Does anyone have an idea for a pattern that ensures exists returns
false until the event has been triggered?
Thanks,
Stefan
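Given Ben's explanation above — synchronous responses bypass the callback thread, while asynchronous responses share it with watch events — one way to get the ordering Stefan expects is to use the asynchronous exists(). A pseudocode-style sketch, not runnable as-is (it assumes a connected handle zk and a registered Watcher watcher, and elides error handling):

```java
// Hypothetical sketch: the async exists() response is queued to the same
// event thread that delivers watch events, so it cannot arrive "before"
// a NodeCreated notification the way a synchronous return value can.
zk.exists(path, watcher, new AsyncCallback.StatCallback() {
    public void processResult(int rc, String p, Object ctx, Stat stat) {
        // runs on the callback thread, strictly ordered with watch events;
        // stat == null means the node did not exist when the server replied
    }
}, null);
```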



RE: c client header location

2009-08-02 Thread Benjamin Reed
Or maybe /usr/local/include/zookeeper but either way c-client-src is weird. 
Please open a jira.

Thanx
ben

Sent from my phone.

-Original Message-
From: Michi Mutsuzaki mi...@cs.stanford.edu
Sent: Saturday, August 01, 2009 6:15 PM
To: zookeeper-user@hadoop.apache.org zookeeper-user@hadoop.apache.org
Subject: c client header location


Hello,

Why do the headers get installed in /usr/local/include/c-client-src?
Shouldn't it go to /usr/local/include?

Thanks!
--Michi


Re: Zookeeper WAN Configuration

2009-07-25 Thread Benjamin Reed
the processing of the write transaction is described in the zookeeper 
internals presentation on 
http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations i think 
other presentations may also touch on it. we also have it in the 
ZooKeeper documentation: 
http://hadoop.apache.org/zookeeper/docs/r3.2.0/zookeeperInternals.html


ben


Todd Greenwood wrote:

Flavio & Ted, thank you for your comments.

So it sounds like the only way to currently deploy to the WAN is to
deploy ZK Servers to the central DC and open up client connections to
these ZK servers from the edge nodes. True?

In the future, once the Observers feature is implemented, then we should
be able to deploy zk servers to both the DC and to the pods...with all
the goodness that Flavio mentions below.

Flavio - do you have a doc that describes exactly what happens in the
transaction of a write operation? For instance, I'd like to know at
exactly what stage a write has been committed to the ensemble, and not
just the zk server the client is connected to. I figure it must be
something like:

clientA.write(path, value)
- serverA writes to memory
- serverA writes to transacted disk every n/seconds or m/bytes
- serverA sends write to Leader
- Leader stamps with transaction id
- Leader responds to ensemble with update + transaction id

-Todd

-Original Message-
From: Flavio Junqueira [mailto:f...@yahoo-inc.com] 
Sent: Friday, July 24, 2009 4:50 PM

To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

Just a few quick observations:

On Jul 24, 2009, at 4:40 PM, Ted Dunning wrote:

  

On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
to...@audiencescience.comwrote:



Could you explain the idea behind the Observers feature, what this
concept is supposed to address, and how it applies to the WAN
configuration problem in particular?

  
Not really.  I am just echoing comments on observers from them that  
know.





Without observers, increasing the number of servers in an ensemble enables
higher read throughput, but causes write throughput to drop because the
number of votes needed to order each write operation increases. Essentially,
observers are zookeeper servers that don't vote when ordering updates to the
zookeeper state. Adding observers enables higher read throughput while
minimally affecting write throughput (the leader still has to send commits
to everyone, at least in the version we have been working on).


  


The ideas for federating ZK or allowing observers would likely do what you
want.  I can imagine that an observer would only care that it can see its
local peers and one of the observers would be elected to get updates (and
thus would care about the central service).

This certainly sounds like exactly what I want...Was this  
introduced in

3.2 in full, or only partially?

  
I don't think it is even in trunk yet.  Look on Jira or at the  
recent logs

of this mailing list.



It is not on trunk yet.

-Flavio

  




Re: Multiple ZK clusters or a single, shared cluster?

2009-07-17 Thread Benjamin Reed
we designed zk to have high performance so that it can be shared by 
multiple applications. the main thing is to use dedicated zk 
machines (with a dedicated disk for logging). once you have that in 
place, watch the load on your cluster; as long as you aren't saturating 
the cluster, you should share.


as you point out, running multiple clusters is a hardware investment, 
plus you miss out on opportunities to improve reliability. for example, 
if you have three applications with a cluster of 3 zk servers each, 
two failures in the same cluster will result in an outage. if instead of 
using the 9 servers you have the same three applications share a zk 
cluster with 7 servers, you can tolerate three failures without an outage.


the key of course is to make sure that you don't oversubscribe the server.

ben
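Ben's arithmetic follows the majority-quorum rule: an ensemble of n voting servers tolerates floor((n - 1) / 2) failures. A small illustration (hypothetical helper, not part of ZooKeeper):

```java
// Majority-quorum failure tolerance: an ensemble of n voting servers
// stays available as long as a strict majority of them survives.
public class QuorumTolerance {
    static int tolerance(int n) {
        return (n - 1) / 2; // failures survivable under majority quorums
    }

    public static void main(String[] args) {
        System.out.println(tolerance(3)); // each separate 3-server cluster tolerates 1 failure
        System.out.println(tolerance(7)); // one shared 7-server cluster tolerates 3
        System.out.println(tolerance(9)); // all 9 servers in one ensemble would tolerate 4
    }
}
```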

Jonathan Gray wrote:

Hey guys,

Been using ZK indirectly for a few months now in the HBase and Katta 
realms.  Both of these applications make it really easy so you don't 
have to be involved much with managing your ZK cluster to support it.


I'm now using ZK for a bunch of things internally, so now I'm manually 
configuring, starting, and managing a cluster.


What advice is there about whether I should be sharing a single cluster 
between all my applications, or running separate ones for each use?


I've been told that it's strongly recommended to run your ZK nodes 
separately from the application using them (this is actually what we're 
telling new users over in HBase, though a majority of installations will 
likely co-host them with DataNodes and RegionServers).


I don't have the resources to maintain a separate 3+ node ZK cluster for 
each of my applications, so this is not really an option.  I'm trying to 
decide if I should have HBase running/managing its own ZK cluster that 
is co-located with some of the regionservers (there will be ample 
memory, but ZK will not have a dedicated disk), or if I should be 
pointing it to a dedicated 3 node ZK cluster.


I would then also have Katta pointing at this same shared cluster (or a 
separate cluster would be co-located with katta nodes).  Same for my 
application; could share nodes with the app servers or pointed at a 
single ZK cluster.


Trade-offs I should be aware of?  Current best practices?

Any help would be much appreciated.  Thanks.

Jonathan Gray
  




Re: Question about the sequential flag on create.

2009-07-14 Thread Benjamin Reed
the create is atomic. we just use a data structure that does not store 
the list of children in order.


ben

Erik Holstad wrote:

Hey Patrik!
Thanks for the reply.
I understand all the reasons that you posted above and totally agree that
nodes should not be sorted since you then have to pay that overhead for
every node, even though you might not need or want it.
I just thought that it might be possible to create a sequential node
atomically, but I guess that is not how it works?

Regards Erik
  




Re: Confused about KeeperState.Disconnected and KeeperState.Expired

2009-06-24 Thread Benjamin Reed

sorry to jump in late.

if i understand the scenario correctly, you are partitioned from ZK, but 
you still have access to the NN on which you are holding leases to 
files. the problem is that even though your ephemeral nodes may time out, 
you are still holding a lease on the NN and recovery would go faster if 
you actually closed the file. right? or is it deeper than that? can you 
open a file in such a way that you stomp the lease? or make sure that 
the lease timeout is smaller than the session timeout and only renew if 
you are still connected to ZK?


thanx
ben

Jean-Daniel Cryans wrote:

If the machine was completely partitioned, as far as I know, it would lose
its lease, so the only thing we have to make sure of is clearing the
state of the region server by doing a restart so that it's ready to come
back in the cluster. If ZK is down but the rest is up, closing the files in
HDFS should ensure that we lose a minimum of data if not losing any.

I think that in a multi-rack setup it is possible to not be able to talk to
ZK but to be able to talk to the Namenode as machines can be anywhere.
Especially in HBase 0.20, the master can failover on any node that has a
backup Master ready. So in that case, the region server should consider
itself gone from the cluster and close any connection it has and restart.

Those are very legitimate questions Gustavo, thanks for asking.

J-D

On Wed, Jun 24, 2009 at 3:38 PM, Gustavo Niemeyer gust...@niemeyer.netwrote:

  

Ben's opinion is that it should not belong in the default API but in the
common client that another recent thread was about. My opinion is just that
I need such a functionality, wherever it is.
  

Understood, sorry.  I just meant that it feels like something that
would likely be useful to other people too, so might have a role in
the default API to ensure it gets done properly considering the
details that Ben brought up.



If the node gets the exception (or has its own timer), as I wrote, it will
shut itself down to release HDFS leases as fast as possible. If ZK is really
down and it's not a network partition, then HBase is down and this is fine
because it won't be able to work anyway.
  

Right, that's mostly what I was wondering.  I was pondering about
under which circumstances the node would be unable to talk to the
ZooKeeper server but would still be holding the HDFS lease in a way
that prevented the rest of the system from going on.  If I understand
what you mean, if ZooKeeper is down entirely, HBase would be down for
good. If the machine was partitioned off entirely, the HDFS side of
things will also be disconnected, so shutting the node down won't help
the rest of the system recovering.

--
Gustavo Niemeyer
http://niemeyer.net






Re: ZooKeeper heavy CPU utilisation

2009-06-02 Thread Benjamin Reed

can you attach the jstack output? it seems to be missing from your email.

ben

Satish Bhatti wrote:
I am running a 5 node ZooKeeper cluster and I noticed that one of them 
has very high CPU usage:


 PID   USER    PR  NI  VIRT  RES   SHR  S  %CPU  %MEM    TIME+  COMMAND
 6883  infact  22   0  725m  41m  4188  S    95   0.5  5671:54  java


It is not doing anything application-wise at this point, so I was 
wondering why the heck it's using up so much CPU!  I have attached a 
jstack logfile to this email.


Satish





Re: Some thoughts on Zookeeper after using it for a while in the CXF/DOSGi subproject

2009-05-29 Thread Benjamin Reed

this is great to hear. it's great to see siblings playing together ;)


* In CXF we use Maven to build everything. To depend on Zookeeper we
need to pull it in from a Maven repository. I couldn't find Zookeeper
in any main Maven repos, so currently we're pulling it in from
http://people.apache.org/~chirino/zk-repo (a private repo), which is
not ideal. Would there be any chance of getting the zookeeper.jar file
deployed to one of the main Maven repo's (e.g.
http://repo2.maven.org/maven2/)?
  
yeah this is an increasing thorn in our side. some of us would like to 
convert to maven, but we are tied to the hadoop build process since we 
reuse all of their build/test infrastructure. we will probably be using 
ivy to connect to maven repositories.

* To use Zookeeper from within OSGi it has to be turned into an OSGi
bundle. Doing this is not hard and it's currently done in our
buildsystem [1]. However, I think it would make sense to have this
done somewhere in the Zookeeper buildsystem. Matter of fact I think
you should be able to release a single zookeeper.jar that's both an
ordinary jar and an OSGi bundle so it would work in both cases...
  

i completely agree. please open a jira and submit a patch.

* The Zookeeper server is currently started with the zkServer.sh
script, but I think it would make sense to also allow it to run inside
an OSGi container, simply by starting a bundle. Has anyone ever done
any work in this regard? If not I'm planning to spend some time and
try to make this work.
  
we have a current open jira about making it possible to embed the 
zookeeper server in other applications. the big problem is the 
System.exits that we have sprinkled around. it shouldn't be hard to make 
happen since we start and stop the server in our unit tests.

* BTW I made some Windows versions of the zkCli/zkEnv/zkServer
scripts. Interested in taking these?
  

excellent. please submit a jira and patch!

i'm so glad you are working on this. i've been thinking for a long time 
that ZooKeeper would fit really well with OSGi, but i haven't had time 
to work on it. thank you!


ben



RE: NodeChildrenChanged WatchedEvent

2009-05-11 Thread Benjamin Reed
good summary ted. just to add a bit. another motivation for the current design 
is what scott had mentioned earlier: not sending a flood of changes when the 
value of a node is changing rapidly. implicit in this is the fact that we do 
not send the value in the events. not only would including the value make the 
events much more heavyweight, it would also lead to bad programming practices 
(see the faq). since we don't send data in the events, sending 3 "data changed" 
events in a row is the same as just sending the last data changed event.

i also agree with ted about the wrappers. unless they are used to implement a 
new construct, usually they just introduce bugs. however, there are two things 
i want to point out. first, the current exception handling ranges from a pain 
to, in the case of create() with SEQUENTIAL and EPHEMERAL, almost impossible, 
so we want to make connection recovery a bit more sophisticated; when a 
connection goes down, the client and server figure out what happened to the 
pending requests so that we never need to error them out with the "i have no 
idea what happened" exception, aka CONNECTION LOSS. second, higher level 
constructs in the form of recipes are great! for more sophisticated constructs 
it is great to have things implemented once and thoroughly debugged.

ben

ps - one other clarification: in ZK 3, the watches are still tracked locally. 
it's just that in ZK 3 the client now has the ability to tell the server what 
it was watching and what was the last thing seen when it reconnects. the server 
can then figure out which watches were missed and need to be retriggered and 
which watches need to be reregistered.
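The no-data-in-events design implies the standard client pattern: treat a watch event purely as a hint to re-read. A pseudocode-style sketch, not runnable as-is (it assumes a connected handle zk and elides error handling):

```java
// Hypothetical sketch: since events carry no data and may be coalesced,
// re-read the latest value and re-register the watch in one getData() call.
Watcher watcher = new Watcher() {
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                byte[] latest = zk.getData(event.getPath(), this, null);
                // ...use latest; intermediate versions may never be seen
            } catch (Exception e) {
                // error handling elided
            }
        }
    }
};
```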
 
__
From: Ted Dunning [ted.dunn...@gmail.com]
Sent: Saturday, May 09, 2009 1:06 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: NodeChildrenChanged WatchedEvent

Making things better is always good.

I have found that in practice, most wrappers of ZK lead to serious errors
and should be avoided like the plague.  This particular use case is not a
big deal for me to code correctly (in Java, anyway) and I do it all the
time.

It may be that the no-persistent-watch policy was partly an artifact of the
ZK 1 and ZK 2 situation where ZK avoided keeping much of anything around per
session other than ephemeral files.  This has changed in ZK 3 and it might
be plausible to have more persistent watches.

On the other hand, I believe that Ben purposely avoided having this type of
watch to automatically throttle the number of notifications to be equal to
the rate at which the listener can handle them.  Having seen a number of
systems that didn't throttle this way up close and personal, I have lots of
empathy with that position.  Since I don't have any issue with looking
for changes, I would tend to just go with whatever Ben suggests.  His
opinions (largely based on watching people code with ZK) are pretty danged
good.

On Sat, May 9, 2009 at 12:37 PM, Scott Carey sc...@richrelevance.comwrote:

 What I am suggesting are higher level constructs that do these repeated
 mundane tasks for you to handle those use cases where the verbosity of the
 API is a hindrance to quality and productivity.




--
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)


RE: Moving ZooKeeper Servers

2009-05-06 Thread Benjamin Reed
yes, /zookeeper is part of the reserved namespace for zookeeper internals. you 
should ignore it for such things.

ben
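A "short recursive function" like the one Satish mentions below might look roughly like this — a hypothetical sketch that assumes two connected handles src and dst, copies persistent nodes only, and elides error handling and ACLs:

```java
// Hypothetical sketch: copy a tree from one ensemble to another,
// skipping the reserved /zookeeper subtree as Ben advises.
void copy(ZooKeeper src, ZooKeeper dst, String path) throws Exception {
    if (path.equals("/zookeeper"))
        return; // reserved for zookeeper internals (e.g. /zookeeper/quota)
    byte[] data = src.getData(path, false, null);
    if (!path.equals("/")) // the root already exists on the destination
        dst.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    for (String child : src.getChildren(path, false))
        copy(src, dst, (path.equals("/") ? "" : path) + "/" + child);
}
```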

From: Satish Bhatti [cthd2...@gmail.com]
Sent: Wednesday, May 06, 2009 2:57 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Moving ZooKeeper Servers

I ended up going with that suggestion, a short recursive function did the
trick!  However, I noticed the following nodes:
/zookeeper
/zookeeper/quota

that were not created by me.  So I ignored them.  Is this correct?

Satish


On Mon, May 4, 2009 at 4:33 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 In fact, the much, much simpler approach of bringing up the production ZK
 cluster and simply writing a program to read from the pre-production
 cluster
 and write to the production one is much more sound.  If you can't do that,
 you may need to rethink your processes since they are likely to be delicate
 for other reasons as well.

 On Mon, May 4, 2009 at 2:35 PM, Mahadev Konar maha...@yahoo-inc.com
 wrote:

  So, zookeeper would work fine if you are careful with above but I would
  vote
  against doing this for production since the above is pretty easy to mess
  up.
 



 --
 Ted Dunning, CTO
 DeepDyve

 111 West Evelyn Ave. Ste. 202
 Sunnyvale, CA 94086
 www.deepdyve.com
 858-414-0013 (m)
 408-773-0220 (fax)



Re: Unique Id Generation

2009-04-24 Thread Benjamin Reed
i'm not exactly clear how you use these ideas, but one source of unique 
ids that are longs is the zxid. if you create a znode, every time you 
write to it, you will get a unique zxid in the mzxid member of the stat 
structure. (you get the stat structure back in the response to the setData.)


ben
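Ben's zxid trick can be sketched as a hypothetical fragment, not runnable as-is (it assumes a connected handle zk and an existing znode /idgen, a name chosen for illustration, and elides error handling):

```java
// Every setData() on a fixed znode yields a fresh, monotonically
// increasing zxid in the mzxid member of the returned Stat.
Stat stat = zk.setData("/idgen", new byte[0], -1);
long uniqueId = stat.getMzxid(); // unique across the whole ensemble
```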

Mahadev Konar wrote:

Hi Satish,
 Most of the sequences (versions of nodes) and the sequence flags are ints.
We do have plans to move them to long.
But in your case I can imagine you can split a long into two 32-bit halves:

Parent (which is an int) - child (which is an int).
Now, after you run out of child ephemerals, you should create a node
Parent + 1,
remove Parent,
and then start creating ephemeral children.


(so parent (32 bits) and child (32 bits)) would form a long.

I don't think this should be very hard to implement. There is nothing in
zookeeper (out of the box) currently that would help you out.

Mahadev
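Mahadev's suggestion amounts to packing a 32-bit "parent" epoch and a 32-bit child sequence number into one 64-bit id. A small sketch (hypothetical helper, not part of ZooKeeper):

```java
// Form a single long id from a 32-bit parent epoch (high half)
// and a 32-bit child sequence number (low half).
public class SplitId {
    static long combine(int parent, int child) {
        return ((long) parent << 32) | (child & 0xFFFFFFFFL);
    }

    static int parentOf(long id) {
        return (int) (id >>> 32); // recover the high 32 bits
    }

    static int childOf(long id) {
        return (int) id;          // recover the low 32 bits
    }

    public static void main(String[] args) {
        long id = combine(7, 42);
        System.out.println(parentOf(id)); // 7
        System.out.println(childOf(id));  // 42
    }
}
```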
 
On 4/23/09 4:52 PM, Satish Bhatti cthd2...@gmail.com wrote:


  

We currently use a database sequence to generate unique ids for use by our
application.  I was thinking about using ZooKeeper instead so I can get rid
of the database.  My plan was to use the sequential id from ephemeral nodes,
but looking at the code it appears that this is an int, not a long.  Is
there any other straightforward way to generate ids using ZooKeeper?
Thanks,

Satish



  




RE: Semantics of ConnectionLoss exception

2009-03-26 Thread Benjamin Reed
it is possible for the time to pass without the session expiring. imagine a 
session timeout of 15 seconds. there is a correlated power outage affecting the 
zookeeper servers. let's say it takes 5 minutes to recover power and reboot. 
when the service recovers, it resets expiration times, so when the servers 
start back up and the client reconnects (assuming it is retrying every few 
seconds), the session will be recovered and everything will proceed as normal. 
if the client library generated a session expired event on its own, the client 
could connect with a new session after the service recovers and see its own 
ephemeral nodes for 15 seconds.

ben

From: Nitay [nit...@gmail.com]
Sent: Thursday, March 26, 2009 12:09 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Semantics of ConnectionLoss exception

Why is it done that way? How am I supposed to reliably detect that my
ephemeral nodes are gone? Why not deliver the Session Expired event on the
client side after the right time has passed without communication to any
server?

On Thu, Mar 26, 2009 at 10:58 AM, Mahadev Konar maha...@yahoo-inc.comwrote:

 
  Isn't it the case that the client won't get session expired until it's
  able to connect to a server, right? So what might happen is that the
  client loses connection to the server, the server eventually expires the
  client and deletes ephemerals (notifying all watchers) but the client
  won't see the session expiration until it is able to reconnect to one
  of the servers. ie the client doesn't know it's been expired until it's
  able to reconnect to the cluster, at which point it's notified that it's
  been expired.
 You are right pat!

 mahadev

 
 
 http://hadoop.apache.org/zookeeper/docs/r3.0.1/zookeeperProgrammers.html
  Has this information scattered around, but we should put it in the FAQ
  specifically.
 
  3.0.1 is a bit old, try this for the latest docs:
 
 http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html
 
  - Is the ZooKeeper handle I'm using dead after this event?
  Again no. your handle is valid until you get a session expiry event or
 you
  do a zoo_close on your handle.
 
 
  Thanks
  mahadev
 
 
 
 
  On 3/25/09 5:42 PM, Nitay nit...@gmail.com wrote:
 
  I'm a little unclear about the ConnectionLoss exception as it's
 described in
  the FAQ and would like some clarification.
 
  From the state diagram, http://wiki.apache.org/hadoop/ZooKeeper/FAQ#1,
 there
  are three events that cause a ConnectionLoss:
 
  1) In Connecting state, call close().
  2) In Connected state, call close().
  3) In Connected state, get disconnected.
 
  It's the third one I'm unclear about.
 
  - Does this event happening mean my ephemeral nodes will go away?
  - Is the ZooKeeper handle I'm using dead after this event? Meaning
 that,
  similar to the SessionExpired case, I need to construct a new
 connection
  handle to ZooKeeper and take care of the restarting myself. It seems
 from
  the diagram that this should not be the case. Rather, seeing as the
  disconnected event sends the user back to the Connecting state, my
 handle
  should be fine and the library will keep trying to reconnect to
 ZooKeeper
  internally? I understand my current operation may have failed, what I'm
  asking about is future operations.
 
  Thanks,
  -n
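The three transitions above can be sketched as a toy state machine (an illustration only; the class and method names are made up, and this is not the real client library):

```python
class ConnectionLossError(Exception):
    """Stands in for KeeperException.ConnectionLoss."""
    pass

class ToyHandle:
    """Toy model of the client state diagram: CONNECTING <-> CONNECTED,
    with CLOSED as the only terminal state."""
    def __init__(self):
        self.state = "CONNECTING"

    def connected(self):
        self.state = "CONNECTED"

    def disconnect(self):
        # Event 3: a disconnect fails the in-flight operation, but the
        # handle goes back to CONNECTING -- it is NOT dead; the library
        # keeps trying to reconnect internally.
        if self.state == "CONNECTED":
            self.state = "CONNECTING"
            raise ConnectionLossError("in-flight op lost; handle still usable")

    def close(self):
        # Events 1 and 2: close() from either live state is terminal.
        if self.state in ("CONNECTING", "CONNECTED"):
            self.state = "CLOSED"
            raise ConnectionLossError("handle closed")

h = ToyHandle()
h.connected()
try:
    h.disconnect()
except ConnectionLossError:
    pass
print(h.state)  # CONNECTING -- future operations can succeed on reconnect
```

The point of the model: after event 3 the handle stays alive and ephemeral nodes survive (until the session actually expires), whereas only close or session expiry is terminal.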
 




Re: How large an ensemble can one build with Zookeeper?

2009-03-06 Thread Benjamin Reed
I realize this discussion is over, but i did want to make one quick 
clarification. when we talk about ensembles, we are talking about the 
servers that make up the zookeeper service. we refer to the servers that 
use the zookeeper service as clients. we have systems here that use 
ensembles of five servers to provide zookeeper service to thousands of 
client servers without problem.


ben

Chad Harrington wrote:

Clearly Zookeeper can handle ensembles of a dozen or so servers.  How large
an ensemble can one build with Zookeeper?  100 servers?  10,000 servers?
Are there limitations that make the system unusable at large numbers of
servers?

Thanks,

  




Re: Contrib section (nee Re: A modest proposal for simplifying zookeeper :)

2009-02-27 Thread Benjamin Reed
i'm ready to reevaluate it. i did the contrib for fatjar and it was 
extremely painful! (and that was an extremely simple contrib!) we really 
want to ramp up the contribs and get a bunch of recipe implementations 
in, so we need something that makes it really easy. i'm not a fan of 
maven (they seem to have chosen a convention that is convenient for the 
build tool rather than the developer), but it is widely used and we need 
something better, so i'm certainly considering it.


ben

Anthony Urso wrote:

Speaking of the contrib section, what is the status of ZOOKEEPER-103?
Is it ready to be reevaluated now that 3.0 is out?

Cheers,
Anthony

On Fri, Jan 9, 2009 at 11:58 AM, Mahadev Konar maha...@yahoo-inc.com wrote:
  

Hi Kevin,
 It would be great to have such high level interfaces. It could be
something that you could contribute :) . We haven't had the bandwidth to
provide such interfaces for zookeeper. It would be great to have all such
recipes as a part of contrib package of zookeeper.

mahadev

On 1/9/09 11:44 AM, Kevin Burton bur...@spinn3r.com wrote:



OK so it sounds from the group that there are still reasons to provide
rope in ZK to enable algorithms like leader election.
Couldn't ZK ship higher level interfaces for leader election, mutexes,
semaphores, queues, barriers, etc instead of pushing this on developers?

Then the remaining APIs, configuration, event notification, and discovery,
can be used on a simpler, rope free API.

The rope is what's killing me now :)

Kevin
  





Re: Adding a server to a running ensemble?

2009-02-27 Thread Benjamin Reed
You can do this today by propagating a new configuration to the new and 
old servers and then restarting them. the bounce should take around a 
second and to the clients it should look like a server failure and then 
a reconnect. you shouldn't lose any sessions and everything should just 
recover.


we do have an open issue to do this more on the fly without having to do 
the bounce, but it is behind other priorities in the work queue.


ben

Chad Harrington wrote:

We are investigating Ensemble and a key question came up: How does one add a
server to a running ensemble of Zookeeper servers in a 24/7 environment?  If
I have a 3-server ensemble and traffic grows to the point where I need
another 2 servers, how do I add them without shutting everything down and
restarting?

Thanks for your help,

Chad Harrington
CEO
DataScaler, Inc.
charring...@datascaler.com
201A Ravendale Dr.
Mountain View, CA  94043
Phone: 650-515-3437
Fax: 650-887-1544
  




Re: Contrib section (nee Re: A modest proposal for simplifying zookeeper :)

2009-02-27 Thread Benjamin Reed
just to be clear: i'm not a maven fan, but i'm not sure anything else is 
better. buildr looks better flexibility wise, but i think maven is much 
more popular and mature. with ivy we are still stuck with ant build files.


ben

Patrick Hunt wrote:
Ben, you might want to look at buildr, it recently graduated from the 
apache incubator:

http://buildr.apache.org/

Buildr is a build system for Java applications. We wanted something 
that’s simple and intuitive to use, so we only need to tell it what to 
do, and it takes care of the rest. But also something we can easily 
extend for those one-off tasks, with a language that’s a joy to use. And 
of course, we wanted it to be fast, reliable and have outstanding 
dependency management.


Also Ivy just released version 2.0.

If you have a specific idea and would like to start working on this 
please create a JIRA to discuss/track/vote/etc... Be aware that the 
contribution process, release process and other documentation would have 
to be updated as part of this. For example if we want to push jars to an 
artifact repo the artifacts/pom/etc... would have to be voted on as part 
of the release process.


Patrick

Benjamin Reed wrote:
  
i'm ready to reevaluate it. i did the contrib for fatjar and it was 
extremely painful! (and that was an extremely simple contrib!) we really 
want to ramp up the contribs and get a bunch of recipe implementations 
in, so we need something that makes it really easy. i'm not a fan of 
maven (they seem to have chosen a convention that is convenient for the 
build tool rather than the developer), but it is widely used and we need 
something better, so i'm certainly considering it.


ben

Anthony Urso wrote:


Speaking of the contrib section, what is the status of ZOOKEEPER-103?
Is it ready to be reevaluated now that 3.0 is out?

Cheers,
Anthony

On Fri, Jan 9, 2009 at 11:58 AM, Mahadev Konar maha...@yahoo-inc.com 
wrote:
 
  

Hi Kevin,
 It would be great to have such high level interfaces. It could be
something that you could contribute :) . We haven't had the bandwidth to
provide such interfaces for zookeeper. It would be great to have all 
such

recipes as a part of contrib package of zookeeper.

mahadev

On 1/9/09 11:44 AM, Kevin Burton bur...@spinn3r.com wrote:

   

OK so it sounds from the group that there are still reasons to 
provide

rope in ZK to enable algorithms like leader election.
Couldn't ZK ship higher level interfaces for leader election, mutexes,
semaphores, queues, barriers, etc instead of pushing this on developers?

Then the remaining APIs, configuration, event notification, and 
discovery,

can be used on a simpler, rope free API.

The rope is what's killing me now :)

Kevin
  
  






RE: Recommended session timeout

2009-02-26 Thread Benjamin Reed
just a quick sanity check. are you sure your memory is not overcommitted? in 
other words you aren't swapping. since the gc does a bunch of random memory 
accesses, if you swap at all things will go very slow.

ben

From: Joey Echeverria [joe...@gmail.com]
Sent: Thursday, February 26, 2009 1:31 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Recommended session timeout

I've answered the questions you asked previously below, but I thought
I would open with the actual culprit now that we found it. When I said
loading data before, what I was talking about was sending data via
Thrift to the machine that was getting disconnected from zookeeper.
This turned out to be the problem. Too much data was being sent in a
short span of time, and this caused memory pressure on the heap. This
increased the fraction of the time that the GC had to run to keep up.
During a 143 second test, the GC was running for 33 seconds.

We found this by running tcpdump on both the machine running the
ensemble server and the machine connecting to zookeeper as a client.
We deduced it wasn't a network (lost packet) issue, as we never saw
unmatched packets in our tests. What we did see were long 2-7 second
pauses with no packets being sent. We first attempted to up the
priority of the zookeeper threads to see if that would help. When it
didn't, we started monitoring the GC time. We don't have a work around
yet, other than sending data in smaller batches and using a longer
sessionTimeout.

Thanks for all your help!

-Joey

 As an experiment try increasing the timeout to say 30 seconds and re-run
 your tests. Any change?

30 seconds and higher works fine.

 loading data - could you explain a bit more about what you mean by this?
 If you are able to provide enough information for us to replicate we could
 try it out (also provide info on your ensemble configuration as Mahadev
 suggested)

The ensemble config file looks as follows:

tickTime=2000
dataDir=/data/zk
clientPort=2181
initLimit=5
syncLimit=2
skipACL=true

server.1=server1:2888:3888
...
server.7=server7:2888:3888

 You are referring to startConnect in SendThread?

 We randomly sleep up to 1 second to ensure that the clients don't all storm
 the server(s) after a bounce.

That makes some sense, but it might be worth tweaking that parameter
based on sessionTimeout since 1 second can easily be 10-20% of
sessionTimeout.
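Joey's suggestion (scaling the reconnect jitter to the session timeout, instead of the flat up-to-1-second sleep the thread describes) might look like this. This is a hypothetical policy sketched for illustration, not what the client library actually implements:

```python
import random

def reconnect_jitter_ms(session_timeout_ms, max_fraction=0.1):
    """Hypothetical jitter policy: sleep a random amount capped at a
    fraction of the session timeout (10% here), so clients with short
    timeouts don't burn a large slice of their session just waiting
    to reconnect, while still spreading out the post-bounce storm."""
    cap_ms = session_timeout_ms * max_fraction
    return random.uniform(0, cap_ms)

# With a 5000 ms session timeout the sleep is at most 500 ms (10%),
# versus a flat 1000 ms cap, which would be 20% of the timeout.
delay = reconnect_jitter_ms(5000)
print(0 <= delay <= 500)  # True
```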

 1) configure your test client to connect to 1 server in the ensemble
 2) run the srst command on that server
 3) run your client test
 4) run the stat command on that server
 5) if the test takes some time, run the stat a few times during the test
  to get more data points

The problem doesn't appear to be on the server end as max latency
never went above 5ms. Also, no messages are shown as queued.


RE: Dealing with session expired

2009-02-12 Thread Benjamin Reed
idleness is not a problem. the client library sends heartbeats to keep the 
session alive. the client library will also handle reconnects automatically if 
a server dies.

since session expiration really is a rare catastrophic event (or at least it 
should be), it is probably easiest to deal with it by starting with a fresh 
instance if your session expires.

ben

From: Tom Nichols [tmnich...@gmail.com]
Sent: Thursday, February 12, 2009 11:53 AM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Dealing with session expired

I'm using a timeout of 5000ms.  Now let me ask this:  Suppose all of
my clients are waiting on some external event -- not ZooKeeper -- so
they are all idle and are not touching ZK nodes, nor are they calling
exists, getChildren, etc etc.  Can that idleness cause session expiry?

I'm running a local quorum of 3 nodes.  That is, I have an Ant script
that kicks off 3 java tasks in parallel to run ConsumerPeerMain,
each with its own config file.

Regarding handling of the failure, I suspect I will just have to
reinitialize by creating a new instance of my client(s) that
themselves will have a new ZK instance.  I'm using Spring to wire
everything together, which is why it's particularly difficult to
simply re-create a new ZK instance and pass it to the classes using it
(those classes have no knowledge of each other).  But I _can_ just
pull a freshly-created (prototype) instance from the Spring
application context, which is where a new ZK client will be wired in.

The only ramification there is I have to throw the KeeperException as
a fatal exception rather than letting that client try to re-elect.  Or
maybe add in some logic to say if I can't re-elect, _then_ throw an
exception and consider it fatal.

Thanks guys.

-Tom


On Thu, Feb 12, 2009 at 2:39 PM, Patrick Hunt ph...@apache.org wrote:
 Regardless of frequency Tom's code still has to handle this situation.

 I would suggest that the two classes Tom is referring to in his mail, the
 ones that use ZK client object, should either be able to reinitialize with
 a new zk session, or they themselves should be discarded and new instances
 created using the new session (not sure what makes more sense for his
 archi...)

 Regardless of whether we reuse the session object or create a new one I
 believe the code using the session needs to reinitialize in some way --
 there's been a dramatic break from the cluster.

 As I mentioned, you can decrease the likelihood of expiration by increasing
 the timeout - but the downside is that you are less sensitive to clients
 dying (because their ephemeral nodes don't get deleted till close/expire and
 if you are doing something like leader election among your clients it will
 take longer for the followers to be notified).

 Patrick

 Mahadev Konar wrote:

 Hi Tom,
   The session expired event means that the server expired the client
  and
  that means the watches and ephemerals will go away for that node.

  How are you running your zookeeper quorum? A session expiry event should be
  a really rare event. If you have a quorum of servers it should rarely
 happen.

 mahadev


 On 2/12/09 11:17 AM, Tom Nichols tmnich...@gmail.com wrote:

 So if a session expires, my ephemeral nodes and watches have already
 disappeared?  I suppose creating a new ZK instance with the old
 session ID would not do me any good in that case.  Correct?

 Thanks.
 -Tom



 On Thu, Feb 12, 2009 at 2:12 PM, Mahadev Konar maha...@yahoo-inc.com
 wrote:

 Hi Tom,
  We prefer to discard the zookeeper instance if a session expires.
 Maintaining a one to one relationship between a client handle and a
 session
 makes it much simpler for users to understand the existence and
 disappearance of ephemeral nodes and watches created by a zookeeper
 client.

 thanks
 mahadev


 On 2/12/09 10:58 AM, Tom Nichols tmnich...@gmail.com wrote:

 I've come across the situation where a ZK instance will have an
 expired connection and therefore all operations fail.  Now AFAIK the
 only way to recover is to create  a new ZK instance with the old
 session ID, correct?

 Now, my problem is, the ZK instance may be shared -- not between
 threads -- but maybe two classes in the same thread synchronize on
 different nodes by using different watchers.  So it makes sense that
 one ZK client instance can handle this.  Except that even if I detect
 the session expiration by catching the KeeperException, if I want to
 resume the session, I have to create a new ZK instance and pass it
 to any classes who were previously sharing the same instance.  Does
 this make sense so far?

 Anyway, bottom line is, it would be nice if a ZK instance could itself
 recover a session rather than discarding that instance and creating a
 new one.

 Thoughts?

 Thanks in advance,

 -Tom





RE: ZooKeeper 3.1 and C API/ABI

2009-02-04 Thread Benjamin Reed
you are correct we usually increment the version number on an API breakage. in 
the olden days if you called a function with fewer parameters than expected, a 
null would get passed. if this still happens we are ABI compatible. (i haven't 
tried it though...)

ben


From: Chris Darroch [chr...@pearsoncmg.com]
Sent: Wednesday, February 04, 2009 11:27 AM
To: zookeeper-user@hadoop.apache.org
Subject: ZooKeeper 3.1 and C API/ABI

Hi --

   I notice that 3.1.0 is on its way and it includes ZOOKEEPER-255 which
adds the Stat structure as a parameter to the zoo_set() C call.  This is
a valuable change and I don't want to hold it up.

   However, I thought I should point out that this kind of change
breaks the API and ABI.  For major Apache C projects like the APR,
such breakage is allowed only with a major version number change:

http://apr.apache.org/versioning.html#source

   Following such guidelines, I suppose, the old zoo_*set() functions
would remain as-is until 4.0.0, and parallel zoo_*set2() or
zoo_stat_*set() functions would add the new functionality.


Now, fair do's, ZooKeeper may not care as much as APR or httpd,
since it's mostly a Java project.  At a minimum, though, it would be
excellent if there was compile-time versioning information available
so that external projects could check and, at a bare minimum, fail to
compile if the API/ABI has changed.  APR has some useful guidelines
making compile-time constants (e.g., ZOO_MAJOR_VERSION) available:

http://apr.apache.org/versioning.html#vsncheck

   Speaking personally, one really nice aspect of working with APR
for me is the parallel installation framework.  Again, this might be
overkill for ZK, but I'll just point it out as well:

http://apr.apache.org/versioning.html#parallel

Chris.

--
GPG Key ID: 366A375B
GPG Key Fingerprint: 485E 5041 17E1 E2BB C263  E4DE C8E3 FA36 366A 375B


RE: Delaying 3.1 release by 2 to 3 weeks?

2009-01-16 Thread Benjamin Reed
we should delay. it would be good to try out quotas for a bit before we do the 
release. quotas are also a key part of the release. 3 weeks seems a little long 
though.

ben

From: Mahadev Konar [maha...@yahoo-inc.com]
Sent: Thursday, January 15, 2009 4:32 PM
To: zookeeper-...@hadoop.apache.org
Cc: zookeeper-user@hadoop.apache.org
Subject: Re: Delaying 3.1 release by 2 to 3 weeks?

That was release 3.1 and not 3.2 :)

mahadev


On 1/15/09 4:26 PM, Mahadev Konar maha...@yahoo-inc.com wrote:

 Hi all,
   I needed to get quotas in zookeeper 3.2.0 and wanted to see if delaying
 the release by 2-3 weeks is ok with everyone?
 Here is the jira for it -

 http://issues.apache.org/jira/browse/ZOOKEEPER-231

 Please respond if you have any issues with the delay.

 thanks
 mahadev





RE: Distributed queue: how to ensure no lost items?

2009-01-12 Thread Benjamin Reed
That is a good point. you could put a child znode of queue-X that contains the 
processing history, like who tried to process it and what time they started.

ben


From: Hiram Chirino [chir...@gmail.com]
Sent: Monday, January 12, 2009 8:48 AM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Distributed queue: how to ensure no lost items?

At least once is generally the case in queuing systems unless you can
do a distributed transaction with your consumer.  What comes in handy
in an at least once case, is letting the consumer know that a message
may have 'potentially' already been processed.  That way he can double
check first before he goes off and processes the message again.  But
adding that info in ZK might be more expensive than doing the double
check every time in the consumer anyway.

On Thu, Jan 8, 2009 at 11:42 AM, Benjamin Reed br...@yahoo-inc.com wrote:
 We should expand that section. the current queue recipe guarantees that 
 things are consumed at most once. to guarantee at least once, the consumer creates 
 an ephemeral node queue-X-inprocess to indicate that the node is being 
 processed. once the queue element has been processed the consumer deletes 
 queue-X and queue-X-inprocess (in that order).

 using an ephemeral node means that if a consumer crashes, the *-inprocess 
 node will be deleted allowing the queue elements it was working on to be 
 consumed by someone else. putting the *-inprocess nodes at the same level of 
 the queue-X nodes allows the consumer to get the list of queue elements and 
 the inprocess flags with the same getChildren call. the *-inprocess flag 
 ensures that only one consumer is processing a given item. by deleting 
 queue-X before queue-X-inprocess we make sure that no other consumer will see 
 queue-X as available for consumption after it is processed and before it is 
 deleted.

 this is at least once, because the consumer has a race condition. the consumer 
 may process the item and then crash before it can delete the corresponding 
 queue-X node.

 ben

 -Original Message-
 From: Stuart White [mailto:stuart.whi...@gmail.com]
 Sent: Thursday, January 08, 2009 7:15 AM
 To: zookeeper-user@hadoop.apache.org
 Subject: Distributed queue: how to ensure no lost items?

 I'm interested in using ZooKeeper to provide a distributed
 producer/consumer queue for my distributed application.

 Of course I've been studying the recipes provided for queues, barriers, etc...

 My question is: how can I prevent packets of work from being lost if a
 process crashes?

 For example, following the distributed queue recipe, when a consumer
 takes an item from the queue, it removes the first item znode under
 the queue znode.  But, if the consumer immediately crashes after
 removing the item from the queue, that item is lost.

 Is there a recipe or recommended approach to ensure that no queue
 items are lost in the event of process failure?

 Thanks!




--
Regards,
Hiram

Blog: http://hiramchirino.com

Open Source SOA
http://open.iona.com


RE: Updated NodeWatcher...

2009-01-09 Thread Benjamin Reed
I'm really bad at creating figures, but i've put up something that should be 
informative. (i'm also really bad at apache wiki.) hopefully someone can make 
it more beautiful. i've added the state diagram to the FAQ: 
http://wiki.apache.org/hadoop/ZooKeeper/FAQ

ben

-Original Message-
From: adam.ros...@gmail.com [mailto:adam.ros...@gmail.com] On Behalf Of Adam 
Rosien
Sent: Thursday, January 08, 2009 8:06 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Updated NodeWatcher...

It feels like we need a flowchart, state-chart, or something, so we
can all talk about the same thing. Then people could suggest
abstractions that would essentially put a box around sections of the
diagram. However I feel woefully inadequate at the former :(.

.. Adam

On Thu, Jan 8, 2009 at 4:20 PM, Benjamin Reed br...@yahoo-inc.com wrote:
 For your first issue if an ensemble goes offline and comes back, everything 
 should be fine. it will look to the client just like a server went down. if a 
 session expires, you are correct that the client will not reconnect. this 
 again is on purpose. for the node watcher the session is unimportant, but if 
 the ZooKeeper object is also being used for leader election, for example, you 
 do not want the object to grab a new session automatically.

 For 2) i think pat responded to that one. an async request will always 
 return. if the server goes down after the request is issued, you will get a 
 connection loss error in your callback.

 Your third issue is described with the first.

 ben

 -Original Message-
 From: burtona...@gmail.com [mailto:burtona...@gmail.com] On Behalf Of Kevin 
 Burton
 Sent: Thursday, January 08, 2009 4:02 PM
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Updated NodeWatcher...



 i just found that part of this thread went to my junk folder. can you send
 the URL for the NodeListener?


 Sure... here you go:

 http://pastebin.com/f1e9d3706



 this NodeWatcher is a useful thing. i have a couple of suggestions to
 simplify it:

 1) Construct the NodeWatcher with a ZooKeeper object rather than
 constructing one. Not only does it simplify NodeWatcher, but it also makes
 it so that the ZooKeeper object can be used for other things as well.


  I hear you. I was thinking that this might not be a good idea because
 NodeWatcher can reconnect you to the ensemble if it goes offline.

 I'm not sure if it's a bug or not but once my session expired on the client
 it wouldn't reconnect so I just implemented my own reconnect and session
 expiry.



 2) Use the async API in watchNodeData and watchNodeExists. it simplifies
 the code and the error handling.


 The problem was that according to feedback here an async request might never
 return if the server dies shortly after the request and before it has a
  chance to respond.

 I wanted NodeWatcher to hide as much rope as possible.


 3) You don't need to do a connect() in handleDisconnected(). ZooKeeper
 object will do it automatically for you.


  I can try again if you'd like, but this isn't my experience.  Once the session
 expired and the whole ensemble was offline it wouldn't connect again.

  If it was a transient disconnect I'd see a disconnect event and then a
 quick reconnect.  If it was a long disconnect (with nothing to attach to)
 then ZK won't ever reconnect me.

 I'd like this to be the behavior though...


 There is an old example on sourceforge
 http://zookeeper.wiki.sourceforge.net/ZooKeeperJavaExample that may give
 you some more ideas on how to simplify your code.


  That would be nice; simple is good!

 Kevin


 --
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 AIM/YIM: sfburtonator
 Skype: burtonator
 Work: http://spinn3r.com



RE: Updated NodeWatcher...

2009-01-09 Thread Benjamin Reed
yeah, i was thinking it should be in forrest, but i couldn't figure out where 
to put it. that is why i didn't close the issue.

ben

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org] 
Sent: Friday, January 09, 2009 9:37 AM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Updated NodeWatcher...

Ben this is great, thanks! Do you want to close out this one and point 
to the faq?

https://issues.apache.org/jira/browse/ZOOKEEPER-264

Although IMO this should be moved to the forrest docs.

Patrick


Benjamin Reed wrote:
 I'm really bad at creating figures, but i've put up something that should be 
 informative. (i'm also really bad at apache wiki.) hopefully someone can make 
 it more beautiful. i've added the state diagram to the FAQ: 
 http://wiki.apache.org/hadoop/ZooKeeper/FAQ
 
 ben
 
 -Original Message-
 From: adam.ros...@gmail.com [mailto:adam.ros...@gmail.com] On Behalf Of Adam 
 Rosien
 Sent: Thursday, January 08, 2009 8:06 PM
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Updated NodeWatcher...
 
 It feels like we need a flowchart, state-chart, or something, so we
 can all talk about the same thing. Then people could suggest
 abstractions that would essentially put a box around sections of the
 diagram. However I feel woefully inadequate at the former :(.
 
 .. Adam
 
 On Thu, Jan 8, 2009 at 4:20 PM, Benjamin Reed br...@yahoo-inc.com wrote:
 For your first issue if an ensemble goes offline and comes back, everything 
 should be fine. it will look to the client just like a server went down. if 
 a session expires, you are correct that the client will not reconnect. this 
 again is on purpose. for the node watcher the session is unimportant, but if 
 the ZooKeeper object is also being used for leader election, for example, 
 you do not want the object to grab a new session automatically.

 For 2) i think pat responded to that one. an async request will always 
 return. if the server goes down after the request is issued, you will get a 
 connection loss error in your callback.

 Your third issue is described with the first.

 ben

 -Original Message-
 From: burtona...@gmail.com [mailto:burtona...@gmail.com] On Behalf Of Kevin 
 Burton
 Sent: Thursday, January 08, 2009 4:02 PM
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Updated NodeWatcher...


 i just found that part of this thread went to my junk folder. can you send
 the URL for the NodeListener?

 Sure... here you go:

 http://pastebin.com/f1e9d3706


 this NodeWatcher is a useful thing. i have a couple of suggestions to
 simplify it:

 1) Construct the NodeWatcher with a ZooKeeper object rather than
 constructing one. Not only does it simplify NodeWatcher, but it also makes
 it so that the ZooKeeper object can be used for other things as well.

  I hear you. I was thinking that this might not be a good idea because
 NodeWatcher can reconnect you to the ensemble if it goes offline.

 I'm not sure if it's a bug or not but once my session expired on the client
 it wouldn't reconnect so I just implemented my own reconnect and session
 expiry.


 2) Use the async API in watchNodeData and watchNodeExists. it simplifies
 the code and the error handling.

 The problem was that according to feedback here an async request might never
 return if the server dies shortly after the request and before it has a
  chance to respond.

 I wanted NodeWatcher to hide as much rope as possible.


 3) You don't need to do a connect() in handleDisconnected(). ZooKeeper
 object will do it automatically for you.


  I can try again if you'd like, but this isn't my experience.  Once the session
 expired and the whole ensemble was offline it wouldn't connect again.

  If it was a transient disconnect I'd see a disconnect event and then a
 quick reconnect.  If it was a long disconnect (with nothing to attach to)
 then ZK won't ever reconnect me.

 I'd like this to be the behavior though...


 There is an old example on sourceforge
 http://zookeeper.wiki.sourceforge.net/ZooKeeperJavaExample that may give
 you some more ideas on how to simplify your code.

  That would be nice; simple is good!

 Kevin


 --
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 AIM/YIM: sfburtonator
 Skype: burtonator
 Work: http://spinn3r.com



RE: Distributed queue: how to ensure no lost items?

2009-01-08 Thread Benjamin Reed
We should expand that section. the current queue recipe guarantees that things 
are consumed at most once. to guarantee at least once, the consumer creates an 
ephemeral node queue-X-inprocess to indicate that the node is being processed. 
once the queue element has been processed the consumer deletes queue-X and 
queue-X-inprocess (in that order).

using an ephemeral node means that if a consumer crashes, the *-inprocess node 
will be deleted allowing the queue elements it was working on to be consumed by 
someone else. putting the *-inprocess nodes at the same level of the queue-X 
nodes allows the consumer to get the list of queue elements and the inprocess 
flags with the same getChildren call. the *-inprocess flag ensures that only 
one consumer is processing a given item. by deleting queue-X before 
queue-X-inprocess we make sure that no other consumer will see queue-X as 
available for consumption after it is processed and before it is deleted.

this is at least once, because the consumer has a race condition. the consumer 
may process the item and then crash before it can delete the corresponding 
queue-X node.

ben
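The ordering ben describes can be sketched with an in-memory stand-in for the znode tree. This is a simulation only; it does not use the real ZooKeeper API, and the queue-X naming just follows the recipe's convention:

```python
# In-memory sketch of the at-least-once queue ordering described above.
# znodes models the children of the queue node as a flat set of names.
znodes = set()

def produce(i):
    znodes.add(f"queue-{i}")

def claimable():
    # One getChildren call sees both queue items and in-process flags,
    # because the flags live at the same level as the items.
    return sorted(n for n in znodes
                  if n.startswith("queue-")
                  and not n.endswith("-inprocess")
                  and f"{n}-inprocess" not in znodes)

def consume(item, crash_before_delete=False):
    znodes.add(f"{item}-inprocess")  # ephemeral claim on the item
    # ... process the item here ...
    if crash_before_delete:
        # Consumer dies: its session expires, so the ephemeral flag
        # vanishes -- but queue-X is still there, so another consumer
        # will process the item again.  Hence "at least once".
        znodes.discard(f"{item}-inprocess")
        return
    znodes.discard(item)                  # delete queue-X first...
    znodes.discard(f"{item}-inprocess")   # ...then the in-process flag

produce(0)
consume("queue-0", crash_before_delete=True)
print(claimable())  # ['queue-0'] -- visible for consumption again
```

Deleting queue-X before queue-X-inprocess matters: in the window between the two deletes, other consumers still see the item as claimed, never as available.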

-Original Message-
From: Stuart White [mailto:stuart.whi...@gmail.com] 
Sent: Thursday, January 08, 2009 7:15 AM
To: zookeeper-user@hadoop.apache.org
Subject: Distributed queue: how to ensure no lost items?

I'm interested in using ZooKeeper to provide a distributed
producer/consumer queue for my distributed application.

Of course I've been studying the recipes provided for queues, barriers, etc...

My question is: how can I prevent packets of work from being lost if a
process crashes?

For example, following the distributed queue recipe, when a consumer
takes an item from the queue, it removes the first item znode under
the queue znode.  But, if the consumer immediately crashes after
removing the item from the queue, that item is lost.

Is there a recipe or recommended approach to ensure that no queue
items are lost in the event of process failure?

Thanks!


RE: Can ConnectionLossException be thrown when using multiple hosts?

2009-01-08 Thread Benjamin Reed
just to clarify: you also get ConnectionLossException from synchronous requests 
if the request cannot be sent or no response is received.

ben

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org] 
Sent: Wednesday, January 07, 2009 10:16 AM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Can ConnectionLossException be thrown when using multiple hosts?

There are basically 2 cases where you can see connectionloss:

1) you call an operation on a session that is no longer alive

2) you are disconnected from a server when there are pending async 
operations to that server (you made an async request which has not yet 
completed)
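for the synchronous case, the usual response is to retry the operation, possibly after the client has reconnected to another server in the ensemble; the session itself may still be alive. a minimal retry wrapper might look like this (the ConnectionLoss exception class and the operation callable are stand-ins, not the real client library's types):

```python
# Hedged sketch: retry a synchronous operation that fails with connection
# loss. Safe for idempotent reads; for writes you must first check whether
# the lost request actually took effect before retrying.
import time

class ConnectionLoss(Exception):
    pass

def with_retries(op, attempts=3, backoff=0.01):
    for i in range(attempts):
        try:
            return op()
        except ConnectionLoss:
            if i == attempts - 1:
                raise  # give up and let the caller decide
            time.sleep(backoff * (2 ** i))  # back off before the retry
```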

Patrick

Kevin Burton wrote:
 Can this be thrown when using multiple servers as long as at least 1 of them is
 online?
 Trying to figure out if I should try some type of reconnect if a single
 machine fails instead of failing altogether.
 
 Kevin
 


RE: Sending data during NodeDataChanged or NodeCreated

2009-01-08 Thread Benjamin Reed
if you do a getData(/a, true) and then /a changes, you will get a watch 
event. if /a changes again, you will not get an event. so, if you want to 
monitor /a, you need to do a new getData() after each watch event to 
reregister the watch and get the new value. (re-registering watches on 
reconnect is a different issue. there are no disconnects in this example.)
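the one-shot watch pattern ben describes can be modeled in a few lines (an in-memory stand-in, not the real API): each set_data fires and then discards the registered watchers, so the monitor must call get_data again inside the event handler to keep watching.

```python
# In-memory model of ZooKeeper's one-shot watches: a watch set via get_data
# fires once on the next change and is then gone.
class WatchedNode:
    def __init__(self, value):
        self.value = value
        self._watchers = []

    def get_data(self, watcher=None):
        if watcher is not None:
            self._watchers.append(watcher)
        return self.value

    def set_data(self, value):
        self.value = value
        watchers, self._watchers = self._watchers, []  # one-shot: clear first
        for w in watchers:
            w()

def monitor(node, seen):
    def on_change():
        # re-register by reading again; this also fetches the new value
        seen.append(node.get_data(watcher=on_change))
    seen.append(node.get_data(watcher=on_change))
```

note that a change landing between the event firing and the re-read is simply absorbed: the re-read returns the latest value, which is why the watch guarantees notification of change, not every intermediate value.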

you are correct that zookeeper has some subtle things to watch out for. that is 
why we do not want to add more.

ben

-Original Message-
From: burtona...@gmail.com [mailto:burtona...@gmail.com] On Behalf Of Kevin 
Burton
Sent: Thursday, January 08, 2009 11:58 AM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Sending data during NodeDataChanged or NodeCreated



 while the case in which a value only changes once, can be made slightly
 more optimal by passing the value in the watch event. it is not worth the
  more optimal by passing the value in the watch event. it is not worth the
  risk. in our experience we had an application that was able to make that
  assumption initially, and then later, when the assumption became invalid, it
  was very hard to diagnose.


I don't quite follow.  In this scenario you would be sent two events, with
two pieces of data.

If ZK re-registers watches on reconnect, I don't see how it could be easier
than this.


 we don't want to make zookeeper harder to use by introducing mechanisms
 that only work with subtle assumptions.


I definitely think ZK has too much rope right now.  It's far too easy to
make mistakes and there are lots of subtle undocumented behaviors.
Kevin

-- 
Founder/CEO Spinn3r.com
Location: San Francisco, CA
AIM/YIM: sfburtonator
Skype: burtonator
Work: http://spinn3r.com


RE: Sending data during NodeDataChanged or NodeCreated

2009-01-07 Thread Benjamin Reed
This is the behavior we had when we first implemented the API, and in every 
case where people used the information there was a bug. it is virtually 
impossible to use correctly. In general I'm all for giving people rope, but if 
it always results in death, you should stop handing it out.

In your example, if the ACL changed and then the data changed, we would have a 
security hole if we sent the data with the watch.
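that ACL scenario can be made concrete with a toy model (not real ZooKeeper): if the server pushed the new data inside the watch event, a client whose read permission was revoked between setting the watch and the change would still receive the value; forcing a fresh read lets the ACL check apply.

```python
# Toy illustration of the security hole: ACL changes, then data changes,
# which fires the client's watch. Under get-after-watch the read is denied.
class Node:
    def __init__(self, data, readers):
        self.data = data
        self.readers = set(readers)

    def read(self, client):
        if client not in self.readers:
            raise PermissionError("no read permission for " + client)
        return self.data

node = Node("secret-v1", readers={"kevin"})
node.readers.discard("kevin")   # ACL change revokes kevin's read access
node.data = "secret-v2"         # then the data changes, firing kevin's watch

# push model: the event itself would have carried "secret-v2" to kevin.
# get-after-watch model: kevin must read, and the ACL check rejects him.
try:
    leaked = node.read("kevin")
except PermissionError:
    leaked = None
```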

ben

From: burtona...@gmail.com [burtona...@gmail.com] On Behalf Of Kevin Burton 
[bur...@spinn3r.com]
Sent: Tuesday, January 06, 2009 4:39 PM
To: zookeeper-user@hadoop.apache.org
Subject: Sending data during NodeDataChanged or NodeCreated

So if I understand this correctly, if I receive a NodeDataChanged event and
then attempt to do a read of that node, there's a race condition where the
server could crash, I would be disconnected, and my read would hit an
exception.
Or, the ACL could change and I no longer have permission to read the file
(though I did for a short window).

Now I have to add all this logic to retry.  Are there any other race
conditions, I wonder?

Why not just send the byte[] data during the NodeDataChanged or NodeCreated
event from the server?  This would avoid all these issues.

It's almost certainly what the user wants anyway.

Kevin

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
AIM/YIM: sfburtonator
Skype: burtonator
Work: http://spinn3r.com


RE: Simpler ZooKeeper event interface....

2009-01-07 Thread Benjamin Reed
when you shut down the full ensemble the session isn't expired. when things come 
back up your session will still be active. (it would be bad if the zk service 
could not survive the bounce of an ensemble.)

you are way overthinking this, and i fear you are not helping yourself by 
trying to second-guess with timers. zookeeper is structured such that it can be 
used as ground truth. trying to second-guess it will only bring you headaches.

ben

From: burtona...@gmail.com [burtona...@gmail.com] On Behalf Of Kevin Burton 
[bur...@spinn3r.com]
Sent: Wednesday, January 07, 2009 3:36 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Simpler ZooKeeper event interface



 Here's a good reason for each client to know it's session status
 (connected/disconnected/expired). Depending on the application, if L does
 not have a connected session to the ensemble it may need to be careful how
 it acts.


connected/disconnected events are given out in the current API, but when I
shut down the full ensemble I don't receive a session expired event.

I'm considering implementing my own session expiration by tracking how long
I've been disconnected.

Kevin

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
AIM/YIM: sfburtonator
Skype: burtonator
Work: http://spinn3r.com

