Re: How to reestablish a session
ah i see. you are manually reestablishing the connection to B using the session identifier for the session with A. the problem is that when you call close on a session, it kills the session. we don't really have a way to close a handle without doing that. (actually there is a test class that does it in java.) if you want this, you should open a jira to do a close() without killing the session. why don't you let the client library do the move for you? ben On 11/18/2010 11:51 AM, Gustavo Niemeyer wrote: Hi Ben, that quote is a bit out of context. it was with respect to a proposed change. My point was just that the reasoning why you believed it wasn't a good approach to kill ephemerals in that old instance applies to the new cases I'm pointing out. I wasn't suggesting you agreed with my new reasoning upfront. in your scenario can you explain step 4)? what are you closing? I'm closing the old ZooKeeper handle (zh), after a new one was established with the same client id.
Re: Running cluster behind load balancer
one thing to note: if you are using a DNS load balancer, some load balancers will return the list of resolved addresses in different orders to do the balancing. the zookeeper client will shuffle that list before it is used, so in reality, using a single DNS hostname resolving to all the server addresses will probably work just as well as most DNS-based load balancers. ben On 11/04/2010 08:26 AM, Patrick Hunt wrote: Hi Chang, thanks for the insights, if you have a few minutes would you mind updating the FAQ with some of this detail? http://wiki.apache.org/hadoop/ZooKeeper/FAQ Thanks! Patrick On Thu, Nov 4, 2010 at 6:27 AM, Chang Song tru64...@me.com wrote: Sorry. I made a mistake on the retry timeout in the load balancer section of my answer. The same timeout applies to the load balancer case as well (it depends on the recv timeout). Thank you Chang On Nov 4, 2010, at 10:22 PM, Chang Song wrote: I would like to add some info on this. This may not be very important, but there are subtle differences. Two cases: 1. server hardware failure or kernel panic 2. zookeeper Java daemon process down In the former case, the timeout will be based on the timeout argument in zookeeper_init(), partially based on the ZK heartbeat algorithm. It recognizes a server as down in 2/3 of the timeout, then retries at every timeout. For example, if the timeout is 9000 msec, it first times out in 6 seconds, and retries every 9 seconds. In the latter case (Java process down), since the socket connect immediately returns a refused connection, it can retry immediately. On top of that, - Hardware load balancer: If an ensemble cluster is serviced with a hardware load balancer, the zookeeper client will retry every 2 seconds since we only have one IP to try. - DNS RR: Make sure that nscd on your linux box is off, since it is most likely that the DNS cache returns the same IP many times. This is actually worse than the above since the ZK client will retry the same dead server every 2 seconds for some time.
I think it is best not to use a load balancer for ZK clients since ZK clients will try the next server immediately if the previous one fails for some reason (based on the timeouts above). And this is especially true if your cluster works in a pseudo-realtime environment where tickTime is set very low. Chang On Nov 4, 2010, at 9:17 AM, Ted Dunning wrote: DNS round-robin works as well. On Wed, Nov 3, 2010 at 3:45 PM, Benjamin Reed br...@yahoo-inc.com wrote: it would have to be a TCP-based load balancer to work with ZooKeeper clients, but other than that it should work really well. The clients will be doing heartbeats so the TCP connections will be long lived. The client library does random connection load balancing anyway. ben On 11/03/2010 12:19 PM, Luka Stojanovic wrote: What would be the expected behavior if a three node cluster is put behind a load balancer? It would ease deployment because all clients would be configured to target zookeeper.example.com regardless of actual cluster configuration, but I have the impression that the client-server connection is stateful and that jumping randomly from server to server could bring strange behavior. Cheers, -- Luka Stojanovic lu...@vast.com Platform Engineering
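The arithmetic in Chang's two cases, and the client-side shuffle Ben mentions, can be sketched in a few lines. This is illustrative only; the helper name and server addresses are made up, and this is not the actual client code:

```python
import random

def detection_and_retry(session_timeout_ms):
    """Per the thread: a dead server is noticed at ~2/3 of the
    session timeout, then retried at every full timeout."""
    return session_timeout_ms * 2 // 3, session_timeout_ms

# With a 9000 ms timeout: first failure detected at 6 s, retries every 9 s.
detect, retry = detection_and_retry(9000)

# The client also shuffles the resolved server list before using it,
# so a single DNS name mapping to all servers balances load on its own.
servers = ["zk1:2181", "zk2:2181", "zk3:2181"]
random.shuffle(servers)  # order now varies per client
```

The hardware-load-balancer case collapses the list to one address, which is why the client falls back to a fixed short retry interval instead of moving to the next server.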
Re: Getting a node exists code on a sequence create
yes, i think you have summarized the problem nicely jeremy. i'm curious about your reasoning for running servers in standalone mode and then merging. can you explain that a bit more? thanx ben On 11/01/2010 04:51 PM, Jeremy Stribling wrote: I think this is caused by stupid behavior on our application's part, and the error message just confused me. Here's what I think is happening. 1) 3 servers are up and accepting data, creating sequential znodes under /zkrsm. 2) 1 server dies, the other 2 continue creating sequential znodes. 3) The 1st server restarts, but instead of joining the other 2 servers, it starts an instance by itself, knowing only about the znodes created before it died. [This is a bug in our application -- it is supposed to join the other 2 servers in their cluster.] 4) Another server (#2) dies and restarts, joining the cluster of server #1. It knows about more sequential znodes under /zkrsm than server #1. 5) At this point, trying to create a new znode in the #1-#2 cluster might be problematic, because servers #1 and #2 know about different sets of znodes. If #1 allocates what it thinks is a new sequential number for a new znode, it could be one already used by server #2, and hence a node exists code might be returned. So, in summary, our application is almost certainly using ZooKeeper wrong. Sorry to waste time on the list, but maybe this thread can help someone in the future. (If this explanation sounds totally off-base though, let me know. I'm not 100% certain this is what's happening, but it definitely seems likely.) Thanks, Jeremy On 11/01/2010 02:56 PM, Jeremy Stribling wrote: Yes, every znode in /zkrsm was created with the sequence flag.
We bring up a cluster of three nodes, though we do it in a slightly odd manner to support dynamism: each node starts up as a single-node instance knowing only itself, and then each node is contacted by a coordinator that kills the ZooKeeperServer object and starts a new QuorumPeer object using the full list of three servers. I know this is weird; perhaps this has something to do with it. Other than the weird setup behavior, we are just writing a few sequential records into the system (which all seems to work fine), killing one of the nodes (one that has been elected leader via the standard recommended ZK leader election algorithm), restarting it, and then trying to create more sequential znodes. I'm guessing this is pretty well-tested behavior, so there must be something weird or wrong about the way I have stuff set up. I'm happy to provide whatever logs or snapshots might help someone track this down. Thanks, Jeremy On 11/01/2010 02:42 PM, Benjamin Reed wrote: how were you able to reproduce it? all the znodes in /zkrsm were created with the sequence flag. right? ben On 11/01/2010 02:28 PM, Jeremy Stribling wrote: We were able to reproduce it. A stat on all three servers looks identical: [zk:ip:port(CONNECTED) 0] stat /zkrsm cZxid = 9 ctime = Mon Nov 01 13:01:57 PDT 2010 mZxid = 9 mtime = Mon Nov 01 13:01:57 PDT 2010 pZxid = 12884902218 cversion = 177 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0 dataLength = 0 numChildren = 177 Creating a sequential node through the command line also fails: [zk:ip:port(CONNECTED) 1] create -s /zkrsm/_record testdata Node already exists: /zkrsm/_record One potentially interesting thing is that numChildren above is 177, though I have sequence numbers on that record prefix up to 214 or so. There seem to be some gaps though -- I think ls /zkrsm only shows about 177. Not sure if that's relevant or not. Thanks, Jeremy On 11/01/2010 12:06 PM, Jeremy Stribling wrote: Thanks for the reply.
It happened every time we called create, not just once. More than that, we tried restarting each of the nodes in the system (one-by-one), including the new master, and the problem continued. Unfortunately we cleaned everything up, and it's not in that state anymore. We haven't yet tried to reproduce, but I will try and report back if I can get any cversion info. Jeremy On 11/01/2010 11:33 AM, Patrick Hunt wrote: Hi Jeremy, this sounds like a bug to me, I don't think you should be getting the nodeexists when the sequence flag is set. Looking at the code briefly we use the parent's cversion (incremented each time the child list is changed, added/removed). Did you see this error each time you called create, or just once? If you look at the cversion in the Stat of the znode /zkrsm on each of the servers what does it show? You can use the java CLI to connect to each of your servers and access this information. It would be interesting to see if the data was out of sync only for a short period of time, or forever. Is this repeatable? Ben/Flavio do you see anything here? Patrick On Thu, Oct 28, 2010 at 6:06 PM, Jeremy Stribling st...@nicira.com wrote: Hi everyone, Is there any situation in which creating a new ZK node with the SEQUENCE flag should result
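For readers following Patrick's cversion hint: the sequential suffix is derived from the parent znode's cversion as a zero-padded ten-digit number, which is why two servers with diverged child lists can mint the same name. A rough sketch of the naming scheme, mirroring approximately what the server's request processor does (this is not the actual server code):

```python
def sequential_name(prefix, parent_cversion):
    # The sequence suffix is the parent's child-version (cversion),
    # zero-padded to 10 digits and appended to the requested path.
    return "%s%010d" % (prefix, parent_cversion)

# If server #1's /zkrsm has cversion 177 while znodes with suffixes up
# to ~214 already exist (created via server #2), the next create collides
# with an existing name and returns NodeExists.
name = sequential_name("/zkrsm/_record", 177)
```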
Re: Getting a node exists code on a sequence create
how were you able to reproduce it? all the znodes in /zkrsm were created with the sequence flag. right? ben On 11/01/2010 02:28 PM, Jeremy Stribling wrote: We were able to reproduce it. A stat on all three servers looks identical: [zk:ip:port(CONNECTED) 0] stat /zkrsm cZxid = 9 ctime = Mon Nov 01 13:01:57 PDT 2010 mZxid = 9 mtime = Mon Nov 01 13:01:57 PDT 2010 pZxid = 12884902218 cversion = 177 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0 dataLength = 0 numChildren = 177 Creating a sequential node through the command line also fails: [zk:ip:port(CONNECTED) 1] create -s /zkrsm/_record testdata Node already exists: /zkrsm/_record One potentially interesting thing is that numChildren above is 177, though I have sequence numbers on that record prefix up to 214 or so. There seem to be some gaps though -- I think ls /zkrsm only shows about 177. Not sure if that's relevant or not. Thanks, Jeremy On 11/01/2010 12:06 PM, Jeremy Stribling wrote: Thanks for the reply. It happened every time we called create, not just once. More than that, we tried restarting each of the nodes in the system (one-by-one), including the new master, and the problem continued. Unfortunately we cleaned everything up, and it's not in that state anymore. We haven't yet tried to reproduce, but I will try and report back if I can get any cversion info. Jeremy On 11/01/2010 11:33 AM, Patrick Hunt wrote: Hi Jeremy, this sounds like a bug to me, I don't think you should be getting the nodeexists when the sequence flag is set. Looking at the code briefly we use the parent's cversion (incremented each time the child list is changed, added/removed). Did you see this error each time you called create, or just once? If you look at the cversion in the Stat of the znode /zkrsm on each of the servers what does it show? You can use the java CLI to connect to each of your servers and access this information. It would be interesting to see if the data was out of sync only for a short period of time, or forever.
Is this repeatable? Ben/Flavio do you see anything here? Patrick On Thu, Oct 28, 2010 at 6:06 PM, Jeremy Stribling st...@nicira.com wrote: Hi everyone, Is there any situation in which creating a new ZK node with the SEQUENCE flag should result in a node exists error? I'm seeing this happening after a failure of a ZK node that appeared to have been the master; when the new master takes over, my app is unable to create a new SEQUENCE node under an existing parent node. I'm using Zookeeper 3.2.2. Here's a representative log snippet: -- 3050756 [ProcessThread:-1] TRACE org.apache.zookeeper.server.PrepRequestProcessor - :Psessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0xfffe txntype:unknown /zkrsm/_record 3050756 [ProcessThread:-1] WARN org.apache.zookeeper.server.PrepRequestProcessor - Got exception when processing sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0xfffe txntype:unknown n/a org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:245) at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114) 3050756 [ProcessThread:-1] DEBUG org.apache.zookeeper.server.quorum.CommitProcessor - Processing request:: sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0x5027e txntype:-1 n/a 3050756 [ProcessThread:-1] DEBUG org.apache.zookeeper.server.quorum.Leader - Proposing:: sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0x5027e txntype:-1 n/a 3050756 [SyncThread:0] TRACE org.apache.zookeeper.server.quorum.Leader - Ack zxid: 0x5027e 3050757 [SyncThread:0] TRACE org.apache.zookeeper.server.quorum.Leader - outstanding proposal: 0x5027e 3050757 [SyncThread:0] TRACE org.apache.zookeeper.server.quorum.Leader - outstanding proposals all 3050757 [SyncThread:0] DEBUG org.apache.zookeeper.server.quorum.Leader - Count for zxid: 0x5027e is 1 3050757 [FollowerHandler-/172.16.0.28:48776] TRACE
org.apache.zookeeper.server.quorum.Leader - Ack zxid: 0x5027e 3050757 [FollowerHandler-/172.16.0.28:48776] TRACE org.apache.zookeeper.server.quorum.Leader - outstanding proposal: 0x5027e 3050757 [FollowerHandler-/172.16.0.28:48776] TRACE org.apache.zookeeper.server.quorum.Leader - outstanding proposals all 3050757 [FollowerHandler-/172.16.0.28:48776] DEBUG org.apache.zookeeper.server.quorum.Leader - Count for zxid: 0x5027e is 2 3050757 [FollowerHandler-/172.16.0.28:48776] DEBUG org.apache.zookeeper.server.quorum.CommitProcessor - Committing request:: sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0x5027e txntype:-1 n/a 3050757 [CommitProcessor:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor - Processing request:: sessionid:0x12bf518350f0001 type:create
Re: Is it possible to read/write a ledger concurrently
in hedwig one hub does both the publish and subscribe for a given topic and is therefore the only process reading and writing from/to a ledger, so there isn't an issue. The ReadAheadCache does read-ahead :) it is so that we can minimize latency when doing sequential reads. ben On 10/21/2010 11:30 PM, amit jaiswal wrote: Hi, How does Hedwig handle this scenario? Since only one of the hubs has ownership of a topic, the same hub is able to serve both publish and subscribe requests concurrently. Is my understanding correct? Also, what is the purpose of the ReadAheadCache class in Hedwig? Is it used somewhere for this concurrent read/write problem? -regards Amit - Original Message From: Benjamin Reed br...@yahoo-inc.com To: zookeeper-user@hadoop.apache.org Sent: Fri, 22 October, 2010 11:09:07 AM Subject: Re: Is it possible to read/write a ledger concurrently currently program1 can read and write to an open ledger, but program2 must wait for the ledger to be closed before doing the read. the problem is that program2 needs to know the last valid entry in the ledger. (there may be entries that may not yet be valid.) for performance reasons, only program1 knows the end. so you need a way to propagate that information. we have talked about a way to push the last entry into the bookkeeper handle. flavio was working on it, but i don't think it has been implemented. ben On 10/21/2010 10:22 PM, amit jaiswal wrote: Hi, In the BookKeeper documentation, the sample program creates a ledger, writes some entries and then *closes* the ledger. Then a client program opens the ledger, and reads the entries from it. Is it possible for program1 to write to a ledger, and program2 to read from the ledger at the same time? In the BookKeeper code, if a client tries to read from a ledger which is not closed (as per its metadata in zk), then a recovery process is started to check for consistency. Waiting for the ledger to get closed can introduce a lot of latency at the client side.
Can somebody explain this functionality? -regards Amit
Re: invalid acl for ZOO_CREATOR_ALL_ACL
which scheme are you using? ben On 10/18/2010 11:57 PM, FANG Yang wrote: 2010/10/19 FANG Yang fa...@douban.com hi, all I have a simple zk client written in C, which is attachment #1. When i use ZOO_CREATOR_ALL_ACL, the ret code of zoo_create is -114 (Invalid ACL specified, as defined in zookeeper.h), but after i replace it with ZOO_OPEN_ACL_UNSAFE, it works. The ZooKeeper Programmer's Guide mentions that CREATOR_ALL_ACL grants all permissions to the creator of the node. The creator must have been authenticated by the server (for example, using the “digest” scheme) before it can create nodes with this ACL. I call zoo_add_auth according to the func testAuth of TestClient.cc in src/c/tests in my source code, but it doesn't work, the ret code is still -114. Would you guys do me a favor, please? -- 方阳 FANG Yang 开发工程师 Software Engineer Douban Inc. msn:franklin.f...@hotmail.com gtalk:franklin.f...@gmail.com skype:franklin.fang No.14 Jiuxianqiao Road, Area 51 A1-1-2016, Beijing 100016, China 北京市酒仙桥路14号51楼A1区1门2016,100016 And my zk version is 3.2.2, mode is standalone.
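For reference on the "digest" scheme Ben asks about: the ACL id takes the form user:base64(SHA-1(user:password)), and zoo_add_auth must have been called with "digest" and "user:password" on the same handle before the create. A sketch of the digest computation (illustrative; the user name and password are made up, and this is not the server's actual authentication-provider code):

```python
import base64
import hashlib

def digest_acl_id(user, password):
    # ACL id for the "digest" scheme: "user:base64(SHA-1(user:password))"
    raw = ("%s:%s" % (user, password)).encode("utf-8")
    digest = base64.b64encode(hashlib.sha1(raw).digest()).decode("ascii")
    return "%s:%s" % (user, digest)

acl_id = digest_acl_id("bob", "secret")
```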
Re: zxid integer overflow
we should put in a test for that. it is certainly a plausible scenario. in theory it will just flow into the next epoch and everything will be fine, but we should try it and see. ben On 10/19/2010 11:33 AM, Sandy Pratt wrote: Just as a thought experiment, I was pondering the following: ZK stamps each change to its managed state with a zxid (http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperInternals.html). That ID consists of a 64 bit number in which the upper 32 bits are the epoch, which changes when the leader does, and the bottom 32 bits are a counter, which is incremented by the leader with every change. If 1000 changes are made to ZK state each second (which is 1/20th of the peak rate advertised), then the counter portion will roll over in 2^32 / (86400 * 1000) = 49 days. Now, assuming that my math is correct, is this an actual concern? For example, if I'm using ZK to provide locking for a key value store that handles transactions at about that rate, am I setting myself up for failure? Thanks, Sandy
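Sandy's numbers check out; a quick sketch of the zxid split and the rollover estimate (the example zxid value is made up):

```python
def zxid_parts(zxid):
    # upper 32 bits are the epoch, lower 32 bits the per-epoch counter
    return zxid >> 32, zxid & 0xFFFFFFFF

epoch, counter = zxid_parts(0x500000007)  # epoch 5, 7th change in it

# at 1000 changes/sec the 32-bit counter wraps in roughly 49.7 days
days_to_rollover = 2**32 / (1000.0 * 86400)
```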
Re: What does this mean?
how big is your data? you may be running into the problem where it takes too long to do the state transfer and times out. check the initLimit and the size of your data. ben On 10/10/2010 08:57 AM, Avinash Lakshman wrote: Thanks Ben. I am not mixing processes of different clusters. I just double checked that. I have ZK deployed in a 5 node cluster and I have 20 observers. I just started the 5 node cluster w/o starting the observers. I still see the same issue. Now my cluster won't start up. So what is the correct workaround to get this going? How can I find out who the leader is and who the followers are to get more insight? Thanks A On Sun, Oct 10, 2010 at 8:33 AM, Benjamin Reed br...@yahoo-inc.com wrote: this usually happens when a follower closes its connection to the leader. it is usually caused by the follower shutting down or failing. you may get further insight by looking at the follower logs. you should really run with timestamps on so that you can correlate the logs of the leader and follower. one thing that is strange is the wide divergence between the zxid of the follower and leader. are you mixing processes of different clusters? ben From: Avinash Lakshman [avinash.laksh...@gmail.com] Sent: Sunday, October 10, 2010 8:18 AM To: zookeeper-user Subject: What does this mean? I see this exception and the servers not doing anything.
java.io.IOException: Channel eof at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:630) ERROR - 124554051584(higestZxid) 21477836646(next log) for type -11 WARN - Sending snapshot last zxid of peer is 0xe zxid of leader is 0x1e WARN - Sending snapshot last zxid of peer is 0x18 zxid of leader is 0x1eg WARN - Sending snapshot last zxid of peer is 0x5002dc766 zxid of leader is 0x1e WARN - Sending snapshot last zxid of peer is 0x1c zxid of leader is 0x1e ERROR - Unexpected exception causing shutdown while sock still open java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) at java.net.SocketOutputStream.write(SocketOutputStream.java:136) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:78) at java.io.DataOutputStream.writeInt(DataOutputStream.java:180) at org.apache.jute.BinaryOutputArchive.writeInt(BinaryOutputArchive.java:55) at org.apache.zookeeper.data.StatPersisted.serialize(StatPersisted.java:116) at org.apache.zookeeper.server.DataNode.serialize(DataNode.java:167) at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:967) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982) at org.apache.zookeeper.server.DataTree.serialize(DataTree.java:1031) at org.apache.zookeeper.server.util.SerializeUtils.serializeSnapshot(SerializeUtils.java:104) at org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:426) at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:331) WARN - *** GOODBYE /10.138.34.212:33272 Avinash
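Ben's initLimit advice amounts to: initLimit (expressed in ticks) must cover the time to stream the full snapshot to a (re)joining follower. A back-of-the-envelope helper; the snapshot size and throughput numbers below are hypothetical, not from this thread:

```python
import math

def min_init_limit(snapshot_bytes, throughput_bytes_per_s, tick_ms):
    """Smallest initLimit (in ticks) that covers the snapshot transfer."""
    transfer_s = snapshot_bytes / throughput_bytes_per_s
    return math.ceil(transfer_s * 1000 / tick_ms)

# e.g. a 4 GB snapshot over ~100 MB/s with tickTime=2000 needs at least
# ~20 ticks of initLimit (plus headroom for deserialization and load).
ticks = min_init_limit(4 * 10**9, 100 * 10**6, 2000)
```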
Re: Question on production readiness, deployment, data of BookKeeper / Hedwig
your guess is correct :) for bookkeeper and hedwig we released early to do the development in public. originally we developed bookkeeper as a distributed write-ahead log for the NameNode in HDFS, but while we were able to get a proof of concept going, the structure of the code of the NameNode makes it difficult to integrate well. we are currently working on fixing the write-ahead layer of the NameNode, which is taking a lot of time. in the meantime we applied bookkeeper to pub/sub and came up with hedwig, which is where most of our efforts are focused while the slow process of pushing changes to the NameNode proceeds. ben On 10/08/2010 02:32 PM, Jake Mannix wrote: Hi Ben, To follow up with this question, which seems to be asking primarily about Hedwig (and I guess the answer is: it's not in production yet, anywhere), with one more about BookKeeper: is BookKeeper used in production as a WAL (or for any other use) anywhere? If so, for what uses? Any info (even anecdotal) would be great! -jake On Thu, Oct 7, 2010 at 9:15 AM, Benjamin Reed br...@yahoo-inc.com wrote: hi amit, sorry for the late response. this week has been crunch time for a lot of different things. here are your answers: production 1. it is still in the prototype phase. we are evaluating different aspects, but there is still some work to do to make it production ready. we also need to get an engineering team to sign up to stand behind it. 2. it's a generic pub/sub message bus. in some sense it is really a datacenter solution with extensions for multi-datacenter operation, so it is perfectly reasonable to use it in a single datacenter setting. 3. yeah, we have removed the hw.bash script. it had some hardcoded assumptions and was a swiss army knife on steroids. we have been breaking it up into simpler scripts. 4. session expiry really represents a fundamental connectivity problem, so both bk and hedwig restart the component that gets the expired session error. data 1. yes. 2.
once all subscribers have consumed a message there is a background process that cleans it up. 3. yes, there is a replication factor and we ensure replication on writes, and there is a recovery tool to recover bookies that fail. we don't have to worry about conflicts because there is only a single writer for a given ledger. because of this we do not need to do quorum reads. documentation yes, this is something we need to work on. i'll see if i can push out some of our hello world applications. we'd also like to put a JMS API on top so that the API is more familiar (and documented :). i don't want to delay the answers to your other questions, so let me answer that HedwigSubscriber is the class for clients. the other classes are internal. (for cross-datacenter updates, hubs use a special kind of subscription.) ben On 10/05/2010 10:32 PM, amit jaiswal wrote: Hi, In the Hedwig talk (http://vimeo.com/13282102), it was mentioned that the primary use case for Hedwig comes from the distributed key-value store PNUTS in Yahoo!, but it was also said that the work is new. Could you please comment on the following: Production readiness / Deployment 1. What is the production readiness of Hedwig / BookKeeper? Is it being used anywhere (like in PNUTS)? 2. Is Hedwig designed for use as a generic message bus or only for multi-datacenter operations? 3. Hedwig installation and deployment is done through a script hw.bash, but that is difficult to use, especially in a production environment. Are there any other packages available that can simplify the deployment of hedwig? 4. How does BK/Hedwig handle zookeeper session expiry? Data Deletion, Handling data loss, Quorum 1. Does BookKeeper support deletion of old log entries which have been consumed? 2. How does Hedwig handle the case when all subscribers have consumed all the messages? In the talk, it was said that a subscriber can come back after hours, days or weeks.
Is there any data retention / expiration policy for the data that is published? 3. How does Hedwig handle data loss? There is a replication factor, and a write operation must be accepted by a majority of the bookies, but how are data conflicts handled? Is there any possibility of data conflict at all? Is the replication only for recovery? When the hub is reading data from bookies, does it read from all the bookies to satisfy a quorum read? Code What is the difference between PubSubServer, HedwigSubscriber, and HedwigHubSubscriber? Is there any HelloWorld program that simply illustrates how to instantiate a hedwig client, and publish/consume messages? (The HedwigBenchmark class is helpful, but I was looking for something like API documentation.) -regards Amit
Re: Question on production readiness, deployment, data of BookKeeper / Hedwig
hi amit, sorry for the late response. this week has been crunch time for a lot of different things. here are your answers: production 1. it is still in prototype phase. we are evaluating different aspects, but there is still some work to do to make it production ready. we also need to get an engineering team to signup to stand behind it. 2. it's a generic pub/sub message bus. in some sense it is really a datacenter solution with extensions for multi-data center operation, so it is perfectly reasonable to use it in a single datacenter setting. 3. yeah, we have removed the hw.bash script. it had some hardcoded assumptions and was a swiss army knife on steroids. he have been breaking it up into simpler scripts. 4. session expiry really represents a fundamental connectivity problem, so both bk and hedwig restart the component that gets the expired session errror. data 1. yes. 2. once all subscribers have consumed a message there is a background process that cleans it up. 3. yes there is a replication factor and we ensure replication on writes and there is a recovery tool to recover bookies that fail. we don't have to worry about conflicts because there is only a single writer for a give ledger. because of this we do not need to do quorum reads. documentation yes, this is something we need to work on. i'll see if i can push out some of our hello world applications. we'd also like to put a JMS API on top so that the API is more familiar (and documented :). i don't want to delay the answers to your other questions, so let me answer that HedwigSubscriber is the class for clients. the other classes are internal. (for cross data center hubs use a special kind of subscriptions to do cross data center updates.) ben On 10/05/2010 10:32 PM, amit jaiswal wrote: Hi, In Hedwig talk (http://vimeo.com/13282102), it was mentioned that the primary use case for Hedwig comes from the distributed key-value store PNUTS in Yahoo!, but also said that the work is new. 
Could you please comment on the following: Production readiness / Deployment 1. What is the production readiness of Hedwig / BookKeeper? Is it being used anywhere (like in PNUTS)? 2. Is Hedwig designed to be used as a generic message bus, or only for multi-datacenter operations? 3. Hedwig installation and deployment is done through a script, hw.bash, but that is difficult to use, especially in a production environment. Are there any other packages available that can simplify the deployment of hedwig? 4. How does BK/Hedwig handle zookeeper session expiry? Data Deletion, Handling data loss, Quorum 1. Does BookKeeper support deletion of old log entries which have been consumed? 2. How does Hedwig handle the case when all subscribers have consumed all the messages? In the talk, it was said that a subscriber can come back after hours, days or weeks. Is there any data retention / expiration policy for the data that is published? 3. How does Hedwig handle data loss? There is a replication factor, and a write operation must be accepted by a majority of the bookies, but how are data conflicts handled? Is there any possibility of data conflict at all? Is the replication only for recovery? When the hub is reading data from bookies, does it read from all the bookies to satisfy a quorum read? Code What is the difference between PubSubServer, HedwigSubscriber, and HedwigHubSubscriber? Is there any HelloWorld program that simply illustrates how to instantiate a hedwig client and publish/consume messages? (The HedwigBenchmark class is helpful, but I was looking for something like API documentation.) -regards Amit
Re: Zookeeper on 60+Gb mem
you will need to time how long it takes to read all that state back in and adjust the initLimit accordingly. it will probably take a while to pull all that data into memory. ben On 10/05/2010 11:36 AM, Avinash Lakshman wrote: I have run it over 5 GB of heap with over 10M znodes. We will definitely run it with over 64 GB of heap. Technically I do not see any limitation. However I will let the experts chime in. Avinash On Tue, Oct 5, 2010 at 11:14 AM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Maarten, I definitely know of a group which uses around 3GB of memory heap for zookeeper but never heard of someone with such huge requirements. I would say it definitely would be a learning experience with such high memory, which I definitely think would be very very useful for others in the community as well. Thanks mahadev On 10/5/10 11:03 AM, Maarten Koopmans maar...@vrijheid.net wrote: Hi, I just wondered: has anybody ever run zookeeper to the max on a 68GB quadruple extra large high memory EC2 instance? With, say, 60GB allocated or so? Because EC2 with EBS is a nice way to grow your zookeeper cluster (data on the EBS volumes, upgrade as your memory utilization grows) - I just wonder what the limits are there, or if I am going where angels fear to tread... --Maarten
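For a data set that size, the relevant knobs are tickTime and initLimit in zoo.cfg; a hedged sketch of the sizing (the numbers below are illustrative assumptions, not recommendations - measure your actual snapshot load time first):

```properties
# zoo.cfg sketch -- values are illustrative only; measure first.
tickTime=2000
# initLimit is the number of ticks a follower may take to connect to
# the leader and sync, which includes loading the snapshot. With tens
# of GB of state, loading can take minutes; e.g. an assumed 120s load
# time / 2000ms tick = 60 ticks, padded for safety:
initLimit=90
# syncLimit only covers steady-state lag, so it can stay small.
syncLimit=10
```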
Re: ZK compatability
we should also point out that our ops guys here at yahoo! don't like the break-at-major clause. i imagine when we do the next major release we will try to be one release backwards compatible. (although we shouldn't promise it until we successfully do it once :) ben On 09/30/2010 10:29 AM, Patrick Hunt wrote: Historically major releases can have non-bw-compatible changes. However if you look back through the release history you'll see that the last time that happened was oct 2008, when we moved the project from sourceforge to apache. Patrick On Tue, Sep 28, 2010 at 11:37 AM, Jun Rao jun...@gmail.com wrote: What about major releases going forward? Thanks, Jun On Mon, Sep 27, 2010 at 10:32 PM, Patrick Hunt ph...@apache.org wrote: In general yes, minor and bug fix releases are fully backward compatible. Patrick On Sun, Sep 26, 2010 at 9:11 PM, Jun Rao jun...@gmail.com wrote: Hi, Does ZK support (and plan to support in the future) backward compatibility (so that a new client can talk to an old server and vice versa)? Thanks Jun
Re: closing session on socket close vs waiting for timeout
the problem is that followers don't track session timeouts. they track when they last heard from the sessions that are connected to them and they periodically propagate this information to the leader. the leader is the one that expires the session. your technique only works when the client is connected to the leader. one thing you can do is generate a close request for the socket and push that through the system. that will cause it to get propagated through the followers and processed at the leader. it would also allow you to get your functionality without touching the processing pipeline. the thing that worries me about this functionality in general is that network anomalies can cause a whole raft of sessions to get expired in this way. for example, you have 3 servers with load spread well; there is a networking glitch that causes clients to abandon a server; suddenly 1/3 of your clients will get expired sessions. ben On 09/10/2010 12:17 PM, Fournier, Camille F. [Tech] wrote: Ben, could you explain a bit more why you think this won't work? I'm trying to decide if I should put in the work to take the POC I wrote and complete it, but I don't really want to waste my time if there's a fundamental reason it's a bad idea. Thanks, Camille -Original Message- From: Benjamin Reed [mailto:br...@yahoo-inc.com] Sent: Wednesday, September 08, 2010 4:03 PM To: zookeeper-user@hadoop.apache.org Subject: Re: closing session on socket close vs waiting for timeout unfortunately, that only works on the standalone server. ben On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote: This would be the ideal solution to this problem I think. Poking around the (3.3) code to figure out how hard it would be to implement, I figure one way to do it would be to modify the session timeout to the min session timeout and touch the connection before calling close when you get certain exceptions in NIOServerCnxn.doIO. 
I did this (removing the code in touch session that returns if the tickTime is greater than the expire time) and it worked (in the standalone server anyway). Interesting solution, or total hack that will not work beyond the most basic test case? C (forgive lack of actual code in this email) -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Tuesday, September 07, 2010 1:11 PM To: zookeeper-user@hadoop.apache.org Cc: Benjamin Reed Subject: Re: closing session on socket close vs waiting for timeout This really is, just as Ben says, a problem of false positives and false negatives in detecting session expiration. On the other hand, the current algorithm isn't really using all the information available. The current algorithm is using time since the last client-initiated heartbeat. The new proposal is somewhat worse in that it proposes to use just the boolean has-TCP-disconnect-happened. Perhaps it would be better to use multiple features in order to decrease both false positives and false negatives. For instance, I could imagine that we use the following features: - time since last client heartbeat or disconnect or reconnect - what was the last event? (a heartbeat or a disconnect or a reconnect) Then the expiration algorithm could use a relatively long time since last heartbeat and a relatively short time since last disconnect to mark a session as disconnected. Wouldn't this avoid expiration during GC and cluster partition and cause expiration quickly after a client disconnect? On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt ph...@apache.org wrote: That's a good point, however with suitable documentation, warnings and such it seems like a reasonable feature to provide for those users who require it. Used in moderation it seems fine to me. Perhaps we also make it configurable at the server level for those administrators/ops who don't want to deal with it (disable the feature entirely, or only enable on particular servers, etc...). 
Patrick On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reedbr...@yahoo-inc.com wrote: if this mechanism were used very often, we would get a huge number of session expirations when a server fails. you are trading fast error detection for the ability to tolerate temporary network and server outages. to be honest this seems like something that in theory sounds like it will work in practice, but once deployed we start getting session expirations for cases that we really do not want or expect. ben On 09/01/2010 12:47 PM, Patrick Hunt wrote: Ben, in this case the session would be tied directly to the connection, we'd explicitly deny session re-establishment for this session type (so 4 would fail). Would that address your concern, others? Patrick On 09/01/2010 10:03 AM, Benjamin Reed wrote: i'm a bit skeptical that this is going to work out properly. a server may receive a socket reset even though the client is still alive: 1) client sends a request to a server 2) client is partitioned from the server 3
Re: closing session on socket close vs waiting for timeout
ah dang, i should have said generate a close request for the session and push that through the system. ben On 09/10/2010 01:01 PM, Benjamin Reed wrote: the problem is that followers don't track session timeouts. they track when they last heard from the sessions that are connected to them and they periodically propagate this information to the leader. the leader is the one that expires the session. your technique only works when the client is connected to the leader. one thing you can do is generate a close request for the socket and push that through the system. that will cause it to get propagated through the followers and processed at the leader. it would also allow you to get your functionality without touching the processing pipeline. the thing that worries me about this functionality in general is that network anomalies can cause a whole raft of sessions to get expired in this way. for example, you have 3 servers with load spread well; there is a networking glitch that cause clients to abandon a server; suddenly 1/3 of your clients will get expired sessions. ben On 09/10/2010 12:17 PM, Fournier, Camille F. [Tech] wrote: Ben, could you explain a bit more why you think this won't work? I'm trying to decide if I should put in the work to take the POC I wrote and complete it, but I don't really want to waste my time if there's a fundamental reason it's a bad idea. Thanks, Camille -Original Message- From: Benjamin Reed [mailto:br...@yahoo-inc.com] Sent: Wednesday, September 08, 2010 4:03 PM To: zookeeper-user@hadoop.apache.org Subject: Re: closing session on socket close vs waiting for timeout unfortunately, that only works on the standalone server. ben On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote: This would be the ideal solution to this problem I think. 
Poking around the (3.3) code to figure out how hard it would be to implement, I figure one way to do it would be to modify the session timeout to the min session timeout and touch the connection before calling close when you get certain exceptions in NIOServerCnxn.doIO. I did this (removing the code in touch session that returns if the tickTime is greater than the expire time) and it worked (in the standalone server anyway). Interesting solution, or total hack that will not work beyond most basic test case? C (forgive lack of actual code in this email) -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Tuesday, September 07, 2010 1:11 PM To: zookeeper-user@hadoop.apache.org Cc: Benjamin Reed Subject: Re: closing session on socket close vs waiting for timeout This really is, just as Ben says a problem of false positives and false negatives in detecting session expiration. On the other hand, the current algorithm isn't really using all the information available. The current algorithm is using time since last client initiated heartbeat. The new proposal is somewhat worse in that it proposes to use just the boolean has-TCP-disconnect-happened. Perhaps it would be better to use multiple features in order to decrease both false positives and false negatives. For instance, I could imagine that we use the following features: - time since last client hearbeat or disconnect or reconnect - what was the last event? (a heartbeat or a disconnect or a reconnect) Then the expiration algorithm could use a relatively long time since last heartbeat and a relatively short time since last disconnect to mark a session as disconnected. Wouldn't this avoid expiration during GC and cluster partition and cause expiration quickly after a client disconnect? 
On Mon, Sep 6, 2010 at 11:26 PM, Patrick Huntph...@apache.orgwrote: That's a good point, however with suitable documentation, warnings and such it seems like a reasonable feature to provide for those users who require it. Used in moderation it seems fine to me. Perhaps we also make it configurable at the server level for those administrators/ops who don't want to deal with it (disable the feature entirely, or only enable on particular servers, etc...). Patrick On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reedbr...@yahoo-inc.comwrote: if this mechanism were used very often, we would get a huge number of session expirations when a server fails. you are trading fast error detection for the ability to tolerate temporary network and server outages. to be honest this seems like something that in theory sounds like it will work in practice, but once deployed we start getting session expirations for cases that we really do not want or expect. ben On 09/01/2010 12:47 PM, Patrick Hunt wrote: Ben, in this case the session would be tied directly to the connection, we'd explicitly deny session re-establishment for this session type (so 4 would fail). Would that address your concern, others? Patrick On 09/01/2010 10:03 AM, Benjamin Reed wrote: i'm a bit skeptical that this is going to work out properly. a server may
Re: closing session on socket close vs waiting for timeout
unfortunately, that only works on the standalone server. ben On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote: This would be the ideal solution to this problem I think. Poking around the (3.3) code to figure out how hard it would be to implement, I figure one way to do it would be to modify the session timeout to the min session timeout and touch the connection before calling close when you get certain exceptions in NIOServerCnxn.doIO. I did this (removing the code in touch session that returns if the tickTime is greater than the expire time) and it worked (in the standalone server anyway). Interesting solution, or total hack that will not work beyond most basic test case? C (forgive lack of actual code in this email) -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Tuesday, September 07, 2010 1:11 PM To: zookeeper-user@hadoop.apache.org Cc: Benjamin Reed Subject: Re: closing session on socket close vs waiting for timeout This really is, just as Ben says a problem of false positives and false negatives in detecting session expiration. On the other hand, the current algorithm isn't really using all the information available. The current algorithm is using time since last client initiated heartbeat. The new proposal is somewhat worse in that it proposes to use just the boolean has-TCP-disconnect-happened. Perhaps it would be better to use multiple features in order to decrease both false positives and false negatives. For instance, I could imagine that we use the following features: - time since last client hearbeat or disconnect or reconnect - what was the last event? (a heartbeat or a disconnect or a reconnect) Then the expiration algorithm could use a relatively long time since last heartbeat and a relatively short time since last disconnect to mark a session as disconnected. Wouldn't this avoid expiration during GC and cluster partition and cause expiration quickly after a client disconnect? 
On Mon, Sep 6, 2010 at 11:26 PM, Patrick Huntph...@apache.org wrote: That's a good point, however with suitable documentation, warnings and such it seems like a reasonable feature to provide for those users who require it. Used in moderation it seems fine to me. Perhaps we also make it configurable at the server level for those administrators/ops who don't want to deal with it (disable the feature entirely, or only enable on particular servers, etc...). Patrick On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reedbr...@yahoo-inc.com wrote: if this mechanism were used very often, we would get a huge number of session expirations when a server fails. you are trading fast error detection for the ability to tolerate temporary network and server outages. to be honest this seems like something that in theory sounds like it will work in practice, but once deployed we start getting session expirations for cases that we really do not want or expect. ben On 09/01/2010 12:47 PM, Patrick Hunt wrote: Ben, in this case the session would be tied directly to the connection, we'd explicitly deny session re-establishment for this session type (so 4 would fail). Would that address your concern, others? Patrick On 09/01/2010 10:03 AM, Benjamin Reed wrote: i'm a bit skeptical that this is going to work out properly. a server may receive a socket reset even though the client is still alive: 1) client sends a request to a server 2) client is partitioned from the server 3) server starts trying to send response 4) client reconnects to a different server 5) partition heals 6) server gets a reset from client at step 6 i don't think you want to delete the ephemeral nodes. ben On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote: Yes that's right. Which network issues can cause the socket to close without the initiating process closing the socket? 
In my limited experience in this area network issues were more prone to leave dead sockets open rather than vice versa so I don't know what to look out for. Thanks, Camille -Original Message- From: Dave Wright [mailto:wrig...@gmail.com] Sent: Tuesday, August 31, 2010 1:14 PM To: zookeeper-user@hadoop.apache.org Subject: Re: closing session on socket close vs waiting for timeout I think he's saying that if the socket closes because of a crash (i.e. not a normal zookeeper close request) then the session stays alive until the session timeout, which is of course true since ZK allows reconnection and resumption of the session in case of disconnect due to network issues. -Dave Wright On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunningted.dunn...@gmail.com wrote: That doesn't sound right to me. Is there a Zookeeper expert in the house? On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: I foolishly did not investigate the ZK code closely enough and it seems
Re: closing session on socket close vs waiting for timeout
if this mechanism were used very often, we would get a huge number of session expirations when a server fails. you are trading fast error detection for the ability to tolerate temporary network and server outages. to be honest this seems like something that in theory sounds like it will work in practice, but once deployed we start getting session expirations for cases that we really do not want or expect. ben On 09/01/2010 12:47 PM, Patrick Hunt wrote: Ben, in this case the session would be tied directly to the connection, we'd explicitly deny session re-establishment for this session type (so 4 would fail). Would that address your concern, others? Patrick On 09/01/2010 10:03 AM, Benjamin Reed wrote: i'm a bit skeptical that this is going to work out properly. a server may receive a socket reset even though the client is still alive: 1) client sends a request to a server 2) client is partitioned from the server 3) server starts trying to send response 4) client reconnects to a different server 5) partition heals 6) server gets a reset from client at step 6 i don't think you want to delete the ephemeral nodes. ben On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote: Yes that's right. Which network issues can cause the socket to close without the initiating process closing the socket? In my limited experience in this area network issues were more prone to leave dead sockets open rather than vice versa so I don't know what to look out for. Thanks, Camille -Original Message- From: Dave Wright [mailto:wrig...@gmail.com] Sent: Tuesday, August 31, 2010 1:14 PM To: zookeeper-user@hadoop.apache.org Subject: Re: closing session on socket close vs waiting for timeout I think he's saying that if the socket closes because of a crash (i.e. not a normal zookeeper close request) then the session stays alive until the session timeout, which is of course true since ZK allows reconnection and resumption of the session in case of disconnect due to network issues. 
-Dave Wright On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunningted.dunn...@gmail.com wrote: That doesn't sound right to me. Is there a Zookeeper expert in the house? On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: I foolishly did not investigate the ZK code closely enough and it seems that closing the socket still waits for the session timeout to remove the session.
Re: closing session on socket close vs waiting for timeout
i'm a bit skeptical that this is going to work out properly. a server may receive a socket reset even though the client is still alive: 1) client sends a request to a server 2) client is partitioned from the server 3) server starts trying to send response 4) client reconnects to a different server 5) partition heals 6) server gets a reset from client at step 6 i don't think you want to delete the ephemeral nodes. ben On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote: Yes that's right. Which network issues can cause the socket to close without the initiating process closing the socket? In my limited experience in this area network issues were more prone to leave dead sockets open rather than vice versa so I don't know what to look out for. Thanks, Camille -Original Message- From: Dave Wright [mailto:wrig...@gmail.com] Sent: Tuesday, August 31, 2010 1:14 PM To: zookeeper-user@hadoop.apache.org Subject: Re: closing session on socket close vs waiting for timeout I think he's saying that if the socket closes because of a crash (i.e. not a normal zookeeper close request) then the session stays alive until the session timeout, which is of course true since ZK allows reconnection and resumption of the session in case of disconnect due to network issues. -Dave Wright On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunningted.dunn...@gmail.com wrote: That doesn't sound right to me. Is there a Zookeeper expert in the house? On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: I foolishly did not investigate the ZK code closely enough and it seems that closing the socket still waits for the session timeout to remove the session.
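The session-tracking scheme described earlier in this thread - followers only record when they last heard from each connected session and forward those touch times to the leader, which alone decides expiry - can be sketched as a toy model. The class and method names here are illustrative, not ZooKeeper internals:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of ZooKeeper's session tracking: followers only report
// last-heard times; the leader is the sole authority on expiration.
public class SessionExpiryModel {
    private final Map<Long, Long> lastHeard = new HashMap<>(); // sessionId -> millis

    // A follower reports that it heard from a session at time t.
    // Reports may arrive out of order, so keep the latest time.
    public void touch(long sessionId, long t) {
        lastHeard.merge(sessionId, t, Math::max);
    }

    // The leader expires every session not heard from within the timeout.
    public List<Long> expireOlderThan(long now, long timeoutMs) {
        List<Long> expired = new ArrayList<>();
        lastHeard.entrySet().removeIf(e -> {
            boolean dead = now - e.getValue() > timeoutMs;
            if (dead) expired.add(e.getKey());
            return dead;
        });
        return expired;
    }
}
```

This also shows why a follower cannot expire a session on its own: it only holds a partial view of the touch times, and the merged map lives at the leader.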
Re: Session expiration caused by time change
i put up a patch that should address the problem. now i need to write a test case. the only way i can think of is to route the calls to System.currentTimeMillis through a utility class that i can mock for testing. any better ideas? ben On 08/19/2010 03:53 PM, Ted Dunning wrote: Put in a four letter command that will put the server to sleep for 15 seconds! :-) On Thu, Aug 19, 2010 at 3:51 PM, Benjamin Reed br...@yahoo-inc.com wrote: i'm updating ZOOKEEPER-366 with this discussion and will try to get a patch out. Qing (or anyone else), can you reproduce it pretty easily?
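The utility-class idea can look something like this (a sketch; the names are made up for illustration, and the eventual ZooKeeper patch may differ):

```java
// Minimal indirection around System.currentTimeMillis() so tests can
// install a fake clock and simulate time jumps without changing callers.
public class Time {
    public interface Source { long now(); }

    // Production default: the real system clock.
    private static volatile Source source = System::currentTimeMillis;

    public static long currentTimeMillis() { return source.now(); }

    // Test hook: install a controllable clock, e.g. () -> 42L.
    public static void setSource(Source s) { source = s; }
}
```

A test for the expiration code would then call Time.setSource with a lambda whose return value it advances by hand, jumping it forward to reproduce the bug deterministically.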
Re: Session expiration caused by time change
yes, you are right. we could do this. it turns out that the expiration code is very simple:

    while (running) {
        currentTime = System.currentTimeMillis();
        if (nextExpirationTime > currentTime) {
            this.wait(nextExpirationTime - currentTime);
            continue;
        }
        SessionSet set;
        set = sessionSets.remove(nextExpirationTime);
        if (set != null) {
            for (SessionImpl s : set.sessions) {
                sessionsById.remove(s.sessionId);
                expirer.expire(s);
            }
        }
        nextExpirationTime += expirationInterval;
    }

so we can detect a jump very easily: if nextExpirationTime < currentTime, we have jumped ahead in time. now the question is, what do we do with this information? option 1) we could figure out the jump (currentTime - nextExpirationTime is a good estimate) and move all of the sessions forward by that amount. option 2) we could converge on the time by having a policy to always wait at least half a tick time. there probably are other options as well. i kind of like option 2. worst case is it will make the sessions expire in half the time that they should, but this shouldn't be too much of a problem since clients send a ping if they are idle for 1/3 of their session timeout. ben On 08/19/2010 08:39 AM, Ted Dunning wrote: True. But it knows that there has been a jump. Quiet time can be distinguished from clock shift by assuming that members of the cluster don't all jump at the same time. I would imagine that a recent clock jump estimate could be kept and buckets that would otherwise expire due to such a jump could be given a bit of a second lease on life, delaying all of their expiration. Since time-outs are relatively short, the server would be able to forget about the bump very shortly. On Thu, Aug 19, 2010 at 8:22 AM, Benjamin Reed br...@yahoo-inc.com wrote: if we try to use network messages to detect and correct the situation, it seems like we would recreate the problem we are having with ntp, since that is exactly what it does.
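Option 2 in the message above amounts to flooring every wait at half a tick, so that after a forward clock jump the due expiration buckets drain at half-tick pace instead of all at once. A sketch of just that calculation (illustrative, not the actual patch):

```java
// Sketch of "option 2": always wait at least half a tick between
// expiration buckets. In normal operation the wanted wait is about a
// full interval, so behavior is unchanged; after a forward clock jump
// wanted goes negative, and the half-tick floor means sessions expire
// at worst in half their nominal time rather than instantly.
public class ExpiryWait {
    public static long nextWait(long nextExpirationTime, long currentTime,
                                long tickTime) {
        long wanted = nextExpirationTime - currentTime;
        return Math.max(wanted, tickTime / 2); // floor: half a tick
    }
}
```

The halved worst case is tolerable because, as noted above, idle clients ping at 1/3 of their session timeout, so live clients still touch their sessions in time.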
Re: Session expiration caused by time change
if we can't rely on the clock, we cannot say things like "if ... for 5 seconds." also, clients connect to servers, not vice versa, so we cannot say things like "server can attempt to reconnect." ben On 08/19/2010 10:17 AM, Vishal K wrote: Hi Ted, I haven't given it serious thought yet, but I don't think it is necessary for the cluster to keep track of time. A node can make its own decision. For the sake of argument, let's say that we have a client and a server with the following policy: 1. Client is supposed to send a ping to the server every 1 sec. 2. If the server does not hear from the client for 5 seconds, then the server declares that the client is dead. 3. Similarly, if the client cannot communicate with the server for 5 seconds, the client declares that the server is dead. If the client receives a timeout (say while doing some IO) because of a time jump, it should check the number of pings that have failed with the server. If the number is 5, then this is a true failure. If the number is less than 5, then this is because of a time drift. At the server side, the server can attempt to reconnect (or send a ping to the client) after it receives a timeout. Thus, if the timeout occurred because of time drift, the server will reconnect and continue. We should of course have an upper bound on the number of retries, etc. For ZK, it is important to handle time jumps on the ZK leader. I believe that the pattern of these problems is a slow slippage behind and a sudden jump forward. You won't see the slippage. You will mainly see a jump forward. Note that with a large enough number of nodes, multiple nodes could see their time jumping forward. Therefore, comparing time between two servers may not help. On Thu, Aug 19, 2010 at 7:51 AM, Vishal K vishalm...@gmail.com wrote: Hi, I remember Ben had opened a jira for clock jumps earlier: https://issues.apache.org/jira/browse/ZOOKEEPER-366. It is not uncommon to have clocks jump forward in virtualized environments. 
It is desirable to modify ZooKeeper to handle this situation (as much as possible) internally. It would need to be done for both client - server connections and server - server connections. One obvious solution is to retry a few times (send a ping) after getting a timeout. Another way is to count the number of pings that have been sent after receiving the timeout. If the number of pings does not match the expected number (say, 5 ping attempts should have finished for a 5 sec timeout), then wait till all the pings are finished. In effect, do not completely rely on the clock. Any comments? -Vishal On Thu, Aug 19, 2010 at 3:52 AM, Qing Yan qing...@gmail.com wrote: Oh.. our servers are also running in a virtualized environment. On Thu, Aug 19, 2010 at 2:58 PM, Martin Waite waite@gmail.com wrote: Hi, I have tripped over similar problems testing Red Hat Cluster in virtualised environments. I don't know whether recent linux kernels have improved their interaction with VMWare, but in our environments clock drift caused by lost ticks can be substantial, requiring NTP to sometimes jump the clock rather than control acceleration. In one of our internal production rigs, the local NTP servers themselves were virtualised - causing absolute mayhem when heavy loads hit the other guests on the same physical hosts. The effect on RHCS (v2.0) is quite dramatic. A forward jump in time by 10 seconds always causes a member to prematurely time out on a network read, causing the member to drop out and trigger a cluster reconfiguration. Apparently NTP is integrated with RHCS version 3, but I don't know what is meant by that. I guess this post is not entirely relevant to ZK, but I am just making the point that virtualisation (of NTP servers and/or clients) can cause repeated premature timeouts. On Linux, I believe that there is a class of timers provided that is immune to this, but I doubt that there is a platform-independent way of coping with this. My 2p. 
regards, Martin On 18 August 2010 16:53, Patrick Hunt ph...@apache.org wrote: Do you expect the time to be wrong frequently? If ntp is running it should never get out of sync more than a small amount. As long as this is less than ~your timeout you should be fine. Patrick On 08/18/2010 01:04 AM, Qing Yan wrote: Hi, The testcase is fairly simple. We have a client which connects to ZK, registers an ephemeral node and watches on it. Now change the client machine's time - session killed.. Here is the log: *2010-08-18
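Vishal's ping-counting heuristic from earlier in the thread can be sketched as a pure check (hypothetical names, not ZooKeeper code): on a timeout, compare the pings actually sent against what the elapsed wall-clock time implies.

```java
// Sketch: distinguish a real timeout from a clock jump by counting
// pings actually sent versus pings the elapsed time says should have
// been sent. After a forward jump, elapsed time overstates reality,
// so far fewer pings went out than expected.
public class PingCheck {
    public static boolean isRealTimeout(int pingsSent, long elapsedMs,
                                        long pingIntervalMs) {
        long expected = elapsedMs / pingIntervalMs;
        // Fewer pings than elapsed time implies => clock likely jumped;
        // treat the timeout as a false alarm and keep pinging.
        return pingsSent >= expected;
    }
}
```

Ben's caveat above still applies: this only decouples the decision from the wall clock on the side that counts its own pings; it cannot make the server reconnect to the client.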
Re: Session expiration caused by time change
i'm updating ZOOKEEPER-366 with this discussion and will try to get a patch out. Qing (or anyone else), can you reproduce it pretty easily? thanx ben On 08/19/2010 09:29 AM, Ted Dunning wrote: Nice (modulo inverting the sign in your text). Option 2 seems very simple. That always attracts me. On Thu, Aug 19, 2010 at 9:19 AM, Benjamin Reed br...@yahoo-inc.com wrote: yes, you are right. we could do this. it turns out that the expiration code is very simple:

    while (running) {
        currentTime = System.currentTimeMillis();
        if (nextExpirationTime > currentTime) {
            this.wait(nextExpirationTime - currentTime);
            continue;
        }
        SessionSet set;
        set = sessionSets.remove(nextExpirationTime);
        if (set != null) {
            for (SessionImpl s : set.sessions) {
                sessionsById.remove(s.sessionId);
                expirer.expire(s);
            }
        }
        nextExpirationTime += expirationInterval;
    }

so we can detect a jump very easily: if nextExpirationTime < currentTime, we have jumped ahead in time. now the question is, what do we do with this information? option 1) we could figure out the jump (currentTime - nextExpirationTime is a good estimate) and move all of the sessions forward by that amount. option 2) we could converge on the time by having a policy to always wait at least half a tick time. there probably are other options as well. i kind of like option 2. worst case is it will make the sessions expire in half the time that they should, but this shouldn't be too much of a problem since clients send a ping if they are idle for 1/3 of their session timeout. ben On 08/19/2010 08:39 AM, Ted Dunning wrote: True. But it knows that there has been a jump. Quiet time can be distinguished from clock shift by assuming that members of the cluster don't all jump at the same time. I would imagine that a recent clock jump estimate could be kept and buckets that would otherwise expire due to such a jump could be given a bit of a second lease on life, delaying all of their expiration. 
Since time-outs are relatively short, the server would be able to forget about the bump very shortly. On Thu, Aug 19, 2010 at 8:22 AM, Benjamin Reedbr...@yahoo-inc.com wrote: if we try to use network messages to detect and correct the situation, it seems like we would recreate the problem we are having with ntp, since that is exactly what it does.
Re: Weird ephemeral node issue
there are two things to keep in mind when thinking about this issue: 1) if a zk client is disconnected from the cluster, the client is essentially in limbo. because the client cannot talk to a server it cannot know if its session is still alive. it also cannot close its session. 2) the client only finds out about session expiration events when the client reconnects to the cluster. if zk tells a client that its session is expired, the ephemerals that correspond to that session will already be cleaned up. one of the main design points about zk is that zk only gives correct information. if zk cannot give correct information, it basically says i don't know. connection loss exceptions and disconnected states are basically i don't know. generally applications we design go into a safe mode, meaning they may serve reads but reject changes, when disconnected from zk and only kill themselves when they find out their session has expired. ben ps - session information is replicated to all zk servers, so if a leader dies, all replicas know the sessions that are currently active and their timeouts. On 08/16/2010 09:03 PM, Ted Dunning wrote: Ben or somebody else will have to repeat some of the detailed logic for this, but it has to do with the fact that you can't be sure what has happened during the network partition. One possibility is the one you describe, but another is that the partition happened because a majority of the ZK cluster lost power and you can't see the remaining nodes. Those nodes will continue to serve any files in a read-only fashion. If the partition involves you losing contact with the entire cluster at the same time a partition of the cluster into a quorum and a minority happens, then your ephemeral files could continue to exist at least until the breach in the cluster itself is healed. Suffice it to say that there are only a few strategies that leave you with a coherent picture of the universe. 
Importantly, you shouldn't assume that the ephemerals will disappear at the same time as the session expiration event is delivered. On Mon, Aug 16, 2010 at 8:31 PM, Qing Yan <qing...@gmail.com> wrote: Ouch, is this the current ZK behavior? This is unexpected. If the client gets partitioned from the ZK cluster, it should get notified and take some action (e.g. commit suicide); otherwise, how do you tell whether an ephemeral node is really up or down? Zombies can create synchronization nightmares.. On Mon, Aug 16, 2010 at 7:22 PM, Dave Wright <wrig...@gmail.com> wrote: Another possible cause for this that I ran into recently with the c client - you don't get the session expired notification until you are reconnected to the quorum and it informs you the session is lost. If you get disconnected and can't reconnect, you won't get the notification. Personally I think the client api should track the session expiration time locally and inform you once it's expired. On Aug 16, 2010 2:09 AM, Qing Yan <qing...@gmail.com> wrote: Hi Ted, Do you mean a GC problem can prevent delivery of the SESSION EXPIRE event? Hum...so you have met this problem before? I didn't see any OOM though, will look into it more. On Mon, Aug 16, 2010 at 12:46 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: I am assuming that y...
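The "safe mode" pattern ben describes above (serve reads but reject changes while disconnected, and die only on expiration) can be sketched as a small state machine. The class and method names are illustrative stand-ins, not ZooKeeper API; the event strings mirror the client's KeeperState names.

```java
// A minimal sketch of an application-side "safe mode" gate. The app
// feeds it the KeeperState of each watcher event and gates its own
// operations on the result.
public class SafeModeGate {
    public enum State { CONNECTED, SAFE_MODE, DEAD }

    private volatile State state = State.SAFE_MODE; // not yet connected

    public void onEvent(String keeperState) {
        switch (keeperState) {
            case "SyncConnected": state = State.CONNECTED; break;
            // limbo: the session may still be alive, we just can't know
            case "Disconnected":  state = State.SAFE_MODE; break;
            // by the time we hear this, our ephemerals are already gone
            case "Expired":       state = State.DEAD;      break;
        }
    }

    public boolean allowReads()  { return state != State.DEAD; }
    public boolean allowWrites() { return state == State.CONNECTED; }

    public static void main(String[] args) {
        SafeModeGate gate = new SafeModeGate();
        gate.onEvent("SyncConnected");
        gate.onEvent("Disconnected");
        // reads still served, changes rejected while disconnected
        System.out.println(gate.allowReads() + " " + gate.allowWrites()); // prints true false
    }
}
```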
Re: A question about Watcher
zookeeper takes care of reregistering all watchers on reconnect. you don't need to do anything. ben On 08/16/2010 09:04 AM, Qian Ye wrote: Hi all: Will the watchers of a client be lost when the client disconnects from a Zookeeper server? It is said at http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkWatches that *When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently.* Does that mean we need not do anything extra about watchers if a client disconnects from Zookeeper server A and reconnects to Zookeeper server B? Or should I reregister all the watchers if this kind of reconnection happens? thx~
Re: A question about Watcher
good point ted! i should have waited a bit longer before responding :) ben On 08/16/2010 09:20 AM, Ted Dunning wrote: There are two different concepts. One is connection loss. Watchers survive this, and the client automatically connects to another member of the ZK cluster. The other is session expiration. Watchers do not survive this. This happens when a client does not provide timely evidence that it is alive and is marked as having disappeared by the cluster. On Mon, Aug 16, 2010 at 9:04 AM, Qian Ye <yeqian...@gmail.com> wrote: Hi all: Will the watchers of a client be lost when the client disconnects from a Zookeeper server? It is said at http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkWatches that *When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently.* Does that mean we need not do anything extra about watchers if a client disconnects from Zookeeper server A and reconnects to Zookeeper server B? Or should I reregister all the watchers if this kind of reconnection happens? thx~ -- With Regards! Ye, Qian
Re: A question about Watcher
the client does keep track of the watches that it has outstanding. when it reconnects to a new server it tells the server what it is watching for and the last view of the system that it had. ben On 08/16/2010 09:28 AM, Qian Ye wrote: thx for the explanation. Since the watches can be preserved when the client switches the zookeeper server it connects to, does that mean all the watcher information is saved on all the zookeeper servers? I didn't find any mention of the client holding the watcher information. On Tue, Aug 17, 2010 at 12:21 AM, Ted Dunning <ted.dunn...@gmail.com> wrote: I should correct this. The watchers will deliver a session expiration event, but since the connection is closed at that point no further events will be delivered and the cluster will remove them. This is as good as the watchers disappearing. On Mon, Aug 16, 2010 at 9:20 AM, Ted Dunning <ted.dunn...@gmail.com> wrote: The other is session expiration. Watchers do not survive this. This happens when a client does not provide timely evidence that it is alive and is marked as having disappeared by the cluster.
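A rough sketch of the client-side bookkeeping ben describes: the client remembers which paths it is watching so it can replay them to a new server on reconnect. All names here are illustrative; the real Java client keeps similar maps internally and sends them in a SetWatches request along with the last zxid it saw, so the new server can both re-arm the watches and immediately fire any that already triggered.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical registry mirroring what the client tracks per watch type.
public class WatchRegistry {
    private final Set<String> dataWatches = new HashSet<>();
    private final Set<String> childWatches = new HashSet<>();

    public void addDataWatch(String path)  { dataWatches.add(path); }
    public void addChildWatch(String path) { childWatches.add(path); }

    // On reconnect, this is what gets handed to the new server so it can
    // re-arm everything; no application involvement is needed.
    public Map<String, Set<String>> watchesToReplay() {
        Map<String, Set<String>> replay = new HashMap<>();
        replay.put("data", new HashSet<>(dataWatches));
        replay.put("child", new HashSet<>(childWatches));
        return replay;
    }

    public static void main(String[] args) {
        WatchRegistry reg = new WatchRegistry();
        reg.addDataWatch("/config");
        reg.addChildWatch("/workers");
        System.out.println(reg.watchesToReplay());
    }
}
```

This also answers Qian Ye's question: the watcher information lives on the client and on whichever single server it is currently connected to, not on every server in the ensemble.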
Re: How to handle Node does not exist error?
i thought there was a jira about supporting embedded zookeeper. (i remember rejecting a patch to fix it. one of the problems is that we have a couple of places that do System.exit().) i can't seem to find it though. one case that would be great for embedding is writing test cases, so i think it would be useful for that. ben On 08/12/2010 03:25 PM, Ted Dunning wrote: I am not saying that the API shouldn't support embedded ZK. I am just saying that it is almost always a bad idea. It isn't that I am asking you to not do it, it is just that I am describing the experience I have had and that I have seen others have. In a nutshell, embedding leads to problems and it isn't hard to see why. On Thu, Aug 12, 2010 at 3:02 PM, Vishal Kvishalm...@gmail.com wrote: 2. With respect to Ted's point about backward compatibility, I would suggest to take an approach of having an API to support embedded ZK instead of asking users to not embed ZK.
Re: ZK recovery questions
i did a benchmark a while back to see the effect of turning off the disk. (it wasn't as big as you would think.) i had to modify the code. there is an option to turn off the sync in the config that will get you most of the performance you would get by turning off the disk entirely. ben On 07/20/2010 11:01 PM, Ashwin Jayaprakash wrote: I did try a quick test on Windows (yes, some of us use Windows :). I thought simply changing the dataDir to the /dev/null equivalent on Windows would do the trick. It didn't work. It looks like a Java issue, because I noticed inconsistencies in the File API regarding this. I wrote about it here - http://javaforu.blogspot.com/2010/07/devnull-on-windows.html . BTW the Windows equivalent is nul. This is the error I got on Windows (below). The mkdirs() returns false. As noted on my blog, it returns true for some cases.

2010-07-20 22:25:47,851 - FATAL [main:zookeeperserverm...@62] - Unexpected exception, exiting abnormally
java.io.IOException: Unable to create data directory nul:\version-2
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:79)
        at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:102)
        at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:85)
        at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:51)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:108)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:76)

Ashwin.
Re: Do implementations of Watcher need to be thread-safe?
as long as a watcher object is only used with a single ZooKeeper object, it will always be called by the same thread. ben On 07/21/2010 11:12 AM, Joshua Ball wrote: Hi, Do implementations of Watcher need to be thread-safe, or can I assume that process(...) will always be called by the same thread? Thanks, Josh
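The reason a Watcher needn't be thread-safe can be shown with a minimal sketch of the event-delivery model (an illustration of the mechanism, not the actual client code): events for a handle are queued and drained by one dedicated dispatcher thread, so process() calls never overlap.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of single-threaded event dispatch: many threads may enqueue
// events, but only the one "event-thread" ever runs them.
public class SerialDispatcher {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

    public SerialDispatcher() {
        Thread eventThread = new Thread(() -> {
            try {
                while (true) { queue.take().run(); }
            } catch (InterruptedException e) { /* shutdown */ }
        }, "event-thread");
        eventThread.setDaemon(true);
        eventThread.start();
    }

    public void submit(Runnable event) { queue.add(event); }

    public static void main(String[] args) throws Exception {
        SerialDispatcher d = new SerialDispatcher();
        Set<String> threads = ConcurrentHashMap.newKeySet();
        CountDownLatch done = new CountDownLatch(3);
        for (int i = 0; i < 3; i++) {
            d.submit(() -> { threads.add(Thread.currentThread().getName()); done.countDown(); });
        }
        done.await();
        System.out.println(threads);  // prints [event-thread]
    }
}
```

The caveat in ben's answer follows too: share one watcher object between two ZooKeeper handles and you get two event threads, so the single-thread guarantee no longer holds.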
Re: BookKeeper Doubts
you have concluded correctly. 1) bookkeeper was designed for a process to use as a write-ahead log, so as a simplifying assumption we assume a single writer to a log. we should be throwing an exception if you try to write to a handle that you obtained using openLedger. can you open a jira for that? 2) this is mostly true; there are some exceptions. the creator of a ledger can read entries even while the ledger is still being written to. we would like to add the ability for a reader to assert the last entry in a ledger and read up to that entry, but this is not yet in the code. 3) there is one other bug you are seeing: before a ledger can be read, it must be closed. as your code shows, a process can open a ledger for reading while it is still being written to, which causes an implicit close that is not detected by the writer. this is a nice test case :) thanx ben On 07/17/2010 05:02 PM, André Oriani wrote: Hi, I was not sure I had understood the behavior of BookKeeper from the documentation, so I made a little program, reproduced below, to see what BookKeeper looks like in action. Assuming my code is correct (you never know when your code has some nasty obvious bugs that only a person other than you can see), I could draw the following conclusions: 1) Only the creator can add entries to a ledger, even though you can open the ledger, get a handle and call addEntry on it. No exception is thrown. In other words, you cannot open a ledger for append. 2) Readers are able to see only the entries that were added to a ledger before someone opened it for reading. If you want to ensure readers will see all the entries, you must add all entries before any reader attempts to read from the ledger. Could someone please tell me if those conclusions are correct or if I am mistaken? In the latter case, could that person also tell me what is wrong?
Thanks a lot for the attention and the patience with this BookKeeper newbie, André

package br.unicamp.zooexp.booexp;

import java.io.IOException;
import java.util.Enumeration;

import org.apache.bookkeeper.client.BKException;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;
import org.apache.zookeeper.KeeperException;

public class BookTest {
    public static void main(String... args) throws IOException,
            InterruptedException, KeeperException, BKException {
        BookKeeper bk = new BookKeeper("127.0.0.1");
        LedgerHandle lh = bk.createLedger(DigestType.CRC32, "123".getBytes());
        long lh_id = lh.getId();
        lh.addEntry("Teste".getBytes());
        lh.addEntry("Test2".getBytes());
        System.out.printf("Got %d entries for lh\n", lh.getLastAddConfirmed() + 1);
        lh.addEntry("Test3".getBytes());

        LedgerHandle lh1 = bk.openLedger(lh_id, DigestType.CRC32, "123".getBytes());
        System.out.printf("Got %d entries for lh1\n", lh1.getLastAddConfirmed() + 1);

        lh.addEntry("Test4".getBytes());
        lh.addEntry("Test5".getBytes());
        lh.addEntry("Test6".getBytes());
        System.out.printf("Got %d entries for lh\n", lh.getLastAddConfirmed() + 1);
        Enumeration<LedgerEntry> seq = lh.readEntries(0, lh.getLastAddConfirmed());
        while (seq.hasMoreElements()) {
            System.out.println(new String(seq.nextElement().getEntry()));
        }
        lh.close();

        lh1.addEntry("Test7".getBytes());
        lh1.addEntry("Test8".getBytes());
        System.out.printf("Got %d entries for lh1\n", lh1.getLastAddConfirmed() + 1);
        seq = lh1.readEntries(0, lh1.getLastAddConfirmed());
        while (seq.hasMoreElements()) {
            System.out.println(new String(seq.nextElement().getEntry()));
        }
        lh1.close();

        LedgerHandle lh2 = bk.openLedger(lh_id, DigestType.CRC32, "123".getBytes());
        lh2.addEntry("Test9".getBytes());
        System.out.printf("Got %d entries for lh2\n", lh2.getLastAddConfirmed() + 1);
        seq = lh2.readEntries(0, lh2.getLastAddConfirmed());
        while (seq.hasMoreElements()) {
            System.out.println(new String(seq.nextElement().getEntry()));
        }
        bk.halt();
    }
}

Output:
Got 2 entries for lh
Got 3 entries for lh1
Got 6 entries for lh
Teste
Test2
Test3
Test4
Test5
Test6
Got 3 entries for lh1
Teste
Test2
Test3
Got 3 entries for lh2
Teste
Test2
Test3
RE: cleanup ZK takes 40-60 seconds
how big is your database? it would be good to know the timing of the two calls. shutdown should take very little time. sent from my droid -Original Message- From: Vishal K [vishalm...@gmail.com] Received: 7/16/10 6:31 PM To: zookeeper-user@hadoop.apache.org [zookeeper-u...@hadoop.apache.org] Subject: cleanup ZK takes 40-60 seconds Hi, We have embedded ZK server in our application. We start a thread in our application and call QuorumPeerMain.InitializeArgs(). When cleaning-up ZK we call QuorumPeerMain.shutdown() and wait for the thread that is calling InitializeArgs() to finish. These two steps are taking around 60 seconds. I could probably not wait for InitializeArgs() to finish and that might speed up things. However, I am not sure why the cleanup should take such a long time. Can anyone comment on this? Thanks. -Vishal
Re: total # of zknodes
i think there is a wiki page on this, but here is the short answer: the number of znodes impacts two things: memory footprint and recovery time. there is a base overhead per znode to store its path, pointers to the data, pointers to the acl, etc. i believe that is around 100 bytes. you can't just divide your memory by 100+1K (for data) though, because the GC needs to be able to run and collect things and maintain free space. if you use 3/4 of your available memory, that would mean with 4G you can store about three million znodes. when there is a crash and you recover, servers may need to read this data back off the disk or over the network. that means it will take about a minute to read 3G from the disk, and perhaps a bit more to read it over the network, so you will need to adjust your initLimit accordingly. of course this is all back-of-the-envelope. i would suggest doing some quick benchmarks to make sure your results are in line with expectation. ben On 07/15/2010 02:56 AM, Maarten Koopmans wrote: Hi, I am mapping a filesystem to ZooKeeper, and use it for locking and mapping a filesystem namespace to a flat data object space (like S3). So assuming proper nesting and small ZooKeeper nodes (< 1KB), how many nodes could a cluster with a few GBs of memory per instance realistically hold in total? Thanks, Maarten
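ben's back-of-the-envelope arithmetic can be written down directly. All constants here are his rough figures from the answer above, not measured values:

```java
// Rough capacity estimate: ~100 bytes of per-znode overhead plus the
// data size, with only ~3/4 of the heap usable so the GC has headroom.
public class ZnodeCapacity {
    public static long estimate(long heapBytes, long dataBytesPerZnode) {
        long overheadPerZnode = 100;          // path, data/acl pointers, etc.
        long usable = heapBytes * 3 / 4;      // leave 1/4 for GC headroom
        return usable / (overheadPerZnode + dataBytesPerZnode);
    }

    public static void main(String[] args) {
        long heap = 4L * 1024 * 1024 * 1024;  // 4GB heap
        // ~2.9 million znodes at 1KB each -- "about three million"
        System.out.println(estimate(heap, 1024));
    }
}
```

The same envelope gives the recovery-time side: those three million 1KB znodes are ~3GB of snapshot, hence the minute or so of disk reads that initLimit has to accommodate.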
Re: Achieving quorum with only half of the nodes
by custom QuorumVerifier are you referring to http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperHierarchicalQuorums.html ? ben On 07/14/2010 12:43 PM, Sergei Babovich wrote: Hi, We are currently evaluating use of ZK in our infrastructure. In our setup we have a set of servers running from two different power feeds. If one power feed goes away, so does half of the servers. This makes it problematic to configure a ZK ensemble that would tolerate such an outage. Network partitioning is not an issue in our case. The only solution I have come up with so far is to provide a custom QuorumVerifier that adds a little premium in case all servers in the quorum set are from the same group. Basically, if we have only half of the votes but all of them belong to the same group, then we decide we have a quorum. Any ideas or better solutions are very much appreciated. Sorry if this has already been discussed/answered. Regards, Sergei
Re: running the systest
can you try the following:

Index: src/contrib/fatjar/build.xml
===================================================================
--- src/contrib/fatjar/build.xml	(revision 962637)
+++ src/contrib/fatjar/build.xml	(working copy)
@@ -46,6 +46,7 @@
       <fileset dir="${zk.root}/build/classes" excludes="**/.generated/"/>
       <fileset dir="${zk.root}/build/test/classes"/>
       <zipgroupfileset dir="${zk.root}/build/lib" includes="*.jar"/>
+      <zipgroupfileset dir="${zk.root}/build/test/lib" includes="*.jar"/>
       <zipgroupfileset dir="${zk.root}/src/java/lib" includes="*.jar"/>
     </jar>
   </target>

thanx ben On 07/09/2010 09:04 AM, Stuart Halloway wrote: Happy to do it. Should I change the fatjar build to add junit, or is there another way folks prefer to do it? I am assuming that somebody is running the tests and has a local workaround in place. :-) Stu Hi Stuart, The instructions are just out of date. If you could open a jira and post a patch to it, that would be great! We should try getting this into 3.3.2! That would be useful! Thanks mahadev On 7/9/10 6:36 AM, Stuart Halloway <stuart.hallo...@gmail.com> wrote: Hi all, I am trying to run the systest and have hit a few minor issues: (1) The readme says src/contrib/jarjar; apparently it should be src/contrib/fatjar. (2) The compiled fatjar seems to be missing junit, so the launch instructions do not work. I can fix or work around these, but I wanted to see if maybe the instructions are just out of date and there is an easy (but currently undocumented) way to launch the tests. Thanks, Stu
Re: Suggested way to simulate client session expiration in unit tests?
the difference between close and disconnect is that close will actually try to tell the server to kill the session before disconnecting. a paranoid lock implementation doesn't need to test its session. it should just monitor watch events to look for disconnect and expired events. if a client is in the disconnected state, it cannot reliably know whether the session is still active, so it should consider the lock in limbo until it gets either the reconnect event or the expired event. ben On 07/06/2010 05:42 PM, Jeremy Davis wrote: Thanks! That seems to work, but it is approximately the same as zooKeeper.close() in that there is no SessionExpired event that comes up through the default Watcher. Maybe I'm assuming more from ZK than I should, but should a paranoid lock implementation periodically test its session by reading or writing a value? Regards, -JD On Tue, Jul 6, 2010 at 10:32 AM, Mahadev Konar <maha...@yahoo-inc.com> wrote: Hi Jeremy, zk.disconnect() is the right way to disconnect from the servers. For session expiration you just have to make sure that the client stays disconnected for more than the session expiration interval. Hope that helps. Thanks mahadev On 7/6/10 9:09 AM, Jeremy Davis <jerdavis.cassan...@gmail.com> wrote: Is there a recommended way of simulating a client session expiration in unit tests? I see a TestableZooKeeper.java, with a pauseCnxn() method that does cause the connection to timeout/disconnect and reconnect. Is there an easy way to push this all the way through to session expiration? Thanks, -JD
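The lock advice above — watch for disconnect/expired events instead of probing the session with reads or writes — can be sketched as a tiny state machine. Names are illustrative, not ZooKeeper API; the event strings mirror the client's KeeperState values.

```java
// Sketch of a paranoid lock's view of its own state. While Disconnected
// the lock is neither held nor lost: it is in limbo until the client
// either reconnects (still held) or learns the session expired (lost).
public class LockMonitor {
    public enum LockState { HELD, LIMBO, LOST }

    private LockState state = LockState.HELD; // lock znode created while connected

    public void onEvent(String keeperState) {
        switch (keeperState) {
            case "Disconnected":
                if (state == LockState.HELD) state = LockState.LIMBO; // can't know either way
                break;
            case "SyncConnected":
                if (state == LockState.LIMBO) state = LockState.HELD; // session survived
                break;
            case "Expired":
                state = LockState.LOST; // ephemeral lock node is already gone
                break;
        }
    }

    public LockState state() { return state; }

    public static void main(String[] args) {
        LockMonitor m = new LockMonitor();
        m.onEvent("Disconnected");
        System.out.println(m.state()); // prints LIMBO: act as if neither held nor lost
        m.onEvent("Expired");
        System.out.println(m.state()); // prints LOST
    }
}
```

The key design point is the LIMBO state: a holder that keeps acting on the lock while disconnected risks a second client legitimately acquiring it after expiration.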
Re: integration tests
we do this in our tests for ZooKeeper. bookkeeper uses the testing classes as well, unfortunately, we haven't documented the interface. ben On 06/22/2010 08:42 PM, Ishaaq Chandy wrote: Hi all, First some background: 1. We use maven as our build tool. 2. We use Hudson as our CI server, it is setup to delegate build work to a cluster of build-slave VMs. 3. We'd like to do very little (preferably none at all) admin work on each build-slave VM to get it up and running builds. This is so we can grow the build cluster quickly on demand. To this end we'd like our tests to be able to run without requiring external dependencies, i.e. no requirement that there be a running database or some such. We use Cassandra for data storage and have been able to quite successfully write a test extension that configures and starts up an embedded Cassandra instance before running tests that rely on Cassandra. Now, I'd like to do the same for ZooKeeper. Has anyone tackled and solved this problem before? Thanks, Ishaaq
Re: is ZK client thread safe
yes. (except for the single threaded C-client library :) ben On 06/17/2010 10:16 AM, Jun Rao wrote: Hi, Is ZK client thread safe? Is it ok for multiple threads sharing the same ZK client? Thanks, Jun
Re: Completions in C API
the call is executed at a later time on a different thread. the zoo_a* calls are non-blocking, so (subject to the thread scheduling) usually they will return before the request completes. ben On 06/03/2010 01:24 PM, Jack Orenstein wrote: I'm trying to figure out how to use zookeeper's C API. In particular, what can I assume about the execution of the completions passed to zoo_aget and zoo_aset? The documentation for zoo_aset says the completion is the routine to invoke when the request completes. Does this mean that the completion is called some arbitrary time after the request completes, on a different thread? Or is it guaranteed to be executed on the same thread as the thread initiating the zoo_aset call, and complete before zoo_aset returns? Or something else? In general, the 3.2.2 docs seem to be pretty thin on the C API. Pointers to other relevant material would be appreciated. Jack
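The calling model ben describes can be illustrated with a Java analog (a sketch of the threading behavior only, not the C API itself): the async call hands the request to a separate completion thread and returns immediately, so the completion runs later, on a different thread than the caller.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiConsumer;

// Stand-in for the zoo_aget pattern: "aget" queues the request and
// returns; the completion runs on the client's own completion thread.
// The rc/value passed to the completion are fabricated here; in the real
// API they come back from the server.
public class AsyncGet {
    private final ExecutorService completionThread =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "completion-thread");
            t.setDaemon(true);
            return t;
        });

    public void aget(String path, BiConsumer<Integer, String> completion) {
        completionThread.submit(() -> completion.accept(0, "data-for-" + path));
    }

    public static void main(String[] args) throws Exception {
        AsyncGet zk = new AsyncGet();
        CountDownLatch done = new CountDownLatch(1);
        zk.aget("/a", (rc, value) -> {
            System.out.println("completion ran on " + Thread.currentThread().getName());
            done.countDown();
        });
        // control returned here immediately; the completion runs concurrently
        done.await();
        zk.completionThread.shutdown();
    }
}
```

So code after a zoo_a* call must not assume the completion has run; any coordination has to go through the completion itself (or a latch/condition it signals, as above).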
Re: zookeeper crash
charity, do you mind going through your scenario again to give a timeline for the failure? i'm a bit confused as to what happened. ben On 06/02/2010 01:32 PM, Charity Majors wrote: Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable state, though. I said before that we restarted all three nodes, but tracing back, we actually didn't. The zookeeper cluster was refusing all connections until we restarted node one. But once node one had been dropped from the cluster, the other two nodes formed a quorum and started responding to queries on their own. Is that expected as well? I didn't see it in ZOOKEEPER-335, so thought I'd mention it. On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote: Hi Charity, unfortunately this is a known issue not specific to 3.3 that we are working to address. See this thread for some background: http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html I've raised the JIRA level to blocker to ensure we address this asap. As Ted suggested you can remove the datadir -- only on the affected server -- and then restart it. That should resolve the issue (the server will d/l a snapshot of the current db from the leader). Patrick On 06/02/2010 11:11 AM, Charity Majors wrote: I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to get away from a client bug that was crashing my backend services. Unfortunately, this morning I had a server crash, and it brought down my entire cluster. I don't have the logs leading up to the crash, because -- argghffbuggle -- log4j wasn't set up correctly. But I restarted all three nodes, and nodes two and three came back up and formed a quorum.
Node one, meanwhile, does this:

2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a0045
2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 47244640287
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than our epoch b
2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
java.lang.Exception: shutdown Follower
        at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)

All I can find is this,
http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html, which implies that this state should never happen. Any suggestions? If it happens again, I'll just have to roll everything back to 3.2.1 and live with the client crashes.
Re: problem connecting to zookeeper server
good catch lei! gregory, if this helps, can you open a jira to throw an exception in this situation? we should be throwing an invalid argument exception or something in this case. thanx ben On 05/20/2010 09:04 AM, Lei Zhang wrote: Seems you are passing in the wrong arguments. It should have been: public ZooKeeper(String connectString, int sessionTimeout, Watcher watcher) throws IOException What you have in your client code is: On Thu, May 20, 2010 at 5:21 AM, Gregory Haskins <gregory.hask...@gmail.com> wrote: public App() throws Exception { zk = new ZooKeeper("192.168.1.124:2181", 0, this); } Try a sensible timeout value such as 2. The error you are getting means the server has timed out the session. Hope this unstucks you.
Re: Xid out of order. Got 8 expected 7
is this a bug? shouldn't we be returning an error? ben On 05/12/2010 11:34 AM, Patrick Hunt wrote: I think that explains it then - the server is probably dropping the new (3.3.0) getChildren message (xid 7) as it (3.2.2 server) doesn't know about that message type. Then the server responds to the client for a subsequent operation (xid 8), and at that point the client notices that getChildren (xid 7) got lost. Patrick On 05/12/2010 11:30 AM, Jordan Zimmerman wrote: Oh, OK. When I get a moment I'll restart the 3.2.2 and post logs, etc. Yes, we're calling getChildren with the callback. -JZ On May 12, 2010, at 11:28 AM, Patrick Hunt wrote: I'm still interested though... Are you using the new getChildren api that was added to the client in 3.3.0? (it provides a Stat object on return, the old getChildren did not). While we don't officially support a 3.3.0 client with a 3.2.2 server (we do support the other way around), there shouldn't be this type of problem with the configuration you describe. I'd still be interested for you to create that jira. Regards, Patrick On 05/12/2010 11:23 AM, Jordan Zimmerman wrote: Apologies... I thought I was running the 3.3.0 server, but was running the 3.2.2 server with the 3.3.0 client. I upgraded the server and now all works again. Sorry to trouble y'all. -Jordan On May 12, 2010, at 11:11 AM, Patrick Hunt wrote: Hi Jordan, you've seen this once or frequently? (having the server + client logs will help a lot) Patrick On 05/12/2010 11:08 AM, Jordan Zimmerman wrote: Sure - if you think it's a bug. We were using Zookeeper without issue. I then refactored a bunch of code and this new behavior started. I'm starting ZK using zkServer start and haven't made any changes to the code at all. I'll get the logs together and post a JIRA. -JZ On May 12, 2010, at 10:59 AM, Mahadev Konar wrote: Hi Jordan, Can you create a jira for this? And attach all the server logs and client logs related to this timeline? How did you start up the servers?
Are there some changes you might have made accidentally to the servers? Thanks mahadev On 5/12/10 10:49 AM, Jordan Zimmerman <jzimmer...@proofpoint.com> wrote: We've just started seeing an odd error and are having trouble determining the cause. Xid out of order. Got 8 expected 7 Any hints on what can cause this? Any ideas on how to debug? We're using ZK 3.3.0. The error occurs in ClientCnxn.java line 781 -Jordan
Re: How to ensure trasaction create-and-update
i agree with ted. i think he points out some disadvantages of trying to do more. there is a slippery slope with these kinds of things. the implementation is complicated enough even with the simple model that we use. ben On 03/29/2010 08:34 PM, Ted Dunning wrote: I perhaps should not have said power, except insofar as ZK's strengths are in reliability, which derives from simplicity. There are essentially two common ways to implement multi-node update. The first is the traditional db style with begin-transaction paired with either a commit or a rollback after some number of updates. This is clearly unacceptable in the ZK world if the updates are sent to the server, because there can be an indefinite delay between the begin and the commit. A second approach is to buffer all of the updates on the client side and transmit them in a batch to the server to succeed or fail as a group. This allows updates to be arbitrarily complex, which begins to eat away at the no-blocking guarantee a bit. On Mon, Mar 29, 2010 at 8:08 PM, Henry Robinson <he...@cloudera.com> wrote: Could you say a bit about how you feel ZK would sacrifice power and reliability through multi-node updates? My view is that it wouldn't: since all operations are executed serially, there's no concurrency to be lost by allowing multi-updates, and there doesn't need to be a 'start / end' transactional style interface (which I do believe would be very bad). I could see ZK implementing a Sinfonia-style batch operation API which makes all-or-none updates. The reason I can see that it doesn't already allow this is the avowed intent of the original ZK team to keep the API as simple as it can reasonably be, and to not introduce complexity without need.
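Ted's second approach — buffer the updates client-side and submit them to succeed or fail as a group — can be sketched against an in-memory versioned store. All names here are hypothetical; this is not a ZooKeeper API (though a multi-op of roughly this shape later shipped in ZooKeeper as multi()).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sinfonia-style all-or-none batch: validate every op's precondition
// first, then mutate; any failed check rejects the whole batch.
public class MultiOp {
    public static class Op {
        public final String path, value;
        public final int expectedVersion; // -1 means "any version"
        public Op(String path, String value, int expectedVersion) {
            this.path = path; this.value = value; this.expectedVersion = expectedVersion;
        }
    }

    public static class Store {
        public final Map<String, String> data = new HashMap<>();
        public final Map<String, Integer> version = new HashMap<>();

        public boolean commit(List<Op> ops) {
            for (Op op : ops) { // phase 1: check all preconditions
                int v = version.getOrDefault(op.path, -1);
                if (op.expectedVersion != -1 && op.expectedVersion != v) return false;
            }
            for (Op op : ops) { // phase 2: apply, knowing none can fail
                data.put(op.path, op.value);
                version.merge(op.path, 1, Integer::sum);
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Store s = new Store();
        boolean ok = s.commit(Arrays.asList(new Op("/a", "1", -1), new Op("/b", "2", -1)));
        // a version mismatch on /a rejects the batch; /a keeps "1"
        boolean bad = s.commit(Arrays.asList(new Op("/a", "9", 5), new Op("/b", "9", -1)));
        System.out.println(ok + " " + bad + " " + s.data.get("/a")); // prints true false 1
    }
}
```

Because the whole batch arrives at once, the server never blocks holding a half-open transaction, which is exactly what distinguishes this from the begin/commit style Ted rules out.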
Re: Solitication for logging/debugging requirements
awesome! that would be great, ivan. i'm sure pat has some more concrete suggestions, but one simple thing to do is to run the unit tests and look at the log messages that get output. there are a couple of categories of things that need to be fixed (this is in no way exhaustive): 1) messages that have useful information, but only if you look in the code to figure out what they mean. there are some leader election messages that fall into this category. it would be nice to clarify them. 2) error messages that really aren't errors. when shutting down, for example, there are a bunch of errors that are expected but still logged. 3) misclassified error levels. welcome aboard! ben On 03/29/2010 10:07 AM, Ivan Kelly wrote: Hi, I'm going to be using Zookeeper quite extensively for a project in a few weeks, but development hasn't kicked off yet. This means I have some time on my hands, and I'd like to get familiar with zookeeper beforehand, perhaps by writing some tools to make debugging problems with it easier, so as to save myself some time in the future. Problem is, I haven't had to debug many zookeeper problems yet, so I don't know where the pain points are. So, without further ado: - Are there any places where logging is deficient and sorely needs improvement? - Could current logs be improved or presented in a more readable fashion? - Would some form of log visualisation be useful (for example, something approximating a sequence diagram)? Feel free to suggest anything which the list above doesn't allude to which you think would be helpful. Cheers, Ivan
Re: cluster fails to start - broken snapshot?
we have updated ZOOKEEPER-713 with much more detail, but the bottom line is that the Invalid snapshot was caused by an OutOfMemoryError. this turns out not to be a problem, since we recover using an older snapshot. there are other things happening that are the real causes of the problem. see the jira for details. thanx ben On 03/18/2010 09:16 AM, Łukasz Osipiuk wrote: Hi guys, today we experienced another problem with our zookeeper installation. Due to the large attachments I created a jira issue for it, even though it is more a question than a bug report. https://issues.apache.org/jira/browse/ZOOKEEPER-713 Description below: Today we had a major failure in our production environment. The machines in the zookeeper cluster went wild and all clients got disconnected. We tried to restart the whole zookeeper cluster but it got stuck in the leader election phase. Calling the stat command on any machine in the cluster resulted in a 'ZooKeeperServer not running' message. In one of the logs I noticed an 'Invalid snapshot' message which disturbed me a bit. We did not manage to make the cluster work again with its data. We deleted all version-2 directories on all nodes and then the cluster started up without problems. Is it possible that snapshot/log data got corrupted in a way which made the cluster unable to start? Fortunately we could rebuild the data we store in zookeeper, as we use it only for locks and most of the nodes are ephemeral. I am attaching the contents of the version-2 directory from all nodes and the server logs. The source problem occurred some time before 15:00. The first cluster restart happened at 15:03. At some point later we experimented with deleting the version-2 directory, so I would not look at the following restarts because they can be misleading due to our actions. I am also attaching zoo.cfg. Maybe something is wrong there. As I now look into the logs I see a read timeout during the initialization phase after 20 secs (initLimit=10, tickTime=2000). Maybe all I have to do is increase one or the other. which one?
Are there any downsides to increasing tickTime? Best regards, Łukasz Osipiuk PS. due to the attachment size limit I used split. to untar use cat nodeX-version-2.tgz-* | tar -xz
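For reference, the 20-second read timeout Łukasz observes is exactly initLimit multiplied by tickTime, since the limits are expressed in ticks. A zoo.cfg sketch of the arithmetic (values illustrative, not a recommendation for this cluster; note that raising tickTime also scales every other tick-based setting, which is why bumping initLimit alone is usually the safer knob):

```
tickTime=2000    # one tick = 2000 ms
initLimit=10     # initial connect/sync window: 10 ticks * 2000 ms = 20 s
# to widen only the initial window (e.g. for large snapshots):
# initLimit=30   # 30 ticks * 2000 ms = 60 s
```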
Re: syncLimit explanation needed?
yes, it means in sync with the leader. syncLimit governs the timeout when a follower is actively following a leader. initLimit is the initial connection timeout. because there is the potential for more data that needs to be transmitted during the initial connection, we want to be able to manage the two timeouts differently. ben On 03/18/2010 11:48 AM, César Álvarez Núñez wrote: Hi all, I would like to get a better understanding of the syncLimit configuration property. According to the Administration Guide: Amount of time, in ticks (see tickTime), to allow followers to sync with ZooKeeper. If followers fall too far behind a leader, they will be dropped. Does ...to sync with ZooKeeper mean ...to sync with the Leader? In that case, what is the difference from initLimit? BR, /César.
Re: permanent ZSESSIONMOVED
do you ever use zookeeper_init() with the clientid field set to something other than null? ben On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote: Hi everyone! I am writing to this group because recently we have been getting some strange errors with our production zookeeper setup. From time to time we observe that our client application (C++ based) disconnects from zookeeper (session state is changed to 1) and reconnects (state changed to 3). This itself is not a problem - usually the application continues to run without problems after the reconnect. But from time to time, after the above happens, all subsequent operations start to return the ZSESSIONMOVED error. To make it work again we have to restart the application (which creates a new zookeeper session). I noticed that 3.2.0 introduced a bug http://issues.apache.org/jira/browse/ZOOKEEPER-449 but we are using zookeeper v. 3.2.2. I just noticed that the app was compiled against the 3.2.0 library, but the patches fixing bug 449 did not touch the C client lib, so I believe our problems are not related to that.
In the zookeeper logs, at the moment the problem with the client application started, I have:

node1:
2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to /10.1.112.61:37197 lastZxid 42992576502
2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session 0x324dcc1ba580085
2010-03-16 14:21:49,443 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init of 0x324dcc1ba580085 valid:true
2010-03-16 14:21:49,443 - WARN [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 14:21:49,444 - INFO [NIOServerCxn.Factory:2181:nioserverc...@857] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181 remote=/10.1.112.61:37197]

node2:
2010-03-16 14:21:40,580 - WARN [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 14:21:40,581 - INFO [NIOServerCxn.Factory:2181:nioserverc...@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181 remote=/10.1.112.61:60693]
2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to /10.1.112.61:48336 lastZxid 42992576502
2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session 0x324dcc1ba580085
2010-03-16 14:21:49,439 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init of 0x324dcc1ba580085 valid:true

node3:
2010-03-16 02:14:48,961 - WARN [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 02:14:48,962 - INFO [NIOServerCxn.Factory:2181:nioserverc...@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181 remote=/10.1.112.61:57309]

and then lots of entries like this:

2010-03-16 02:14:54,696 - WARN [ProcessThread:-1:preprequestproces...@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49 zxid:0xfffe txntype:unknown /locks/9871253/lock-8589943989-
org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
2010-03-16 14:22:06,428 - WARN [ProcessThread:-1:preprequestproces...@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603 zxid:0xfffe txntype:unknown /locks/1665960/lock-8589961006-
org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)

To work around the disconnections I am going to increase the session timeout from 5 to 15 seconds, but even if it helps it is just a workaround. Do you have an idea where the source of my problem is? Regards, Łukasz Osipiuk
Re: permanent ZSESSIONMOVED
weird, this does sound like a bug. do you have a reliable way of reproducing the problem? thanx ben On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote: nope. I always pass 0 as clientid. Łukasz On Tue, Mar 16, 2010 at 16:20, Benjamin Reed br...@yahoo-inc.com wrote: do you ever use zookeeper_init() with the clientid field set to something other than null? ben [...]
Re: Managing multi-site clusters with Zookeeper
it is a bit confusing, but initLimit is the timer that is used when a follower connects to a leader. there may be some state transfers involved to bring the follower up to speed, so we need to be able to allow a little extra time for the initial connection. after that we use syncLimit to figure out if a leader or follower is dead. a peer (leader or follower) is considered dead if syncLimit ticks go by without hearing from the other machine. (this is after the initial connection has been made.) please open a jira to make the text a bit more explicit. feel free to add suggestions :) thanx ben On 03/15/2010 04:17 AM, Michael Bauland wrote: Hi Patrick, I'm also setting up a Zookeeper ensemble across three different locations and I've got some questions regarding the parameters as specified on the page you mentioned: That's controlled by the tickTime/synclimit/initlimit/etc.. see more about this in the admin guide: http://bit.ly/c726DC - What's the difference between initLimit and syncLimit? For initLimit it says this is the time to allow followers to connect and sync to a leader, and syncLimit is the time to allow followers to sync with ZooKeeper. To me this sounds very similar, since ZooKeeper in the second definition probably means the ZooKeeper leader, doesn't it? - When I connect with a client to the Zookeeper ensemble I provide the three IP addresses of my three Zookeeper servers. Does the client then choose one of them arbitrarily or will it always try to connect to the first one first? I'm asking since I would like to have my clients first try to connect to the local Zookeeper server and only if that fails (for whatever reason, maybe it's down) should they try to connect to one of the servers at a different physical location. You'll want to increase from the defaults since those are typically for high performance interconnect (ie within colo). You are correct though, much will depend on your env. and some tuning will be involved.
Do you have any suggestions for the parameters? So far I left tickTime at 2 sec and increased initLimit and syncLimit to 30 (i.e., one minute). Our sites are connected with 1Gbit to the Internet, but of course we have no influence on what's in between. The data managed by zookeeper is quite large (snapshots are 700 MByte, but they may increase in the future). Thanks for your help, Michael
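Since both limits are expressed in ticks, the wall-clock windows in Michael's setup follow from simple arithmetic. A minimal sketch (the class and method names are made up for illustration):

```java
// Sketch of the timeout arithmetic discussed in the thread: limits are
// counted in ticks, so the wall-clock window is limit * tickTime.
public class QuorumTimeouts {
    public static long windowMillis(int limitTicks, int tickTimeMillis) {
        return (long) limitTicks * tickTimeMillis;
    }

    public static void main(String[] args) {
        int tickTime = 2000;                 // Michael's tickTime: 2 s per tick
        int initLimit = 30, syncLimit = 30;  // his raised values
        System.out.println(windowMillis(initLimit, tickTime)); // 60000 ms to connect + sync
        System.out.println(windowMillis(syncLimit, tickTime)); // 60000 ms while following
    }
}
```

With a 700 MB snapshot crossing the open Internet, it is the initLimit window (the state-transfer phase ben describes) that most likely needs the headroom.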
RE: When session expired event fired?
i was looking through the docs to see if we talk about handling session expired, but i couldn't find anything. we should probably open a jira to add it to the docs, unless i missed something. did i? ben -Original Message- From: Mahadev Konar [mailto:maha...@yahoo-inc.com] Sent: Monday, February 08, 2010 2:43 PM To: zookeeper-user@hadoop.apache.org Subject: Re: When session expired event fired? Hi, a zookeeper client does not expire a session on its own; it cannot declare the session expired unless and until it is able to connect to one of the servers and hear that from them. In your case, if you kill all the servers, the client is not able to connect to any of them and will keep trying to connect to the three servers. It needs to hear from a server to know whether the session is expired or not. Does that help? Thanks mahadev On 2/8/10 2:37 PM, neptune opennept...@gmail.com wrote: Hi all. I have a question. I started zookeeper (3.2.2) on three servers. When is the session expired event fired in the following code? I expected that if the client can't connect to a server (disconnected) for the session timeout, zookeeper fires a session expired event. I killed the three zookeeper servers sequentially. The client retried connecting to the zookeeper servers, but the Expired event never occurred.

    class WatcherTest implements Watcher {
        private ZooKeeper zk;

        public static void main(String[] args) throws Exception {
            new WatcherTest().exec();
        }

        private WatcherTest() throws Exception {
            zk = new ZooKeeper("server1:2181,server2:2181,server3:2181", 10 * 1000, this);
        }

        private void exec() {
            while (true) {
                // do something
            }
        }

        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.None) {
                switch (event.getState()) {
                case SyncConnected:
                    System.out.println("ZK SyncConnected");
                    break;
                case Disconnected:
                    System.out.println("ZK Disconnected");
                    break;
                case Expired:
                    System.out.println("ZK Session Expired");
                    System.exit(0);
                    break;
                }
            }
        }
    }
Re: how to handle re-add watch fails
sadly, connectionloss is the really ugly part of zookeeper! it is a pain to deal with. i'm not sure we have a best practice, but i can tell you what i do :) ZOOKEEPER-22 is meant to alleviate this problem. i usually use the async API when handling the watch callback. in the completion function, if there is a connection loss, i issue another async getChildren to retry. this avoids blocking the caller the way the synchronous retry that eric alluded to does, but the behavior is effectively the same: you retry the request. you don't need to worry about multiple watches being added, colin. zookeeper keeps track of which watchers have registered which watches and will not register duplicate watches for the same watcher. (hopefully you can parse that :) ben Colin Goodheart-Smithe wrote: We are having similar problems to this. At the moment we wrap ZooKeeper in a class which retries requests on KeeperException.ConnectionLoss to avoid no watcher being added, but we are worried that this may result in multiple watchers being added if the watcher is successfully added but the server returns a connection loss. Colin -Original Message- From: Eric Bowman [mailto:ebow...@boboco.ie] Sent: 01 February 2010 10:22 To: zookeeper-user@hadoop.apache.org Subject: Re: how to handle re-add watch fails I was surprised to not get a response to this ... is this a no-brainer? Too hard to solve? Did I not express it clearly? Am I doing something dumb? :) Thanks, Eric On 01/25/2010 01:05 PM, Eric Bowman wrote: I'm curious, what is the best practice for how to handle the case where re-adding a watch inside a Watcher.process callback fails? I keep stumbling upon the same kind of thing, and the possibility of race conditions or undefined behavior keeps troubling me. Maybe I'm missing something. Suppose I have a callback like:

    public void process(WatchedEvent watchedEvent) {
        if (watchedEvent.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                // ... do stuff ...
            } catch (Throwable e) {
                log.error("Could not do stuff!", e);
            }
            try {
                zooKeeperManager.watchChildren(zPath, this);
            } catch (InterruptedException e) {
                log.info("Interrupted adding watch -- shutting down?");
                return;
            } catch (KeeperException e) {
                // oh crap, now what?
            }
        }
    }

(In this case, watchChildren is just calling getChildren and passing the watcher in.) It occurs to me I could get more and more complicated here: I could wrap watchChildren in a while loop until it succeeds, but that seems kind of rude to the caller. Plus what if I get a KeeperException.SessionExpiredException or a KeeperException.ConnectionLossException? How do I handle those in this loop? Or I could send some other thread a message that it needs to keep trying until the watch has been re-added ... but ... yuck. I would very much like to just set up this watch once and have ZooKeeper make sure it keeps firing until I tear down ZooKeeper -- this logic seems tricky for clients, and quite error-prone and full of race conditions. Any thoughts? Thanks, Eric
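The retry-on-connection-loss pattern from ben's answer can be sketched generically. Everything below (FlakyOp, ConnectionLossException, withRetries) is a made-up stand-in, not the real ZooKeeper API; the point is only the shape of the logic: a recoverable connection loss simply re-issues the same request.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Generic sketch of "retry the request on connection loss". In real code the
// op would be an async getChildren re-issued from its completion callback;
// here a bounded loop over a stand-in operation keeps the sketch runnable.
public class RetryDemo {
    public static class ConnectionLossException extends Exception {}

    public interface FlakyOp {
        void run() throws ConnectionLossException;
    }

    // Re-issue the op on connection loss, up to maxAttempts times.
    public static boolean withRetries(FlakyOp op, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            try {
                op.run();
                return true;
            } catch (ConnectionLossException e) {
                // connection loss is recoverable: just retry the same request
            }
        }
        return false; // give up; in real code this is where you'd surface an error
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // fails twice with connection loss, then succeeds
        FlakyOp op = () -> {
            if (calls.incrementAndGet() < 3) throw new ConnectionLossException();
        };
        System.out.println(withRetries(op, 5)); // true
    }
}
```

Note that, per ben's point, re-issuing the watch-setting call repeatedly is safe: the server deduplicates watches per watcher, so retries cannot stack up duplicate notifications. Session expiration, unlike connection loss, is not recoverable this way and needs a new handle.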
Re: Q about ZK internal: how commit is being remembered
henry is correct. just to state another way, Zab guarantees that if a quorum of servers have accepted a transaction, the transaction will commit. this means that if less than a quorum of servers have accepted a transaction, we can commit or discard. the only constraint we have in choosing is ordering. we have to decide which partially accepted transactions are going to be committed and which discarded before we propose any new messages so that ordering is preserved. ben Henry Robinson wrote: Hi - Note that a machine that has the highest received zxid will necessarily have seen the most recent transaction that was logged by a quorum of followers (the FIFO property of TCP again ensures that all previous messages will have been seen). This is the property that ZAB needs to preserve. The idea is to avoid missing a commit that went to a node that has since failed. I was therefore slightly imprecise in my previous mail - it's possible for only partially-proposed proposals to be committed if the leader that is elected next has seen them. Only when another proposal is committed instead must the original proposal be discarded. I highly recommend Ben Reed's and Flavio Junqueira's LADIS paper on the subject, for those with portal.acm.org access: http://portal.acm.org/citation.cfm?id=1529978 Henry On 27 January 2010 21:52, Qian Ye yeqian@gmail.com wrote: Hi Henry: According to your explanation, *ZAB makes the guarantee that a proposal which has been logged by a quorum of followers will eventually be committed* , however, the source code of Zookeeper, the FastLeaderElection.java file, shows that, in the election, the candidates only provide their zxid in the votes, the one with the max zxid would win the election. I mean, it seems that no check has been made to make sure whether the latest proposal has been logged by a quorum of servers. In this situation, the zookeeper would deliver a proposal, which is known as a failed one by the client. 
Imagine this scenario: a zookeeper cluster with 5 servers, where the Leader receives only 1 ack for proposal A; after a timeout, the client is told that the proposal failed. At this moment all servers restart due to a power failure. The server that has the log of proposal A would become the leader, yet the client was told that proposal A failed. Do I misunderstand this? On Wed, Jan 27, 2010 at 10:37 AM, Henry Robinson he...@cloudera.com wrote: Qing - That part of the documentation is slightly confusing. The elected leader must have the highest zxid that has been written to disk by a quorum of followers. ZAB makes the guarantee that a proposal which has been logged by a quorum of followers will eventually be committed. Conversely, any proposals that *don't* get logged by a quorum before the leader sending them dies will not be committed. One of the ZAB papers covers both these situations - making sure proposals are committed or skipped at the right moments. So you get the neat property that leader election can be live in exactly the case where the ZK cluster is live. If a quorum of peers isn't available to elect the leader, the resulting cluster won't be live anyhow, so it's ok for leader election to fail. FLP impossibility isn't actually strictly relevant for ZAB, because FLP requires that message reordering is possible (see all the stuff in that paper about non-deterministically drawing messages from a potentially deliverable set). TCP FIFO channels don't reorder, so they provide the extra signalling that ZAB requires. cheers, Henry 2010/1/26 Qing Yan qing...@gmail.com Hi, I have a question about how zookeeper *remembers* a commit operation. According to http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperInternals.html#sc_summary quote The leader will issue a COMMIT to all followers as soon as a quorum of followers have ACKed a message. Since messages are ACKed in order, COMMITs will be sent by the leader as received by the followers in order. COMMITs are processed in order.
Followers deliver a proposal's message when that proposal is committed. /quote My question is: will the leader wait for the COMMIT to be processed by a quorum of followers before considering the COMMIT a success? From the documentation it seems that the leader handles COMMIT asynchronously and doesn't expect confirmation from followers. In the extreme case, what happens if the leader issues a COMMIT to all followers and crashes immediately, before the COMMIT message can go out on the network? How does the system remember that the COMMIT ever happened? Actually this is related to the leader election process: quote ZooKeeper messaging doesn't care about the exact method of electing a leader as long as the following holds: - The leader has seen the highest zxid of all the followers. - A quorum of servers have committed to following the leader. Of these two
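The quorum arithmetic underlying this thread can be stated directly in code. A minimal sketch (names are illustrative): a transaction ACKed by a strict majority must eventually commit; one ACKed by fewer servers may be either committed or discarded during recovery, subject only to the ordering constraint ben describes.

```java
// Sketch of Zab's quorum rule: a strict majority of the ensemble must accept
// a transaction before its outcome is pinned down.
public class QuorumDemo {
    public static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1; // strict majority
    }

    // true iff the transaction is guaranteed to commit during recovery
    public static boolean mustCommit(int acks, int ensembleSize) {
        return acks >= quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        System.out.println(quorumSize(5));    // 3
        System.out.println(mustCommit(1, 5)); // false: may be committed or discarded
        System.out.println(mustCommit(3, 5)); // true: guaranteed to commit
    }
}
```

This matches Qing's 5-server scenario: with only 1 ack, proposal A sits in the "may commit or discard" zone, which is exactly why the new leader is free to commit it even though the client saw a timeout; a timeout means "unknown outcome", not "failed".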
Re: ZAB kick Paxos butt?
hi Qing, i'm glad you like the page and Zab. yes, we are very familiar with Paxos. that page is meant to show a weakness of Paxos and a design point for Zab. it is not to say Paxos is not useful. Paxos is used in the real world in production systems. sometimes there are no order dependencies between messages, so Paxos is fine. in cases where order is important, multiple messages are batched into a single operation and only one operation is outstanding at a time. (i believe that this is what Chubby does, for example.) this is the solution you allude to: wait for 27 to commit before 28 is issued. for ZooKeeper we do have order dependencies, and we wanted to have multiple operations in progress at various stages of the pipeline to allow us to lower latencies as well as increase our bandwidth utilization, which led us to Zab. ben Qing Yan wrote: Hello, anyone familiar with the Paxos protocol here? I was doing some comparison of ZAB vs Paxos... first of all, ZAB's FIFO based protocol is really cool! http://wiki.apache.org/hadoop/ZooKeeper/PaxosRun mentioned the inconsistency case for Paxos (the state change B depends upon A, but A was not committed). In the Paxos made simple paper, the author suggests filling the GAP (lost state machine changes) with a NO OP operation. Now I have some serious doubts about how Paxos could be useful in the real world. yeah, you do reach consensus - albeit with content that is inconsistent/corrupted!? E.g. on the wiki page, why does the Paxos state machine allow firing off 27 and 28 concurrently when there is actually a dependency? Shouldn't you wait for instance 27 to be committed before starting 28? Did I miss something? Thanks for the enlightenment! Cheers Qing
RE: Does zookeeper support listening on a specified address?
no. please open a jira as a new feature request. sent from my droid -Original Message- From: Steve Chu [stv...@gmail.com] Received: 12/21/09 3:44 AM To: zookeeper-user@hadoop.apache.org [zookeeper-u...@hadoop.apache.org] Subject: Does zookeeper support listening on a specified address? Hi all, I only see the clientPort option in the configuration. Does zookeeper support binding to a specified network address? In my box multiple network interfaces are present and I want to bind to a specific one. I checked src/java/main/org/apache/zookeeper/server/ServerConfig.java; there seems to be no server address option. Best Regards, Steve
Re: Share Zookeeper instance and Connection Limits
I agree with Ted, it doesn't seem like a good idea to do in practice. however, you do have a couple of options if you are just testing things: 1) use tmpfs 2) you can set forceSync to no in the configuration file to disable syncing to disk before acknowledging responses 3) if you really want to make the disk write go away, you can modify the SyncRequestProcessor in the code ben Ted Dunning wrote: I think that this would be a very bad idea because of restart issues. As it stands, ZK reads from disk snapshots on startup to avoid moving as much data from other members of the cluster. You might consider putting the snapshots and log on a tmpfs file system if you really, really want this. On Wed, Dec 16, 2009 at 1:08 PM, Thiago Borges thbor...@gmail.com wrote: Can a Zookeeper ensemble run only in memory rather than writing to both memory and disk? Does this make sense, given that I have a highly reliable system? (Of course at some point we need a dump to shut down and restart the entire system.) Well, does disk IO or the network limit the throughput first? Thanks for your quick response. I'm studying Zookeeper in my master's thesis, for coordinating distributed index structures.
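Options 1 and 2 from ben's list amount to one-line config changes. A zoo.cfg sketch, for test environments only (the tmpfs path is an example, and disabling forceSync risks losing acknowledged updates on a crash):

```
# test environments only -- not for production
forceSync=no                  # option 2: acknowledge writes without fsync
# dataDir=/dev/shm/zookeeper  # option 1: keep snapshots/logs on a tmpfs mount
```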
Re: size of data / number of znodes
there aren't any limits on the number of znodes; it's just limited by your memory. there are two things (probably more :) to keep in mind: 1) the 1M limit also applies to the children list. you can't grow the list of children to more than 1M (the sum of the names of all of the children), otherwise you cannot do a getChildren(). so, yes, you need to do some bucketing to keep the number of children to something reasonable. assuming your names will be less than 100 bytes, you probably want to limit the number of children to 10,000. 2) since there are times when you need to do a state transfer between servers (dump all the state from one to the other to bring it online), it may take a while depending on your network speed. you may need to bump up the default initLimit, so make sure you do some benchmarking on your platform to test your configuration parameters. ben Michael Bauland wrote: Hello, I'm new to the Zookeeper project and wondering whether our use case is a good one for Zookeeper. I read the documentation but couldn't find an answer. At some point it says that A common property of the various forms of coordination data is that they are relatively small: measured in kilobytes. The ZooKeeper client and the server implementations have sanity checks to ensure that znodes have less than 1M of data. I couldn't find any limits on the number of znodes used, only that each znode should contain only little data. We were planning to use a million znodes (each containing a few hundred bytes of data). Would this use case be acceptable for Zookeeper? And if so, does it matter if we have a flat hierarchy (i.e., all nodes have the root node as their direct ancestor) or should we introduce some (artificial) hierarchy levels to have a more tree-like structure? Thanks in advance for your answer. Cheers, Michael
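The bucketing in point 1 can be sketched as a deterministic mapping from a child name to a bucket parent. BucketDemo and its naming scheme are made up for illustration; with 100 buckets, a million children average 10,000 per parent, which is within the limit ben suggests.

```java
// Sketch of bucketing znodes so no single parent's children list approaches
// the 1M getChildren() limit. The bucket naming scheme is illustrative.
public class BucketDemo {
    // Map a child name deterministically into one of numBuckets sub-parents.
    public static String bucketPath(String parent, String name, int numBuckets) {
        int bucket = Math.floorMod(name.hashCode(), numBuckets); // non-negative
        return parent + "/bucket-" + bucket + "/" + name;
    }

    public static void main(String[] args) {
        // 1,000,000 nodes over 100 buckets ~= 10,000 children per parent
        System.out.println(bucketPath("/items", "item-42", 100));
    }
}
```

Because the mapping is a pure function of the name, readers and writers agree on a node's location without any extra lookup.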
Re: Zookeeper Presentation
there are a bunch of presentations you can grab at http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations ben Mark Vigeant wrote: Hey Everyone, I'm supposed to give a presentation next week about the basic functionality and uses of zookeeper. I was wondering if anybody out there had: 1) A similar presentation that I could use as a starting point for inspiration 2) A cool project they worked on in zookeeper that I can cite as an example of the strength and usefulness of zookeeper. I'm going to also show them the example code and run a few things through the terminal. I am also an HBase user so that is also something I can use to talk about. Thanks a lot for your time and help! Mark Vigeant RiskMetrics Group, Inc.
Re: Struggling with a simple configuration file.
right at the beginning of http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperStarted.html it shows you the minimum standalone configuration. that doesn't explain the 0 id. i'd like to try and reproduce it. do you have an empty data directory with a single file, myid, set to 1? ben Leonard Cuff wrote: I've been developing for ZooKeeper for a couple months now, recently running in a test configuration with 3 ZooKeeper servers. I'm running 3.2.1 with no problems. Recently I tried to move to a single server configuration for the development team environment, but couldn't get the configuration to work. I get the error java.lang.RuntimeException: My id 0 not in the peer list This would seem to imply that the myid file is set to zero. But ... it's set to 1. What's puzzling to me is that my original configuration of servers was this: server.1=ind104.an.dev.fastclick.net:2182:2183 --- The machine I'm trying to run standalone on. server.2=build101.an.dev.fastclick.net:2182:2183 server.3=cmedia101.an.dev.fastclick.net:2182:2183 I just removed the last two lines, and ran zkServer.sh start. It fails with the described log message. (Full log given below.) When I put the server.2 and server.3 lines back in, it works fine, and is following the build101 machine. I decided to try changing server.1 to server.0, and also changed the myid file contents from 1 to zero. I get a very different error scenario: a continuously-occurring NullPointerException: 2009-10-09 04:22:36,284 - WARN [QuorumPeer:/0.0.0.0:2181:quorump...@490] - Unexpected exception java.lang.NullPointerException at org.apache.zookeeper.server.quorum.FastLeaderElection.totalOrderPredicate(FastLeaderElection.java:466) at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:635) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488) I'm at a loss to know where I've gone astray. Thanks in advance for any and all help.
Leonard --- the first log 2009-10-09 04:08:58,769 - INFO [main:quorumpeercon...@80] - Reading configuration from: /vcm/home/sandbox/ticket_161758-1/vcm/component/zookeeper/conf/zoo.cfg.dev 2009-10-09 04:08:58,795 - INFO [main:quorumpeerm...@118] - Starting quorum peer 2009-10-09 04:08:58,845 - FATAL [main:quorumpeerm...@86] - Unexpected exception, exiting abnormally java.lang.RuntimeException: My id 0 not in the peer list at org.apache.zookeeper.server.quorum.QuorumPeer.startLeaderElection(QuorumPeer.java:333) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:314) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:137) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:102) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:75) -- the second log 2009-10-09 04:22:36,284 - WARN [QuorumPeer:/0.0.0.0:2181:quorump...@490] - Unexpected exception java.lang.NullPointerException at org.apache.zookeeper.server.quorum.FastLeaderElection.totalOrderPredicate(FastLeaderElection.java:466) at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:635) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488) 2009-10-09 04:22:36,285 - INFO [QuorumPeer:/0.0.0.0:2181:quorump...@487] - LOOKING 2009-10-09 04:22:36,285 - INFO [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@579] - New election: 12 2009-10-09 04:22:36,285 - INFO [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@618] - Notification: 0, 12, 43050, 0, LOOKING, LOOKING, 0 2009-10-09 04:22:36,285 - WARN [QuorumPeer:/0.0.0.0:2181:quorump...@490] - Unexpected exception java.lang.NullPointerException at org.apache.zookeeper.server.quorum.FastLeaderElection.totalOrderPredicate(FastLeaderElection.java:466) at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:635) at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488) 2009-10-09 04:22:36,286 - INFO [QuorumPeer:/0.0.0.0:2181:quorump...@487] - LOOKING 2009-10-09 04:22:36,286 - INFO [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@579] - New election: 12 2009-10-09 04:22:36,286 - INFO [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@618] - Notification: 0, 12, 43051, 0, LOOKING, LOOKING, 0 2009-10-09 04:22:36,286 - WARN [QuorumPeer:/0.0.0.0:2181:quorump...@490] - Unexpected exception java.lang.NullPointerException at org.apache.zookeeper.server.quorum.FastLeaderElection.totalOrderPredicate(FastLeaderElection.java:466) at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:635) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488) 2009-10-09 04:22:36,286 - INFO [QuorumPeer:/0.0.0.0:2181:quorump...@487] - LOOKING 2009-10-09 04:22:36,287 - INFO
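Ben's pointer to the Getting Started guide boils down to this: a standalone configuration has no server.N lines at all, so neither the peer list nor the myid file comes into play. A minimal sketch (paths are placeholders, not from the original thread):

```
# zoo.cfg for a single standalone server: no server.N entries,
# so the myid file is not consulted
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
```

With server.N lines left in, the server here went into quorum mode (per the stack trace) and then required dataDir/myid to match one of the listed ids.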
Re: The idea behind 'myid'
can you clarify what you are asking for? are you just looking for motivation? or are you trying to find out how to use it? the myid file just has the unique identifier (number) of the server in the cluster. that number is matched against the id in the configuration file. there isn't much to say about it: http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperStarted.html ben Ørjan Horpestad wrote: Hi! Can someone point me to a site (or please explain) where I can read about the use of the myid file for configuring the id of the ZooKeeper servers? I'm sure there is a good reason for using this approach, but it is the first time I have come across this type of non-automatic way of administrating replicas. Regards, Orjan
Re: The idea behind 'myid'
you and ted are correct. the id gives zookeeper a stable identifier to use even if the ip address changes. if the ip address doesn't change, we could use that, but we didn't want to make that a built-in assumption. if you really do have a rock solid ip address, you could make a wrapper startup script that creates the myid file based on the ip address before starting up. i gotta say though, i've found that such assumptions are often found to be invalid. ben Eric Bowman wrote: Another way of doing it, though, would be to tell each instance which IP to use at startup. That way the config can be identical for all users, and there can be whatever logic is required to figure out the right IP address, in the place where logic is executing anyhow. I do agree that maintaining the myid file is awkward compared to other approaches that are working elsewhere. It's not really clear what purpose the myid serves except to bind an ip address to a running instance. cheers, Eric Ted Dunning wrote: A server doesn't have a unique IP address. Each interface can have 1 or more IP addresses and there can be many interfaces. Furthermore, an IP address can move from one machine to another. 2009/9/25 Ørjan Horpestad orj...@gmail.com Hi Ben Well, I'm just wondering why the server's own unique IP address isn't good enough as a valid identifier; it strikes me as a bit exhausting to manually set the id for each server in the cluster. Or maybe there are some details I'm not getting here :-) Regards, Orjan On Fri, Sep 25, 2009 at 3:56 PM, Benjamin Reed br...@yahoo-inc.com wrote: can you clarify what you are asking for? are you just looking for motivation? or are you trying to find out how to use it? the myid file just has the unique identifier (number) of the server in the cluster. that number is matched against the id in the configuration file. there isn't much to say about it: http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperStarted.html ben Ørjan Horpestad wrote: Hi! 
Can someone point me to a site (or please explain) where I can read about the use of the myid file for configuring the id of the ZooKeeper servers? I'm sure there is a good reason for using this approach, but it is the first time I have come across this type of non-automatic way of administrating replicas. Regards, Orjan
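Ben's wrapper-script idea can be sketched as follows. This is a hypothetical helper, not part of ZooKeeper: it derives the server id from the host's IP address before startup, using the last octet as the id (an assumption made for illustration; any stable mapping would do).

```python
# Hypothetical wrapper logic: derive a ZooKeeper server id from a
# "rock solid" IP address, as Ben suggests, instead of hand-maintaining
# the myid file. Using the last octet as the id is an illustrative
# assumption, not a ZooKeeper convention.
def myid_from_ip(ip: str) -> int:
    last_octet = int(ip.rsplit(".", 1)[1])
    if not 1 <= last_octet <= 255:
        raise ValueError("last octet out of range for a server id")
    return last_octet

def write_myid(ip: str, path: str = "myid") -> None:
    # zkServer reads dataDir/myid as a bare integer
    with open(path, "w") as f:
        f.write(str(myid_from_ip(ip)) + "\n")
```

The corresponding zoo.cfg would then need server.N entries whose ids follow the same mapping, which is exactly the fragile coupling Ben warns about.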
Re: How to expire a session
so you have two problems going on. both have the same root: zookeeper_init returns before a connection and session is established with zookeeper, so you will not be able to fill in myid until a connection is made. you can do something with a mutex in the watcher to wait for a connection, or you could do something simple like: while(zoo_state(zh_1) != ZOO_CONNECTED_STATE) { sleep(1); } myid = *zoo_client_id(zh_1); the second part of the problem is related. you need to make sure you are connected before you do the close. ben Leonard Cuff wrote: In the FAQ, there is a question 4. Is there an easy way to expire a session for testing? And the last part of the answer reads: In the case of testing we want to cause a problem, so to explicitly expire a session an application connects to ZooKeeper, saves the session id and password, creates another ZooKeeper handle with that id and password, and then closes the new handle. Since both handles reference the same session, the close on the second handle will invalidate the session, causing a SESSION_EXPIRED on the first handle. (When it says "creates another ZooKeeper handle", I'm assuming it means to do that by calling zookeeper_init. Is that correct?) Here's my skeleton code, which doesn't work. ... clientid_t myid; clientid_t another_id; zhandle_t *zh_1; zhandle_t *zh_2; zoo_deterministic_conn_order(1); zh_1 = zookeeper_init(servers, watcher, 1, &myid, 0, 0); if (!zh_1) { ...error... } // Catch SIGUSR1 and set the havoc flag if (cry_havoc_and_let_loose_the_dogs_of_war) { memcpy(&another_id, &myid, sizeof(clientid_t)); zh_2 = zookeeper_init(servers, destroy_watcher, 1, &another_id, 0, 0); if (!zh_2) { ...error... } zookeeper_close(zh_2); // Shouldn't I get a session expire shortly after this? } But I don't get a session expire. Can someone tell me what I'm doing wrong? 
TIA, Leonard Leonard Cuff lc...@valueclick.com "This email and any files included with it may contain privileged, proprietary and/or confidential information that is for the sole use of the intended recipient(s). Any disclosure, copying, distribution, posting, or use of the information contained in or attached to this email is prohibited unless permitted by the sender. If you have received this email in error, please immediately notify the sender via return e-mail, telephone, or fax and destroy this original transmission and its included files without reading or saving it in any manner. Thank you."
Re: Start problem of Running Replicated ZooKeeper
The connection refused message, as opposed to no route to host or unknown host, indicates that zookeeper has not been started on the other machines. are the other machines giving similar errors? ben Le Zhou wrote: Hi, I'm trying to install HBase 0.20.0 in fully distributed mode on my cluster. As HBase depends on ZooKeeper, I have to know first how to make ZooKeeper work. I downloaded the 3.2.1 release and installed it on each machine in my cluster. ZooKeeper in standalone mode works well on each machine. I follow the ZooKeeper Getting Started Guide and get the expected output. Then I come to Running Replicated ZooKeeper. On each machine in my cluster (debian-0, debian-1, debian-5), I append the following lines to zoo.cfg, and create in dataDir a myid file which contains the server id (1 for debian-0, 2 for debian-1, 3 for debian-5). server.1=debian-0:2888:3888 server.2=debian-1:2888:3888 server.3=debian-5:2888:3888 then I start the zookeeper server by running bin/zkServer.sh start, and I get the following output: cl...@debian-0:~/zookeeper$ bin/zkServer.sh start JMX enabled by default Using config: /home/cloud/zookeeper-3.2.1/bin/../conf/zoo.cfg Starting zookeeper ... 
STARTED cl...@debian-0:~/zookeeper$ 2009-09-23 15:30:27,976 - INFO [main:quorumpeercon...@80] - Reading configuration from: /home/cloud/zookeeper-3.2.1/bin/../conf/zoo.cfg 2009-09-23 15:30:27,981 - INFO [main:quorumpeercon...@232] - Defaulting to majority quorums 2009-09-23 15:30:28,009 - INFO [main:quorumpeerm...@118] - Starting quorum peer 2009-09-23 15:30:28,034 - INFO [Thread-1:quorumcnxmanager$liste...@409] - My election bind port: 3888 2009-09-23 15:30:28,045 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@487] - LOOKING 2009-09-23 15:30:28,070 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@579] - New election: -1 2009-09-23 15:30:28,075 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@618] - Notification: 1, -1, 1, 1, LOOKING, LOOKING, 1 2009-09-23 15:30:28,075 - WARN [WorkerSender Thread:quorumcnxmana...@336] - Cannot open channel to 2 at election address debian-1/172.20.53.86:3888 java.net.ConnectException: Connection refused at sun.nio.ch.Net.connect(Native Method) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507) at java.nio.channels.SocketChannel.open(SocketChannel.java:146) at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323) at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296) at java.lang.Thread.run(Thread.java:619) 2009-09-23 15:30:28,085 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@642] - Adding vote 2009-09-23 15:30:28,099 - WARN [WorkerSender Thread:quorumcnxmana...@336] - Cannot open channel to 3 at election address debian-5/172.20.14.194:3888 java.net.ConnectException: Connection refused at sun.nio.ch.Net.connect(Native Method) at 
sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507) at java.nio.channels.SocketChannel.open(SocketChannel.java:146) at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323) at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296) at java.lang.Thread.run(Thread.java:619) 2009-09-23 15:30:28,288 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorumcnxmana...@336] - Cannot open channel to 2 at election address debian-1/172.20.53.86:3888 java.net.ConnectException: Connection refused at sun.nio.ch.Net.connect(Native Method) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507) at java.nio.channels.SocketChannel.open(SocketChannel.java:146) at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323) at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:356) at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:603) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488) The terminal keeps outputting the WARN info until I stop the zookeeper server. I googled "zookeeper cannot open channel to at address" and searched the mailing list archives, but got nothing helpful. I need your help, thanks and best regards!
RE: A question about Connection timed out and operation timeout
are you using the single threaded or multithreaded C library? the exceeded deadline message means that our thread was supposed to get control after a certain period, but we got control that many milliseconds late. what is your session timeout? ben From: Qian Ye [yeqian@gmail.com] Sent: Thursday, August 20, 2009 3:17 AM To: zookeeper-user Subject: A question about Connection timed out and operation timeout Hi guys: I met the problem again: an ephemeral node disappeared, and I found it because my application got an operation timeout. My application, which created an ephemeral node at the zookeeper server, printed the following log *WARNING: 08-20 03:09:20: auto * 182894118176 [logid:][reqip:][auto_exchanger_zk_basic.cpp:605]get children fail.[/forum/elect_nodes][-7][operation timeout]* and the Zookeeper client printed the following log (the log level is INFO) 2009-08-19 21:36:18,067:3813(0x9556c520):zoo_i...@log_env@545: Client environment:zookeeper.version=zookeeper C client 3.2.0 606 2009-08-19 21:36:18,067:3813(0x9556c520):zoo_i...@log_env@549: Client environment:host.name=jx-ziyuan-test00.jx.baidu.com 607 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@557: Client environment:os.name=Linux 608 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@558: Client environment:os.arch=2.6.9-52bs 609 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@559: Client environment:os.version=#2 SMP Fri Jan 26 13:34:38 CST 2007 610 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@567: Client environment:user.name=club 611 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@577: Client environment:user.home=/home/club 612 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@log_env@589: Client environment:user.dir=/home/club/user/luhongbo/auto-exchanger 613 2009-08-19 21:36:18,068:3813(0x9556c520):zoo_i...@zookeeper_init@613: Initiating client connection, host=127.0.0.1:2181,127.0.0.1:2182 sessionTimeout=2000 watcher=0x408c56 sessionId=0x0 
sessionPasswd=null context=(nil) flags=0 614 2009-08-19 21:36:18,069:3813(0x41401960):zoo_i...@check_events@1439: initiated connection to server [127.0.0.1:2181] 615 2009-08-19 21:36:18,070:3813(0x41401960):zoo_i...@check_events@1484: connected to server [127.0.0.1:2181] with session id=1232c1688a20093 616 2009-08-20 02:48:01,780:3813(0x41401960):zoo_w...@zookeeper_interest@1335: Exceeded deadline by 520ms 617 2009-08-20 03:08:52,332:3813(0x41401960):zoo_w...@zookeeper_interest@1335: Exceeded deadline by 14ms 618 2009-08-20 03:09:04,666:3813(0x41401960):zoo_w...@zookeeper_interest@1335: Exceeded deadline by 48ms 619 2009-08-20 03:09:09,733:3813(0x41401960):zoo_w...@zookeeper_interest@1335: Exceeded deadline by 24ms 620 *2009-08-20 03:09:20,289:3813(0x41401960):zoo_w...@zookeeper_interest@1335: Exceeded deadline by 264ms* 621 *2009-08-20 03:09:20,295:3813(0x41401960):zoo_er...@handle_socket_error_msg@1388: Socket [127.0.0.1:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 264ms)* 622 *2009-08-20 03:09:20,309:3813(0x41401960):zoo_w...@zookeeper_interest@1335: Exceeded deadline by 284ms* 623 *2009-08-20 03:09:20,309:3813(0x41401960):zoo_er...@handle_socket_error_msg@1433: Socket [127.0.0.1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client* 624 *2009-08-20 03:09:20,353:3813(0x41401960):zoo_i...@check_events@1439: initiated connection to server [127.0.0.1:2181]* 625 *2009-08-20 03:09:20,552:3813(0x41401960):zoo_i...@check_events@1484: connected to server [127.0.0.1:2181] with session id=1232c1688a20093* I don't know why the connection timed out happened at *2009-08-20 03:09:20,295:3813*, and why the server refused to accept the client. Could someone give me any hints? And I'm not sure of the meaning of Exceeded deadline by xxms; I need some help there too. P.S. I used Zookeeper 3.2.0 (Server and C Client API) and ran a stand-alone instance. Thx all~ -- With Regards! 
Ye, Qian Made in Zhejiang University
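To put numbers on Ben's question about the session timeout: the log above uses sessionTimeout=2000. A rough sketch of the arithmetic, assuming (as described elsewhere in this digest) that the client treats the server as down after about 2/3 of the session timeout:

```python
# Back-of-the-envelope for the timeouts in this thread. Assumption: the
# client notices a dead server after roughly 2/3 of the session timeout,
# per the heartbeat behavior described earlier in this digest.
def failure_detect_ms(session_timeout_ms: int) -> int:
    return session_timeout_ms * 2 // 3

# With sessionTimeout=2000 the whole failure-detection budget is ~1333 ms,
# so a 520 ms scheduling stall ("Exceeded deadline by 520ms") consumes
# well over a third of it; with a 9000 ms timeout the budget is 6000 ms.
```

This suggests why such a small session timeout makes the client fragile under the scheduling delays the "Exceeded deadline" messages report.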
Re: Errors when run zookeeper in windows ?
good point david! zhang can you try david's scripts? we should probably commit those. thanx for pointing them out david. ben David Bosschaert wrote: FWIW, I've uploaded some Windows versions of the zookeeper scripts to https://issues.apache.org/jira/browse/ZOOKEEPER-426 a while ago. They run from the ordinary windows shell, so no need for Cygwin or anything like that. I'm using Zookeeper from Windows all the time and they work fine for me. I did notice that the scripts didn't get included in the latest 3.2.0 release. It might be worth putting some Windows scripts in the next release as nothing in Zookeeper is unix specific (except for the scripts ;) Best regards, David 2009/8/19 zhang jianfeng zjf...@gmail.com: Yes,I am using cygwin and JDK 1.6, the command to start HBase is the same as in the get started: bin/zkServer.sh start The following is the whole message: zjf...@zjf ~/zookeeper-3.1.1 $ *bin/zkServer.sh start* JMX enabled by default Starting zookeeper ... STARTED zjf...@zjf ~/zookeeper-3.1.1 $ java.lang.NoClassDefFoundError: Files\Java\jre6\lib\ext\QTJava/zip;D:\Java\lib\hadoop-0/18/0\build\tools:/home/zjffdu/zookeeper-3/1/1/binzookeeper-3/1/1/jar:/home/zjffdu/zookeeper-3/1/1/binlib/junit-4/4/jar:/home/zjffdu/zookeeper-3/1/1/binlib/log4j-1/2/15/jar:/home/zjffdu/zookeeper-3/1/1/binsrc/java/lib/junit-4/4/jar:/home/zjffdu/zookeeper-3/1/1/binsrc/java/lib/log4j-1/2/15/jar Caused by: java.lang.ClassNotFoundException: Files\Java\jre6\lib\ext\QTJava.zip;D:\Java\lib\hadoop-0.18.0\build\tools:.home.zjffdu.zookeeper-3.1.1.binzookeeper-3.1.1.jar:.home.zjffdu.zookeeper-3.1.1.binlib.junit-4.4.jar:.home.zjffdu.zookeeper-3.1.1.binlib.log4j-1.2.15.jar:.home.zjffdu.zookeeper-3.1.1.binsrc.java.lib.junit-4.4.jar:.home.zjffdu.zookeeper-3.1.1.binsrc.java.lib.log4j-1.2.15.jar at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown 
Source) at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClassInternal(Unknown Source) Could not find the main class: Files\Java\jre6\lib\ext\QTJava.zip;D:\Java\lib\hadoop-0.18.0\build\tools:/home/zjffdu/zookeeper-3.1.1/bin/../zookeeper-3.1.1.jar:/home/zjffdu/zookeeper-3.1.1/bin/../lib/junit-4.4.jar:/home/zjffdu/zookeeper-3.1.1/bin/../lib/log4j-1.2.15.jar:/home/zjffdu/zookeeper-3.1.1/bin/../src/java/lib/junit-4.4.jar:/home/zjffdu/zookeeper-3.1.1/bin/../src/java/lib/log4j-1.2.15.jar. Program will exit. $ Thank you Jeff zhang On Tue, Aug 18, 2009 at 12:53 PM, Patrick Hunt ph...@apache.org wrote: you are using java 1.6 right? more detail on the class not found would be useful (is that missing or just not included in your email?) Also the command line you're using to start the app would be interesting. Patrick Mahadev Konar wrote: Hi Zhang, Are you using cygwin? mahadev On 8/17/09 11:23 PM, zhang jianfeng zjf...@gmail.com wrote: Hi all, I tried to run zookeeper in windows, but the following errors appears: /* ** * $ java.lang.NoClassDefFoundError: Files\Java\jre6\lib\ext\QTJava/zip;D:\Java\lib\hadoop-0/18/0\build\tools:/home /zjffdu/zookeeper-3/1/1/binzookeeper-3/1/1/jar:/home/zjffdu/zookeeper-3/1/ 1/binlib/junit-4/4/jar:/home/zjffdu/zookeeper-3/1/1/binlib/log4j-1/2/1 5/jar:/home/zjffdu/zookeeper-3/1/1/binsrc/java/lib/junit-4/4/jar:/home/zjf fdu/zookeeper-3/1/1/binsrc/java/lib/log4j-1/2/15/jar Caused by: java.lang.ClassNotFoundException: Files\Java\jre6\lib\ext\QTJava.zip;D:\Java\lib\hadoop-0.18.0\build\tools:.home .zjffdu.zookeeper-3.1.1.binzookeeper-3.1.1.jar:.home.zjffdu.zookeeper-3.1. 
1.binlib.junit-4.4.jar:.home.zjffdu.zookeeper-3.1.1.binlib.log4j-1.2.1 5.jar:.home.zjffdu.zookeeper-3.1.1.binsrc.java.lib.junit-4.4.jar:.home.zjf fdu.zookeeper-3.1.1.binsrc.java.lib.log4j-1.2.15.jar at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClassInternal(Unknown Source) Could not find the main class: Files\Java\jre6\lib\ext\QTJava.zip;D:\Java\lib\hadoop-0.18.0\build\tools:/home /zjffdu/zookeeper-3.1.1/bin/../zookeeper-3.1.1.jar:/home/zjffdu/zookeeper-3.1. 1/bin/../lib/junit-4.4.jar:/home/zjffdu/zookeeper-3.1.1/bin/../lib/log4j-1.2.1
RE: exist return true before event comes in
I assume you are calling the synchronous version of exists. The callbacks for both the watches and async calls are processed by a callback thread, so the ordering is strict. Synchronous call responses are not queued to the callback thread. (This allows you to make synchronous calls in callbacks without deadlocking.) Thus the effect you are seeing may be due to a backed-up callback queue and/or thread scheduling. ben Sent from my phone. -Original Message- From: Stefan Groschupf s...@101tec.com Sent: Monday, August 03, 2009 9:31 PM To: zookeeper-user@hadoop.apache.org zookeeper-user@hadoop.apache.org Subject: exist return true before event comes in Hi, I'm running into the following problem writing a facade for Zk Client (http://github.com/joa23/zkclient/ ) 1.) Subscribe a watch via exist(path, true) for a path. 2.) Create a persistent node. 3.) Call exist and it returns true. 4.) Zookeeper sends a NodeCreated event. I would expect that the client would get the NodeCreated event before exist returns true. Does anyone have an idea of a pattern that ensures that exist returns false before the event is triggered? Thanks, Stefan
RE: c client header location
Or maybe /usr/local/include/zookeeper but either way c-client-src is weird. Please open a jira. Thanx ben Sent from my phone. -Original Message- From: Michi Mutsuzaki mi...@cs.stanford.edu Sent: Saturday, August 01, 2009 6:15 PM To: zookeeper-user@hadoop.apache.org zookeeper-user@hadoop.apache.org Subject: c client header location Hello, Why do the headers get installed in /usr/local/include/c-client-src? Shouldn't it go to /usr/local/include? Thanks! --Michi
Re: Question about the sequential flag on create.
the create is atomic. we just use a data structure that does not store the list of children in order. ben Erik Holstad wrote: Hey Patrick! Thanks for the reply. I understand all the reasons that you posted above and totally agree that nodes should not be sorted, since you would then have to pay that overhead for every node, even though you might not need or want it. I just thought that it might be possible to create a sequential node atomically, but I guess that is not how it works? Regards Erik
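Since getChildren does not return children in creation order, clients that care about ordering typically sort by the sequence suffix themselves. Sequential znodes get a 10-digit, zero-padded counter appended to the name, so the sort is a sketch like this:

```python
# getChildren() does not return sequence znodes in creation order, so
# sort client-side by the 10-digit numeric suffix that the SEQUENTIAL
# flag appends to each name (e.g. "lock-0000000042").
def by_sequence(children):
    return sorted(children, key=lambda name: int(name[-10:]))
```

This is the usual pattern in lock/queue recipes built on sequential nodes.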
Re: Confused about KeeperState.Disconnected and KeeperState.Expired
sorry to jump in late. if i understand the scenario correctly, you are partitioned from ZK, but you still have access to the NN on which you are holding leases to files. the problem is that even though your ephemeral nodes may time out, you are still holding a lease on the NN, and recovery would go faster if you actually closed the file. right? or is it deeper than that? can you open a file in such a way that you stomp the lease? or make sure that the lease timeout is smaller than the session timeout and only renew if you are still connected to ZK? thanx ben Jean-Daniel Cryans wrote: If the machine was completely partitioned, as far as I know, it would lose its lease, so the only thing we have to make sure of is clearing the state of the region server by doing a restart so that it's ready to come back into the cluster. If ZK is down but the rest is up, closing the files in HDFS should ensure that we lose a minimum of data, if not losing any. I think that in a multi-rack setup it is possible to not be able to talk to ZK but to be able to talk to the Namenode, as machines can be anywhere. Especially in HBase 0.20, the master can fail over to any node that has a backup Master ready. So in that case, the region server should consider itself gone from the cluster and close any connection it has and restart. Those are very legitimate questions Gustavo, thanks for asking. J-D On Wed, Jun 24, 2009 at 3:38 PM, Gustavo Niemeyer gust...@niemeyer.netwrote: Ben's opinion is that it should not belong in the default API but in the common client that another recent thread was about. My opinion is just that I need such a functionality, wherever it is. Understood, sorry. I just meant that it feels like something that would likely be useful to other people too, so it might have a role in the default API to ensure it gets done properly, considering the details that Ben brought up. 
If the node gets the exception (or has its own timer), as I wrote, it will shut itself down to release HDFS leases as fast as possible. If ZK is really down and it's not a network partition, then HBase is down, and this is fine because it won't be able to work anyway. Right, that's mostly what I was wondering. I was pondering under which circumstances the node would be unable to talk to the ZooKeeper server but would still be holding the HDFS lease in a way that prevented the rest of the system from going on. If I understand what you mean, if ZooKeeper is down entirely, HBase would be down for good. If the machine was partitioned off entirely, the HDFS side of things will also be disconnected, so shutting the node down won't help the rest of the system recover. -- Gustavo Niemeyer http://niemeyer.net
Re: Some thoughts on Zookeeper after using it for a while in the CXF/DOSGi subproject
this is great to hear. it's great to see siblings playing together ;) * In CXF we use Maven to build everything. To depend on Zookeeper we need to pull it in from a Maven repository. I couldn't find Zookeeper in any main Maven repos, so currently we're pulling it in from http://people.apache.org/~chirino/zk-repo (a private repo), which is not ideal. Would there be any chance of getting the zookeeper.jar file deployed to one of the main Maven repo's (e.g. http://repo2.maven.org/maven2/)? yeah this is an increasing thorn in our side. some of us would like to convert to maven, but we are tied to the hadoop build process since we reuse all of their build/test infrastructure. we will probably be using ivy to connect to maven repositories. * To use Zookeeper from within OSGi it has to be turned into an OSGi bundle. Doing this is not hard and it's currently done in our buildsystem [1]. However, I think it would make sense to have this done somewhere in the Zookeeper buildsystem. Matter of fact I think you should be able to release a single zookeeper.jar that's both an ordinary jar and an OSGi bundle so it would work in both cases... i completely agree. please open a jira and submit a patch. * The Zookeeper server is currently started with the zkServer.sh script, but I think it would make sense to also allow it to run inside an OSGi container, simply by starting a bundle. Has anyone ever done any work in this regard? If not I'm planning to spend some time and try to make this work. we have a current open jira about making it possible to embed the zookeeper server in other applications. the big problem is the System.exits that we have sprinkled around. it shouldn't be hard to make happen since we start and stop the server in our unit tests. * BTW I made some Windows versions of the zkCli/zkEnv/zkServer scripts. Interested in taking these? excellent. please submit a jira and patch! i'm so glad you are working on this. 
i've been thinking for a long time that ZooKeeper would fit really well with OSGi, but i haven't had time to work on it. thank you! ben
RE: NodeChildrenChanged WatchedEvent
good summary ted. just to add a bit. another motivation for the current design is what scott had mentioned earlier: not sending a flood of changes when the value of a node is changing rapidly. implicit in this is the fact that we do not send the value in the events. not only would that make the events much more heavyweight, but it also leads to bad programming practices (see the faq). since we don't send data in the events, sending 3 data changed events in a row is the same as just sending the last data changed event. i also agree with ted about the wrappers. unless they are used to implement a new construct, usually they just introduce bugs. however, there are two things i want to point out. first, the current exception handling ranges from a pain to, in the case of create() with SEQUENTIAL and EPHEMERAL, almost impossible, so we want to make connection recovery a bit more sophisticated: when a connection goes down, the client and server figure out what happened to the pending requests so that we never need to error them out with the "i have no idea what happened" exception, aka CONNECTION LOSS. second, higher level constructs in the form of recipes are great! for more sophisticated constructs it is great to have things implemented once and thoroughly debugged. ben ps - one other clarification: in ZK 3, the watches are still tracked locally. it's just that in ZK 3 the client now has the ability to tell the server what it was watching and what was the last thing seen when it reconnects. the server can then figure out which watches were missed and need to be retriggered and which watches need to be reregistered __ From: Ted Dunning [ted.dunn...@gmail.com] Sent: Saturday, May 09, 2009 1:06 PM To: zookeeper-user@hadoop.apache.org Subject: Re: NodeChildrenChanged WatchedEvent Making things better is always good. I have found that in practice, most wrappers of ZK lead to serious errors and should be avoided like the plague. 
This particular use case is not a big deal for me to code correctly (in Java, anyway) and I do it all the time. It may be that the no-persistent-watch policy was partly an artifact of the ZK 1 and ZK 2 situation, where ZK avoided keeping much of anything around per session other than ephemeral files. This has changed in ZK 3 and it might be plausible to have more persistent watches. On the other hand, I believe that Ben purposely avoided having this type of watch to automatically throttle the number of notifications to be equal to the rate at which the listener can handle them. Having seen a number of systems that didn't throttle this way up close and personal, I have lots of empathy with that position. Since I don't have any issue with looking for changes, I would tend to just go with whatever Ben suggests. His opinions (largely based on watching people code with ZK) are pretty danged good. On Sat, May 9, 2009 at 12:37 PM, Scott Carey sc...@richrelevance.comwrote: What I am suggesting are higher level constructs that do these repeated mundane tasks for you to handle those use cases where the verbosity of the API is a hindrance to quality and productivity. -- Ted Dunning, CTO DeepDyve 111 West Evelyn Ave. Ste. 202 Sunnyvale, CA 94086 www.deepdyve.com 858-414-0013 (m) 408-773-0220 (fax)
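The one-shot, data-free watch semantics this thread describes can be sketched with a toy model (this is a simulation for illustration, not the ZooKeeper API): a watch fires at most once and carries no payload, so a burst of writes yields a single notification and the client must read again (and re-register) to see the latest value.

```python
# Toy model of ZooKeeper's one-shot watch semantics: registration is
# one-shot and events carry no data, so rapid successive writes produce
# a single notification rather than a flood.
class ToyNode:
    def __init__(self):
        self.value = None
        self._watchers = []

    def get_data(self, watcher=None):
        if watcher is not None:
            self._watchers.append(watcher)  # one-shot registration
        return self.value

    def set_data(self, value):
        self.value = value
        fired, self._watchers = self._watchers, []  # watches fire once
        for w in fired:
            w("data_changed")  # note: no payload in the event
```

Three rapid set_data calls fire a registered watcher exactly once, which is why, as Ben says, sending three data-changed events in a row would be equivalent to sending just the last one.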
RE: Moving ZooKeeper Servers
yes, /zookeeper is part of the reserved namespace for zookeeper internals. you should ignore it for such things. ben From: Satish Bhatti [cthd2...@gmail.com] Sent: Wednesday, May 06, 2009 2:57 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Moving ZooKeeper Servers I ended up going with that suggestion, a short recursive function did the trick! However, I noticed the following nodes: /zookeeper /zookeeper/quota that were not created by me. So I ignored them. Is this correct? Satish On Mon, May 4, 2009 at 4:33 PM, Ted Dunning ted.dunn...@gmail.com wrote: In fact, the much, much simpler approach of bringing up the production ZK cluster and simply writing a program to read from the pre-production cluster and write to the production one is much more sound. If you can't do that, you may need to rethink your processes since they are likely to be delicate for other reasons as well. On Mon, May 4, 2009 at 2:35 PM, Mahadev Konar maha...@yahoo-inc.com wrote: So, zookeeper would work fine if you are careful with above but I would vote against doing this for production since the above is pretty easy to mess up. -- Ted Dunning, CTO DeepDyve 111 West Evelyn Ave. Ste. 202 Sunnyvale, CA 94086 www.deepdyve.com 858-414-0013 (m) 408-773-0220 (fax)
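The short recursive function Satish mentions can be sketched as below. This is only an illustration: the Tree interface and MapTree class are hypothetical stand-ins for ZooKeeper's getChildren/getData/create calls, so the walk and the skip of the reserved /zookeeper subtree can be shown self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TreeCopy {
    // Hypothetical stand-in for a ZooKeeper handle (getChildren/getData/create).
    interface Tree {
        List<String> children(String path);
        String data(String path);
        void create(String path, String data);
    }

    // Copy everything under 'path' from src to dst, skipping the reserved
    // /zookeeper subtree (internal nodes such as /zookeeper/quota).
    static void copy(Tree src, Tree dst, String path) {
        if (path.equals("/zookeeper")) return;
        if (!path.equals("/")) dst.create(path, src.data(path));
        for (String child : src.children(path)) {
            String childPath = path.equals("/") ? "/" + child : path + "/" + child;
            copy(src, dst, childPath);
        }
    }

    // Minimal in-memory tree keyed by absolute path, for running the sketch.
    static class MapTree implements Tree {
        final Map<String, String> nodes = new LinkedHashMap<>();
        public List<String> children(String path) {
            String prefix = path.equals("/") ? "/" : path + "/";
            List<String> out = new ArrayList<>();
            for (String p : nodes.keySet()) {
                // a child has the prefix and no further '/' after it
                if (p.startsWith(prefix) && p.indexOf('/', prefix.length()) < 0) {
                    out.add(p.substring(prefix.length()));
                }
            }
            return out;
        }
        public String data(String path) { return nodes.get(path); }
        public void create(String path, String data) { nodes.put(path, data); }
    }
}
```

In real use the two Tree instances would wrap live handles to the pre-production and production ensembles, as in Ted's suggestion.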
Re: Unique Id Generation
i'm not exactly clear how you use these ideas, but one source of unique ids that are longs is the zxid. if you create a znode, every time you write to it, you will get a unique zxid in the mzxid member of the stat structure. (you get the stat structure back in the response to the setData.) ben Mahadev Konar wrote: Hi Satish, Most of the sequences (versions of nodes) and the sequence flags are ints. We do have plans to move it to long. But in your case I can imagine you can split a long into two 32-bit parts: Parent (which is an int) and child (which is an int). Now after you run out of child ephemerals you create a node Parent + 1, remove Parent, and then start creating ephemeral children again (so parent (32 bits) and child (32 bits) would form a long). I don't think this should be very hard to implement. There is nothing in zookeeper (out of the box) currently that would help you out. Mahadev On 4/23/09 4:52 PM, Satish Bhatti cthd2...@gmail.com wrote: We currently use a database sequence to generate unique ids for use by our application. I was thinking about using ZooKeeper instead so I can get rid of the database. My plan was to use the sequential id from ephemeral nodes, but looking at the code it appears that this is an int, not a long. Is there any other straightforward way to generate ids using ZooKeeper? Thanks, Satish
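Mahadev's parent/child scheme above is plain bit packing; a minimal sketch (the class and method names are illustrative, not part of any ZooKeeper API):

```java
public class UniqueIds {
    // Pack a 32-bit "parent" epoch and a 32-bit "child" sequence into one long.
    static long combine(int parent, int child) {
        // mask the child so negative ints don't smear into the high word
        return ((long) parent << 32) | (child & 0xFFFFFFFFL);
    }

    static int parentOf(long id) { return (int) (id >>> 32); }
    static int childOf(long id)  { return (int) id; }
}
```

When the child counter wraps, the application bumps the parent node and starts the child sequence over, exactly as described in the message above.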
RE: Semantics of ConnectionLoss exception
it is possible for the time to pass without the session expiring. Imagine a session timeout of 15 seconds. there is a correlated power outage affecting the zookeeper servers. let's say it takes 5 minutes to recover power and reboot. when the service recovers, it resets expiration times, so when the servers start back up and the client reconnects (assuming it is retrying every few seconds), the session will be recovered and everything will proceed as normal. if the client library generated a session expired event on its own, the client could connect with a new session after the service recovers and see its own ephemeral nodes for 15 seconds. ben From: Nitay [nit...@gmail.com] Sent: Thursday, March 26, 2009 12:09 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Semantics of ConnectionLoss exception Why is it done that way? How am I supposed to reliably detect that my ephemeral nodes are gone? Why not deliver the Session Expired event on the client side after the right time has passed without communication to any server? On Thu, Mar 26, 2009 at 10:58 AM, Mahadev Konar maha...@yahoo-inc.comwrote: Isn't it the case that the client won't get session expired until it's able to connect to a server? So what might happen is that the client loses connection to the server, the server eventually expires the client and deletes ephemerals (notifying all watchers) but the client won't see the session expiration until it is able to reconnect to one of the servers. ie the client doesn't know it's been expired until it's able to reconnect to the cluster, at which point it's notified that it's been expired. You are right pat! mahadev http://hadoop.apache.org/zookeeper/docs/r3.0.1/zookeeperProgrammers.html Has this information scattered around, but we should put it in the FAQ specifically. 3.0.1 is a bit old, try this for the latest docs: http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html - Is the ZooKeeper handle I'm using dead after this event? Again no. 
your handle is valid until you get a session expiry event or you do a zoo_close on your handle. Thanks mahadev On 3/25/09 5:42 PM, Nitay nit...@gmail.com wrote: I'm a little unclear about the ConnectionLoss exception as it's described in the FAQ and would like some clarification. From the state diagram, http://wiki.apache.org/hadoop/ZooKeeper/FAQ#1, there are three events that cause a ConnectionLoss: 1) In Connecting state, call close(). 2) In Connected state, call close(). 3) In Connected state, get disconnected. It's the third one I'm unclear about. - Does this event happening mean my ephemeral nodes will go away? - Is the ZooKeeper handle I'm using dead after this event? Meaning that, similar to the SessionExpired case, I need to construct a new connection handle to ZooKeeper and take care of the restarting myself. It seems from the diagram that this should not be the case. Rather, seeing as the disconnected event sends the user back to the Connecting state, my handle should be fine and the library will keep trying to reconnect to ZooKeeper internally? I understand my current operation may have failed, what I'm asking about is future operations. Thanks, -n
Re: How large an ensemble can one build with Zookeeper?
I realize this discussion is over, but i did want to make one quick clarification. when we talk about ensembles, we are talking about the servers that make up the zookeeper service. we refer to the servers that use the zookeeper service as clients. we have systems here that use ensembles of five servers to provide zookeeper service to thousands of client servers without problem. ben Chad Harrington wrote: Clearly Zookeeper can handle ensembles of a dozen or so servers. How large an ensemble can one build with Zookeeper? 100 servers? 10,000 servers? Are there limitations that make the system unusable at large numbers of servers? Thanks,
Re: Contrib section (nee Re: A modest proposal for simplifying zookeeper :)
i'm ready to reevaluate it. i did the contrib for fatjar and it was extremely painful! (and that was an extremely simple contrib!) we really want to ramp up the contribs and get a bunch of recipe implementations in, so we need something that makes it really easy. i'm not a fan of maven (they seem to have chosen a convention that is convenient for the build tool rather than the developer), but it is widely used and we need something better, so i'm certainly considering it. ben Anthony Urso wrote: Speaking of the contrib section, what is the status of ZOOKEEPER-103? Is it ready to be reevaluated now that 3.0 is out? Cheers, Anthony On Fri, Jan 9, 2009 at 11:58 AM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Kevin, It would be great to have such high level interfaces. It could be something that you could contribute :) . We haven't had the bandwidth to provide such interfaces for zookeeper. It would be great to have all such recipes as a part of the contrib package of zookeeper. mahadev On 1/9/09 11:44 AM, Kevin Burton bur...@spinn3r.com wrote: OK so it sounds from the group that there are still reasons to provide rope in ZK to enable algorithms like leader election. Couldn't ZK ship higher level interfaces for leader election, mutexes, semaphores, queues, barriers, etc instead of pushing this on developers? Then the remaining APIs, configuration, event notification, and discovery, can be used on a simpler, rope free API. The rope is what's killing me now :) Kevin
Re: Contrib section (nee Re: A modest proposal for simplifying zookeeper :)
just to be clear: i'm not a maven fan, but i'm not sure anything else is better. buildr looks better flexibility wise, but i think maven is much more popular and mature. with ivy we are still stuck with ant build files. ben Patrick Hunt wrote: Ben, you might want to look at buildr, it recently graduated from the apache incubator: http://buildr.apache.org/ Buildr is a build system for Java applications. We wanted something that’s simple and intuitive to use, so we only need to tell it what to do, and it takes care of the rest. But also something we can easily extend for those one-off tasks, with a language that’s a joy to use. And of course, we wanted it to be fast, reliable and have outstanding dependency management. Also Ivy just released version 2.0. If you have a specific idea and would like to start working on this please create a JIRA to discuss/track/vote/etc... Be aware that the contribution process, release process and other documentation would have to be updated as part of this. For example if we want to push jars to an artifact repo the artifacts/pom/etc... would have to be voted on as part of the release process. Patrick Benjamin Reed wrote: i'm ready to reevaluate it. i did the contrib for fatjar and it was extremely painful! (and that was an extremely simple contrib!) we really want to ramp up the contribs and get a bunch of recipe implementations in, so we need something that makes it really easy. i'm not a fan of maven (they seem to have chosen a convention that is convenient for the build tool rather the developer), but it is widely used and i we need something better, so i'm certainly considering it. ben Anthony Urso wrote: Speaking of the contrib section, what is the status of ZOOKEEPER-103? Is it ready to be reevaluated now that 3.0 is out? Cheers, Anthony On Fri, Jan 9, 2009 at 11:58 AM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Kevin, It would be great to have such high level interfaces. It could be something that you could contribute :) . 
We haven't had the bandwidth to provide such interfaces for zookeeper. It would be great to have all such recipes as a part of the contrib package of zookeeper. mahadev On 1/9/09 11:44 AM, Kevin Burton bur...@spinn3r.com wrote: OK so it sounds from the group that there are still reasons to provide rope in ZK to enable algorithms like leader election. Couldn't ZK ship higher level interfaces for leader election, mutexes, semaphores, queues, barriers, etc instead of pushing this on developers? Then the remaining APIs, configuration, event notification, and discovery, can be used on a simpler, rope free API. The rope is what's killing me now :) Kevin
RE: Recommended session timeout
just a quick sanity check. are you sure your memory is not overcommitted? in other words you aren't swapping. since the gc does a bunch of random memory accesses, if you swap at all things will go very slowly. ben From: Joey Echeverria [joe...@gmail.com] Sent: Thursday, February 26, 2009 1:31 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Recommended session timeout I've answered the questions you asked previously below, but I thought I would open with the actual culprit now that we found it. When I said loading data before, what I was talking about was sending data via Thrift to the machine that was getting disconnected from zookeeper. This turned out to be the problem. Too much data was being sent in a short span of time and this caused memory pressure on the heap. This increased the fraction of the time that the GC had to run to keep up. During a 143 second test, the GC was running for 33 seconds. We found this by running tcpdump on both the machine running the ensemble server and the machine connecting to zookeeper as a client. We deduced it wasn't a network (lost packet) issue, as we never saw unmatched packets in our tests. What we did see were long 2-7 second pauses with no packets being sent. We first attempted to up the priority of the zookeeper threads to see if that would help. When it didn't, we started monitoring the GC time. We don't have a workaround yet, other than sending data in smaller batches and using a longer sessionTimeout. Thanks for all your help! -Joey As an experiment try increasing the timeout to say 30 seconds and re-run your tests. Any change? 30 seconds and higher works fine. loading data - could you explain a bit more about what you mean by this? 
If you are able to provide enough information for us to replicate we could try it out (also provide info on your ensemble configuration as Mahadev suggested) The ensemble config file looks as follows:

tickTime=2000
dataDir=/data/zk
clientPort=2181
initLimit=5
syncLimit=2
skipACL=true
server.1=server1:2888:3888
...
server.7=server7:2888:3888

You are referring to startConnect in SendThread? We randomly sleep up to 1 second to ensure that the clients don't all storm the server(s) after a bounce. That makes some sense, but it might be worth tweaking that parameter based on sessionTimeout since 1 second can easily be 10-20% of sessionTimeout.

1) configure your test client to connect to 1 server in the ensemble
2) run the srst command on that server
3) run your client test
4) run the stat command on that server
5) if the test takes some time, run the stat a few times during the test to get more data points

The problem doesn't appear to be on the server end as max latency never went above 5ms. Also, no messages are shown as queued.
RE: Dealing with session expired
idleness is not a problem. the client library sends heartbeats to keep the session alive. the client library will also handle reconnects automatically if a server dies. session expiration really is a rare catastrophic event (or at least it should be), so it is probably easiest to deal with it by starting with a fresh instance if your session expires. ben From: Tom Nichols [tmnich...@gmail.com] Sent: Thursday, February 12, 2009 11:53 AM To: zookeeper-user@hadoop.apache.org Subject: Re: Dealing with session expired I'm using a timeout of 5000ms. Now let me ask this: Suppose all of my clients are waiting on some external event -- not ZooKeeper -- so they are all idle and are not touching ZK nodes, nor are they calling exists, getChildren, etc etc. Can that idleness cause session expiry? I'm running a local quorum of 3 nodes. That is, I have an Ant script that kicks off 3 java tasks in parallel to run ConsumerPeerMain, each with its own config file. Regarding handling of the failure, I suspect I will just have to reinitialize by creating a new instance of my client(s) that themselves will have a new ZK instance. I'm using Spring to wire everything together, which is why it's particularly difficult to simply re-create a new ZK instance and pass it to the classes using it (those classes have no knowledge of each other). But I _can_ just pull a freshly-created (prototype) instance from the Spring application context, which is where a new ZK client will be wired in. The only ramification there is I have to throw the KeeperException as a fatal exception rather than letting that client try to re-elect. Or maybe add in some logic to say if I can't re-elect, _then_ throw an exception and consider it fatal. Thanks guys. -Tom On Thu, Feb 12, 2009 at 2:39 PM, Patrick Hunt ph...@apache.org wrote: Regardless of frequency Tom's code still has to handle this situation. 
I would suggest that the two classes Tom is referring to in his mail, the ones that use the ZK client object, should either be able to reinitialize with a new zk session, or they themselves should be discarded and new instances created using the new session (not sure what makes more sense for his archi...) Regardless of whether we reuse the session object or create a new one I believe the code using the session needs to reinitialize in some way -- there's been a dramatic break from the cluster. As I mentioned, you can decrease the likelihood of expiration by increasing the timeout - but the downside is that you are less sensitive to clients dying (because their ephemeral nodes don't get deleted till close/expire and if you are doing something like leader election among your clients it will take longer for the followers to be notified). Patrick Mahadev Konar wrote: Hi Tom, The session expired event means that the server expired the client and that means the watches and ephemerals will go away for that node. How are you running your zookeeper quorum? A session expiry event should be a really rare event. If you have a quorum of servers it should rarely happen. mahadev On 2/12/09 11:17 AM, Tom Nichols tmnich...@gmail.com wrote: So if a session expires, my ephemeral nodes and watches have already disappeared? I suppose creating a new ZK instance with the old session ID would not do me any good in that case. Correct? Thanks. -Tom On Thu, Feb 12, 2009 at 2:12 PM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Tom, We prefer to discard the zookeeper instance if a session expires. Maintaining a one to one relationship between a client handle and a session makes it much simpler for users to understand the existence and disappearance of ephemeral nodes and watches created by a zookeeper client. 
thanks mahadev On 2/12/09 10:58 AM, Tom Nichols tmnich...@gmail.com wrote: I've come across the situation where a ZK instance will have an expired connection and therefore all operations fail. Now AFAIK the only way to recover is to create a new ZK instance with the old session ID, correct? Now, my problem is, the ZK instance may be shared -- not between threads -- but maybe two classes in the same thread synchronize on different nodes by using different watchers. So it makes sense that one ZK client instance can handle this. Except that even if I detect the session expiration by catching the KeeperException, if I want to resume the session, I have to create a new ZK instance and pass it to any classes who were previously sharing the same instance. Does this make sense so far? Anyway, bottom line is, it would be nice if a ZK instance could itself recover a session rather than discarding that instance and creating a new one. Thoughts? Thanks in advance, -Tom
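The discard-and-recreate pattern Mahadev and Patrick describe can be sketched as a holder shared between classes: nobody caches the raw handle, and the holder swaps in a fresh one when the old session is reported expired. Handle and its expired() check below are hypothetical stand-ins for a real ZooKeeper client object, not an actual API.

```java
import java.util.function.Supplier;

public class SessionHolder {
    // Hypothetical stand-in for a ZooKeeper client handle.
    interface Handle { boolean expired(); }

    private final Supplier<Handle> factory;
    private Handle current;

    SessionHolder(Supplier<Handle> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    // All sharers call get() on every use instead of caching the handle;
    // an expired session means the old handle is gone for good.
    synchronized Handle get() {
        if (current.expired()) {
            current = factory.get(); // discard and start fresh
        }
        return current;
    }
}
```

Classes that previously shared a raw ZK instance would instead share the holder, so none of them needs to know about the others when a re-create happens.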
RE: Delaying 3.1 release by 2 to 3 weeks?
we should delay. it would be good to try out quotas for a bit before we do the release. quotas are also a key part of the release. 3 weeks seems a little long though. ben From: Mahadev Konar [maha...@yahoo-inc.com] Sent: Thursday, January 15, 2009 4:32 PM To: zookeeper-...@hadoop.apache.org Cc: zookeeper-user@hadoop.apache.org Subject: Re: Delaying 3.1 release by 2 to 3 weeks? That was release 3.1 and not 3.2 :) mahadev On 1/15/09 4:26 PM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi all, I needed to get quotas in zookeeper 3.2.0 and wanted to see if delaying the release by 2-3 weeks is ok with everyone? Here is the jira for it - http://issues.apache.org/jira/browse/ZOOKEEPER-231 Please respond if you have any issues with the delay. thanks mahadev
RE: Distributed queue: how to ensure no lost items?
That is a good point. you could put a child znode of queue-X that contains the processing history. Like who tried to process and what time they started. ben From: Hiram Chirino [chir...@gmail.com] Sent: Monday, January 12, 2009 8:48 AM To: zookeeper-user@hadoop.apache.org Subject: Re: Distributed queue: how to ensure no lost items? At least once is generally the case in queuing systems unless you can do a distributed transaction with your consumer. What comes in handy in an at least once case is letting the consumer know that a message may have 'potentially' already been processed. That way he can double check first before he goes off and processes the message again. But adding that info in ZK might be more expensive than doing the double check every time in the consumer anyways. On Thu, Jan 8, 2009 at 11:42 AM, Benjamin Reed br...@yahoo-inc.com wrote: We should expand that section. the current queue recipe guarantees that things are consumed at most once. to guarantee at least once, the consumer creates an ephemeral node queue-X-inprocess to indicate that the node is being processed. once the queue element has been processed the consumer deletes queue-X and queue-X-inprocess (in that order). using an ephemeral node means that if a consumer crashes, the *-inprocess node will be deleted allowing the queue elements it was working on to be consumed by someone else. putting the *-inprocess nodes at the same level as the queue-X nodes allows the consumer to get the list of queue elements and the inprocess flags with the same getChildren call. the *-inprocess flag ensures that only one consumer is processing a given item. by deleting queue-X before queue-X-inprocess we make sure that no other consumer will see queue-X as available for consumption after it is processed and before it is deleted. this is at least once, because the consumer has a race condition: the consumer may process the item and then crash before it can delete the corresponding queue-X node. 
ben -Original Message- From: Stuart White [mailto:stuart.whi...@gmail.com] Sent: Thursday, January 08, 2009 7:15 AM To: zookeeper-user@hadoop.apache.org Subject: Distributed queue: how to ensure no lost items? I'm interested in using ZooKeeper to provide a distributed producer/consumer queue for my distributed application. Of course I've been studying the recipes provided for queues, barriers, etc... My question is: how can I prevent packets of work from being lost if a process crashes? For example, following the distributed queue recipe, when a consumer takes an item from the queue, it removes the first item znode under the queue znode. But, if the consumer immediately crashes after removing the item from the queue, that item is lost. Is there a recipe or recommended approach to ensure that no queue items are lost in the event of process failure? Thanks! -- Regards, Hiram Blog: http://hiramchirino.com Open Source SOA http://open.iona.com
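As a rough illustration of the -inprocess protocol Ben describes, the sketch below models znodes as a sorted in-memory set so the claim-then-delete ordering can be exercised without an ensemble. In real use the -inprocess node would be created EPHEMERAL so a crashed consumer's claim vanishes with its session; the names here mirror the recipe but the storage is purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class QueueConsume {
    static final String INPROCESS = "-inprocess";

    // Returns the name of the item consumed, or null if nothing was available.
    static String consumeOne(Set<String> znodes) {
        // one getChildren call yields both the items and the inprocess flags
        List<String> children = new ArrayList<>(znodes);
        for (String child : children) {
            if (child.endsWith(INPROCESS)) continue;             // a flag, not an item
            if (znodes.contains(child + INPROCESS)) continue;    // claimed by another consumer
            znodes.add(child + INPROCESS);    // claim it (ephemeral in real ZK)
            process(child);
            znodes.remove(child);             // delete queue-X first...
            znodes.remove(child + INPROCESS); // ...then the claim, in that order
            return child;
        }
        return null;
    }

    static void process(String item) { /* application work happens here */ }
}
```

Deleting queue-X before the claim preserves Ben's guarantee: no other consumer can see queue-X as available between processing and deletion.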
RE: Updated NodeWatcher...
I'm really bad at creating figures, but i've put up something that should be informative. (i'm also really bad at apache wiki.) hopefully someone can make it more beautiful. i've added the state diagram to the FAQ: http://wiki.apache.org/hadoop/ZooKeeper/FAQ ben -Original Message- From: adam.ros...@gmail.com [mailto:adam.ros...@gmail.com] On Behalf Of Adam Rosien Sent: Thursday, January 08, 2009 8:06 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Updated NodeWatcher... It feels like we need a flowchart, state-chart, or something, so we can all talk about the same thing. Then people could suggest abstractions that would essentially put a box around sections of the diagram. However I feel woefully inadequate at the former :(. .. Adam On Thu, Jan 8, 2009 at 4:20 PM, Benjamin Reed br...@yahoo-inc.com wrote: For your first issue: if an ensemble goes offline and comes back, everything should be fine. it will look to the client just like a server went down. if a session expires, you are correct that the client will not reconnect. this again is on purpose. for the node watcher the session is unimportant, but if the ZooKeeper object is also being used for leader election, for example, you do not want the object to grab a new session automatically. For 2) i think pat responded to that one. an async request will always return. if the server goes down after the request is issued, you will get a connection loss error in your callback. Your third issue is described with the first. ben -Original Message- From: burtona...@gmail.com [mailto:burtona...@gmail.com] On Behalf Of Kevin Burton Sent: Thursday, January 08, 2009 4:02 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Updated NodeWatcher... i just found that part of this thread went to my junk folder. can you send the URL for the NodeListener? Sure... here you go: http://pastebin.com/f1e9d3706 this NodeWatcher is a useful thing. 
i have a couple of suggestions to simplify it: 1) Construct the NodeWatcher with a ZooKeeper object rather than constructing one. Not only does it simplify NodeWatcher, but it also makes it so that the ZooKeeper object can be used for other things as well. I hear you. I was thinking that this might not be a good idea because NodeWatcher can reconnect you to the ensemble if it goes offline. I'm not sure if it's a bug or not but once my session expired on the client it wouldn't reconnect so I just implemented my own reconnect and session expiry. 2) Use the async API in watchNodeData and watchNodeExists. it simplifies the code and the error handling. The problem was that according to feedback here an async request might never return if the server dies shortly after the request and before it has a chance to respond. I wanted NodeWatcher to hide as much rope as possible. 3) You don't need to do a connect() in handleDisconnected(). ZooKeeper object will do it automatically for you. I can try again if you'd like but this isn't my experience. Once the session expired and the whole ensemble was offline it wouldn't connect again. If it was a transient disconnect I'd see a disconnect event and then a quick reconnect. If it was a long disconnect (with nothing to attach to) then ZK won't ever reconnect me. I'd like this to be the behavior though... There is an old example on sourceforge http://zookeeper.wiki.sourceforge.net/ZooKeeperJavaExample that may give you some more ideas on how to simplify your code. That would be nice, simple is good! Kevin -- Founder/CEO Spinn3r.com Location: San Francisco, CA AIM/YIM: sfburtonator Skype: burtonator Work: http://spinn3r.com
RE: Updated NodeWatcher...
yeah, i was thinking it should be in forrest, but i couldn't figure out where to put it. that is why i didn't close the issue. ben -Original Message- From: Patrick Hunt [mailto:ph...@apache.org] Sent: Friday, January 09, 2009 9:37 AM To: zookeeper-user@hadoop.apache.org Subject: Re: Updated NodeWatcher... Ben this is great, thanks! Do you want to close out this one and point to the faq? https://issues.apache.org/jira/browse/ZOOKEEPER-264 Although IMO this should be moved to the forrest docs. Patrick Benjamin Reed wrote: I'm really bad at creating figures, but i've put up something that should be informative. (i'm also really bad at apache wiki.) hopefully someone can make it more beautiful. i've added the state diagram to the FAQ: http://wiki.apache.org/hadoop/ZooKeeper/FAQ ben -Original Message- From: adam.ros...@gmail.com [mailto:adam.ros...@gmail.com] On Behalf Of Adam Rosien Sent: Thursday, January 08, 2009 8:06 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Updated NodeWatcher... It feels like we need a flowchart, state-chart, or something, so we can all talk about the same thing. Then people could suggest abstractions that would essentially put a box around sections of the diagram. However I feel woefully inadequate at the former :(. .. Adam On Thu, Jan 8, 2009 at 4:20 PM, Benjamin Reed br...@yahoo-inc.com wrote: For your first issue: if an ensemble goes offline and comes back, everything should be fine. it will look to the client just like a server went down. if a session expires, you are correct that the client will not reconnect. this again is on purpose. for the node watcher the session is unimportant, but if the ZooKeeper object is also being used for leader election, for example, you do not want the object to grab a new session automatically. For 2) i think pat responded to that one. an async request will always return. if the server goes down after the request is issued, you will get a connection loss error in your callback. 
Your third issue is described with the first. ben -Original Message- From: burtona...@gmail.com [mailto:burtona...@gmail.com] On Behalf Of Kevin Burton Sent: Thursday, January 08, 2009 4:02 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Updated NodeWatcher... i just found that part of this thread went to my junk folder. can you send the URL for the NodeListener? Sure... here you go: http://pastebin.com/f1e9d3706 this NodeWatcher is a useful thing. i have a couple of suggestions to simplify it: 1) Construct the NodeWatcher with a ZooKeeper object rather than constructing one. Not only does it simplify NodeWatcher, but it also makes it so that the ZooKeeper object can be used for other things as well. I hear you. I was thinking that this might not be a good idea because NodeWatcher can reconnect you to the ensemble if it goes offline. I'm not sure if it's a bug or not but once my session expired on the client it wouldn't reconnect so I just implemented my own reconnect and session expiry. 2) Use the async API in watchNodeData and watchNodeExists. it simplifies the code and the error handling. The problem was that according to feedback here an async request might never return if the server dies shortly after the request and before it has a chance to respond. I wanted NodeWatcher to hide as much rope as possible. 3) You don't need to do a connect() in handleDisconnected(). ZooKeeper object will do it automatically for you. I can try again if you'd like but this isn't my experience. Once the session expired and the whole ensemble was offline it wouldn't connect again. If it was a transient disconnect I'd see a disconnect event and then a quick reconnect. If it was a long disconnect (with nothing to attach to) then ZK won't ever reconnect me. I'd like this to be the behavior though... There is an old example on sourceforge http://zookeeper.wiki.sourceforge.net/ZooKeeperJavaExample that may give you some more ideas on how to simplify your code. 
That would be nice simple is good! Kevin -- Founder/CEO Spinn3r.com Location: San Francisco, CA AIM/YIM: sfburtonator Skype: burtonator Work: http://spinn3r.com
RE: Can ConnectionLossException be thrown when using multiple hosts?
just to clarify: you also get ConnectionLossException from synchronous requests if the request cannot be sent or no response is received. ben -Original Message- From: Patrick Hunt [mailto:ph...@apache.org] Sent: Wednesday, January 07, 2009 10:16 AM To: zookeeper-user@hadoop.apache.org Subject: Re: Can ConnectionLossException be thrown when using multiple hosts? There are basically 2 cases where you can see connectionloss: 1) you call an operation on a session that is no longer alive 2) you are disconnected from a server when there are pending async operations to that server (you made an async request which has not yet completed) Patrick Kevin Burton wrote: Can this be thrown when using multiple servers as long as 1 of them is online? Trying to figure out if I should try some type of reconnect if a single machine fails instead of failing altogether. Kevin
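Since connection loss leaves the fate of the in-flight request unknown, the usual application-side response is to retry an idempotent operation while the client library reconnects underneath. A minimal sketch of that retry loop follows; ConnectionLoss here is a stand-in exception class, not the real KeeperException, and blind retry is only safe for idempotent operations.

```java
import java.util.concurrent.Callable;

public class RetryLoop {
    // Stand-in for the connection-loss error the real client reports.
    static class ConnectionLoss extends Exception {}

    // Run op, retrying only on connection loss, up to maxAttempts times.
    static <T> T withRetries(Callable<T> op, int maxAttempts) throws Exception {
        ConnectionLoss last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (ConnectionLoss e) {
                last = e; // connection dropped mid-request; result unknown, try again
            }
        }
        throw last; // still failing after maxAttempts
    }
}
```

Other errors (like session expiration) are deliberately not caught: those mean the session is dead and retrying the same handle cannot help.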
RE: Sending data during NodeDataChanged or NodeCreated
if you do a getData(/a, true) and then /a changes, you will get a watch event. if /a changes again, you will not get an event. so, if you want to monitor /a, you need to do a new getData() after each watch event to reregister the watch and get the new value. (re-registering watches on reconnect is a different issue. there are no disconnects in this example.) you are correct that zookeeper has some subtle things to watch out for. that is why we do not want to add more. ben -Original Message- From: burtona...@gmail.com [mailto:burtona...@gmail.com] On Behalf Of Kevin Burton Sent: Thursday, January 08, 2009 11:58 AM To: zookeeper-user@hadoop.apache.org Subject: Re: Sending data during NodeDataChanged or NodeCreated while the case in which a value only changes once can be made slightly more optimal by passing the value in the watch event, it is not worth the risk. in our experience we had an application that was able to make that assumption initially and then later when the assumption became invalid it was very hard to diagnose. I don't quite follow. In this scenario you would be sent two events, with two pieces of data. If ZK re-registers watches on reconnect, I don't see how it could be easier than this. we don't want to make zookeeper harder to use by introducing mechanisms that only work with subtle assumptions. I definitely think ZK has too much rope right now. It's far too easy to make mistakes and there are lots of subtle undocumented behaviors. Kevin -- Founder/CEO Spinn3r.com Location: San Francisco, CA AIM/YIM: sfburtonator Skype: burtonator Work: http://spinn3r.com
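The one-shot semantics Ben describes can be modeled with a toy node: setData fires each registered watcher once and then forgets it, so a watcher that wants to see every change must re-read and re-register from inside its own callback. OneShotNode below is purely illustrative, not the ZooKeeper API, but the fire-once behavior mirrors real watches.

```java
import java.util.ArrayList;
import java.util.List;

public class OneShotNode {
    interface Watcher { void process(OneShotNode node); }

    private String data = "";
    private final List<Watcher> watchers = new ArrayList<>();

    // like getData(path, true): returns the value and arms a one-shot watch
    String getData(Watcher w) {
        watchers.add(w);
        return data;
    }

    void setData(String value) {
        data = value;
        List<Watcher> armed = new ArrayList<>(watchers);
        watchers.clear(); // each watch fires at most once
        for (Watcher w : armed) w.process(this);
    }
}
```

A monitoring watcher then calls getData(this) inside process(), which both fetches the new value and re-arms the watch; forgetting the re-register silently misses every change after the first.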
RE: Simpler ZooKeeper event interface....
when you shut down the full ensemble the session isn't expired. when things come back up your session will still be active. (it would be bad if the zk service could not survive the bounce of an ensemble.) you are way overthinking this, and i fear you are not helping yourself by trying to second-guess with timers. zookeeper is structured such that it can be used as ground truth. trying to second-guess it will only bring you headaches. ben From: burtona...@gmail.com [burtona...@gmail.com] On Behalf Of Kevin Burton [bur...@spinn3r.com] Sent: Wednesday, January 07, 2009 3:36 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Simpler ZooKeeper event interface Here's a good reason for each client to know its session status (connected/disconnected/expired). Depending on the application, if L does not have a connected session to the ensemble it may need to be careful how it acts. connected/disconnected events are given out in the current API but when I shut down the full ensemble I don't receive a session expired. I'm considering implementing my own session expiration by tracking how long I've been disconnected. Kevin -- Founder/CEO Spinn3r.com Location: San Francisco, CA AIM/YIM: sfburtonator Skype: burtonator Work: http://spinn3r.com
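The distinction Ben is drawing — disconnection is transient, and only the ensemble (never a local timer) decides expiry — can be sketched as a small Python handler. This is purely illustrative, not the real client API; the event names mirror the Java client's KeeperState values:

```python
def handle_state_change(event, session):
    """React to session events without second-guessing the ensemble.

    Disconnected: transient; pause writes but keep the handle and any
    assumptions about ephemerals, since the session may still be alive.
    Expired: decided by the ensemble only; the handle is dead and all
    ephemerals are gone, so a new handle must be created.
    """
    if event == "Disconnected":
        session["usable"] = False            # wait; do not act on state
    elif event == "SyncConnected":
        session["usable"] = True             # same session, nothing lost
    elif event == "Expired":
        session["usable"] = False
        session["needs_new_handle"] = True   # recreate handle + ephemerals
    return session
```

A client-side "expiration timer" as Kevin proposes would fire during a full-ensemble bounce even though the session survives it, which is exactly the false positive Ben is warning against.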
RE: State of the command line?
The command line is a very simple utility for testing and as an example of how to use the API. these are good suggestions, you should document them in a Jira. ben From: burtona...@gmail.com [burtona...@gmail.com] On Behalf Of Kevin Burton [bur...@spinn3r.com] Sent: Saturday, January 03, 2009 7:54 PM To: zookeeper-user@hadoop.apache.org Subject: State of the command line? What's the state of the command line client (ZooKeeperMain) ... it seems a bit fragile. Some thoughts: you should probably ship it with a wrapper around rlwrap so that readline is supported. You can just detect it in a bash script and enable it if it's installed; should be able to get around the GPL this way... Also, it should have a %shell% prompt, one would think. ... ls should return more like ls -al. If a command fails it shouldn't print the full exception but should instead print just the error (probably to stderr). Though perhaps we won't actually use the shell in production so it doesn't matter as much. Kevin -- Founder/CEO Spinn3r.com Location: San Francisco, CA AIM/YIM: sfburtonator Skype: burtonator Work: http://spinn3r.com
RE: What happens when a server loses all its state?
Thomas, in the scenario you give you have two simultaneous failures with 3 nodes, so it will not recover correctly. A is failed because it is not up. B has failed because it lost all its data. it would be good for ZooKeeper to not come up in that scenario. perhaps what we need is something similar to your safe state proposal. basically a server that has forgotten everything should not be allowed to vote in the leader election. that would avoid your scenario. we just need to put a flag file in the data directory to say that the data is valid and thus can vote. ben From: thomas.john...@sun.com [thomas.john...@sun.com] Sent: Tuesday, December 16, 2008 4:02 PM To: zookeeper-user@hadoop.apache.org Subject: Re: What happens when a server loses all its state? Mahadev Konar wrote: Hi Thomas, More generally, is it a safe assumption to make that the ZooKeeper service will maintain all its guarantees if a minority of servers lose persistent state (due to bad disks, etc) and restart at some point in the future? Yes that is true. Great - thanks Mahadev. Not to drag this on more than necessary, please bear with me for one more example of 'amnesia' that comes to mind. I have a set of ZooKeeper servers A, B, C. - C is currently not running, A is the leader, B is the follower. - A proposes zxid1 to A and B, both acknowledge. - A asks A to commit (which it persists), but before the same commit request reaches B, all servers go down (say a power failure). - Later, B and C come up (A is slow to reboot), but B has lost all state due to disk failure. - C becomes the new leader and perhaps continues with some more new transactions. Likely I'm misunderstanding the protocol, but have I effectively lost zxid1 at this point? What would happen when A comes back up? Thanks.
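Ben's flag-file idea — a server that has lost its data directory must not vote in leader election — can be sketched in a few lines of Python. The flag file name and the helper functions here are hypothetical, not anything ZooKeeper actually ships:

```python
import os

DATA_OK_FLAG = "data_ok"  # hypothetical marker file written once data is valid

def can_vote(data_dir):
    """A server whose data directory lacks the flag file has 'amnesia'
    (lost its persistent state) and must not vote in leader election."""
    return os.path.exists(os.path.join(data_dir, DATA_OK_FLAG))

def has_quorum(server_data_dirs, ensemble_size):
    """Leader election needs a strict majority of *eligible* voters."""
    voters = [d for d in server_data_dirs if can_vote(d)]
    return len(voters) > ensemble_size // 2
```

In Thomas's scenario (ensemble of 3: A down, B amnesiac, C intact), only C can vote, so there is no quorum and the service refuses to come up — which is the safe outcome, since electing C with B's empty vote could silently drop the committed zxid1.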
RE: Read Write Performance Graphs
That graph is taken from a paper we will be publishing as a tech report. Here is the missing text: To show the behavior of the system over time as failures are injected we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark as before, but this time we kept the write percentage at a constant 30%, which is a conservative ratio of our expected workloads. -Original Message- From: Stu Hood [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 01, 2008 7:34 PM To: zookeeper-user@hadoop.apache.org Subject: Read Write Performance Graphs Question regarding the current docs: the Reliability in the Presence of Errors graph on http://hadoop.apache.org/zookeeper/docs/current/zookeeperOver.html does not list how many ZooKeeper quorum nodes are in use, or the fraction of reads/writes. I'd like to use the graphic in a presentation, but I'm sure someone will ask these questions. Thanks tons! Stu Hood Architecture Software Developer Mailtrust, a Division of Rackspace
RE: Re: ephemerals handling after restart
Yes Ben -Original Message- From: Johannes Zillmann [mailto:[EMAIL PROTECTED] Sent: Thursday, September 18, 2008 02:22 AM Pacific Standard Time To: zookeeper-user@hadoop.apache.org Subject: Re: ephemerals handling after restart Hi Ben, thanks for your answer! Is the session recoverable in case the zk server was restarted in the meantime? Johannes On Sep 12, 2008, at 3:52 PM, Benjamin Reed wrote: If an application does not close the ZooKeeper session before shutting down, ZooKeeper will not clean up the session until it times out. So when an application crashes and restarts, ZooKeeper doesn't know if the client is a restart of an old client or a new client. There is a way to alleviate this problem: you can actually maintain a session across client application restarts. If you save off the session id and password, when you restart you can try to reconnect to the session using the ZooKeeper constructor that takes the old session id and password. If the reconnect is successful you can then close the session and get everything to clean up immediately. (Or you could keep using the recovered session if you want to.) ben -Original Message- From: Johannes Zillmann [mailto:[EMAIL PROTECTED] Sent: Friday, September 12, 2008 2:49 AM To: zookeeper-user@hadoop.apache.org Subject: ephemerals handling after restart Hi all, i have a question regarding ephemerals and their behavior on client crash/restart. We have a master/node cluster similar to a hadoop hdfs cluster but using zk for management. The nodes create an ephemeral node to announce their existence to the master. Now what i noticed is that after stopping the whole cluster and starting only the master again, some ephemeral nodes may still exist for some seconds. That leads me to the following questions. What if a node starts up again? Does it have to clean up its old ephemeral node, or can it somehow acquire the old one?
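Ben's trick — save the session id and password, reconnect with them after a restart, then close to trigger immediate cleanup — can be modeled with a toy server-side session table in Python. `SessionTable` is a made-up illustration of the mechanism, not ZooKeeper's implementation (the real Java constructor takes the connect string, timeout, watcher, session id, and session password):

```python
import secrets

class SessionTable:
    """Toy model of server-side session state: reconnecting with a saved
    (session id, password) pair resumes the session, and close() removes
    its ephemerals immediately instead of waiting for the timeout."""
    def __init__(self):
        self._sessions = {}
        self._next_id = 0

    def connect(self):
        self._next_id += 1
        sid, pw = self._next_id, secrets.token_hex(8)
        self._sessions[sid] = {"password": pw, "ephemerals": set()}
        return sid, pw

    def reconnect(self, sid, pw):
        # Succeeds only while the session is still alive on the server.
        s = self._sessions.get(sid)
        return s is not None and s["password"] == pw

    def close(self, sid):
        # Ephemerals vanish with the session, with no timeout wait.
        self._sessions.pop(sid, None)
```

A restarting node would persist `(sid, pw)` to disk before any crash; on startup it reconnects, closes the recovered session so its stale ephemeral disappears at once, and then creates a fresh session and a fresh ephemeral — avoiding the "node2 disconnected / node2 connected again" flapping Johannes describes.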
Just trying to find the best way to deal with this, since on a regular restart of the cluster i often see something like this: master : node1, node2 connected master : node2 disconnected master : node2 connected again Thanks for any help Johannes ~~~ 101tec GmbH Halle (Saale), Saxony-Anhalt, Germany http://www.101tec.com